A new interaction model for real-time, on-display language translation on Meta's first smart glasses with in-lens display — patent filed, launched at Meta Connect 2025.
Lead Product Designer
(sole designer)
1 PM
2 engineers
2 content designers
1 UX researcher
Interaction Model team
Education team
Dec 2024 - Oct 2025
Visual Translation is a real-time written language translation feature on Meta's first display glasses. You can look at foreign text in the world — a menu, a street sign, a poster — and read the translation on the glasses' in-lens display directly. With the multimodal AI silent-entry strategy I co-authored, you don't even have to say "Hey Meta" out loud to trigger the visual translation experience.
・・・・・・・・・・・・
It's a feature with no precedent on this form factor. Our in-lens display is transparent: the user is looking through the lens at real-world text, and the translation has to land somewhere that doesn't compete with reality. And the model's understanding of where the text is doesn't always match what the user is looking at.
・・・・・・・・・・・・
I was the sole designer on Visual Translation from December 2024 through its Day 0 launch on Meta's first display glasses and its post-launch refinement. I designed the interaction model from scratch with engineer Jiaqian Wu, and we refined it together so that real-world text and its translation are easy to take in. The feature shipped as one of the multimodal AI launch experiences on the display glasses, was announced in the Meta Connect 2025 keynote, and has a patent filed for its interaction model.
Visual Translation was scoped around two Day 0 scenarios:
As sole designer, I owned the interaction model, on-display rendering & navigation, gesture vocabulary, internationalization redesign, and the design POV for silent multimodal AI trigger. I partnered with the Interaction Model and Education teams to shape and define the platform-wide gesture patterns and contextual gesture tutorials.
When I joined the project, three problems were compounding:
On its own, each problem looked fixable. Together, they added up to a trust problem: if the user can't discover, navigate, or read the translation, the AI feels broken regardless of whether the model is right.
I anchored the work on three principles drawn directly from how users described what they needed when translating on the go:
These principles funneled into a clean framing:
Optimizing for legibility of short texts and understanding of long texts.
Two scenarios, two different priorities, one consistent interaction language.
From there, I broke the experience into three questions that guided design decisions:



The biggest design decision was the gesture vocabulary. The existing model (Captouch zoom + Head IMU pan) created too much friction for the user and too much instability in the visual experience. I proposed a new set of interactions designed to work together and complement each other:


Reducing manual manipulation and increasing automation through smart auto-zoom snap selection was the answer, both visually for the user and technically for the system. Mostly, at least: we still support free panning and zooming, even though they are more manual, because nothing is worse than a user feeling stuck in the experience.

Engineering needed many detailed considerations and guidelines in place to implement a seamless smart auto-zoom snap experience, from defining the maximum zoom level to requesting a suitable image resolution. Some of the tradeoffs were real, including the tension between legibility and latency, and how transition animation could mask that latency. Significant iteration also went into having the system read documents like a human, so that when the user swipes, the next selection lands where the user intends and expects.
However, text blocks are constrained by OCR (optical character recognition) grouping, and combined with the 600x600 px display, that makes them poorly suited to long-form reading. So I designed a way for users to read full paragraphs on the in-lens display when they need to.
I redesigned and iterated on this launch experience based on executive, UXR, internal employee and design system feedback, refining the interaction logic in lock-step with Jiaqian.
This proposal didn't just live inside Visual Translation. Working with the Interaction Model team — in collaboration with Alex Gerrese — I shaped the broader proposal to unify wrist-roll zoom and pan gestures across other features, such as Map and Gallery. The interaction language I designed for translation became platform language.
Visual Translation, along with the other multimodal AI features whose interaction model I standardized, launched with Mark Zuckerberg's announcement of Meta's first consumer display glasses, Meta Ray-Ban Display with Meta Neural Band, at Meta Connect 2025. Jiaqian and I also filed the patent on the same day!




After Visual Translation shipped on Day 0 alongside all the other multimodal AI experiences I designed for the Meta Ray-Ban Display, the first round of post-launch user research and dogfooding surfaced three issues:
I designed two solutions in partnership with the Interaction Model and Education teams:
I also refined the animation timing based on feedback from internal demo events and UXR, and I split the Follow Up Actions timing into two paths — what happens when the user takes an action before the summarization TTS has finished versus after it finishes. Day 0 had committed to a single behavior; Day 90 let the user act when they were ready instead of waiting on TTS.
The Day 90 release closed loops that Day 0 had left open. It also gave me a sharper picture of the gesture-discoverability problem on glasses in general, a problem I'd seen coming at launch but hadn't fully solved.

To make Let It Out as dreamy and healing as possible, I incorporated three elements: acrylic 3D fonts, a yellow-orange or blue-gray fog that appears and disappears, and the clouds the text transforms into at the end.



