Interaction Model

Meta Ray-Ban Display + Neural Band: Visual Translation

A new interaction model for real-time, on-display language translation on Meta's first smart glasses with in-lens display — patent filed, launched at Meta Connect 2025.

Role

Lead Product Designer
(sole designer)

XFN Partners

1 PM
2 engineers
2 content designers
1 UX researcher
Interaction Model team
Education team

Timeline

Dec 2024 - Oct 2025

Abstract

Overview.

Visual Translation is a real-time written language translation feature on Meta's first display glasses. You can look at foreign text in the world — a menu, a street sign, a poster — and read the translation on the glasses' in-lens display directly. With the multimodal AI silent-entry strategy I co-authored, you don't even have to say "Hey Meta" out loud to trigger the visual translation experience.

・・・・・・・・・・・・

It's a feature with no precedent on the form factor. Our in-lens display is transparent — the user is looking through the lens at the real-world text, and the translation has to land somewhere that doesn't compete with reality. The model's understanding of where the text is doesn't always match what the user is looking at.

・・・・・・・・・・・・

I was the sole designer on Visual Translation from 2024 through its Day 0 launch on Meta's first display glasses and its post-launch refinement. I designed the interaction model from scratch with engineer Jiaqian Wu, and we refined it together so that real-world text and its translation are easy to read. The feature shipped as one of the launch experiences for multimodal AI on the display glasses, was announced at the Meta Connect 2025 keynote, and has a patent filed for its interaction model.

Context × Scope

Translation on a transparent display the size of a postage stamp

Visual Translation was scoped around two Day 0 scenarios:

  • Short-form texts — menus, street signs. Fast, glanceable.
  • Long-form texts — posters, paragraphs. More text than fits in one glance.

As sole designer, I owned the interaction model, on-display rendering and navigation, the gesture vocabulary, the internationalization redesign, and the design POV for the silent multimodal AI trigger. I partnered with the Interaction Model and Education teams to define the platform-wide gesture patterns and contextual gesture tutorials.

Pain Points × Problem Statement

Discovery, gesture friction, and legibility were all breaking user trust

When I joined the project, three problems were compounding:

  • Discovery — Users didn't know they could zoom or pan. They'd miss half a translated menu and walk away thinking the feature was broken.
  • Gesture friction — Captouch zoom (on the temple of the glasses) was easily mistaken for a tap when users wiped up or down. Head IMU panning was jittery and too sensitive — a small head movement sent the view flying. Both inputs felt like they were fighting the user instead of serving them.
  • Legibility — On a transparent display competing against real-world light, contrast is fragile. The resolution of the camera capture also made small text harder to read than it needed to be.

Each problem on its own looked like a simple fix. Together, they were a trust problem. If the user can't discover, can't navigate, and can't read the translation, the AI feels broken regardless of whether the model is right.

Design Principles × Problem Spaces

Intuitive, Easy, Quick

I anchored the work on three principles drawn directly from how users described what they needed when translating on the go:

  • Intuitive — the whole end-to-end interaction should feel natural and instinctive, without the need for excessive explanation.
  • Easy — the gesture vocabulary should reward small, natural movements, not demand precise manipulation.
  • Quick — every step from invocation to translation consumption should compound toward speed.

These principles funneled into a clean framing:
Optimizing for legibility of short texts and understanding of long texts.
Two scenarios, two different priorities, one consistent interaction language.

From there, I broke the experience into 3 questions that guided the focus of design decisions:

  • Translation Result Format — how might we show the translated result in AI's audio response and on the display?
  • Navigation — how might we enable the user to move through the translated content (zoom and pan)?
  • Follow Up Actions — what's the relationship between the AI's audio response and the display, and what can the user do after the translation lands?
The translation result adapts to how specific the user's request is: if the user doesn't specify what to translate, the AI gives a summary; if they point to a word with their finger, the AI gives a direct translation of that word.

Interaction Model 0 -> 1 solution

Made it SMART

The biggest design decision was the gesture vocabulary. The existing model — Captouch zoom + Head IMU pan — was producing too much friction for the user and too much instability for the visual experience. I proposed a new set of interactions that work together and complement each other:

  • Selection: Auto-zoom snap
    On invocation, the system auto-zooms to the bounding box of the text with the largest font, on the assumption that it is the title. Short-form texts get framed for a single glance; long-form texts get framed for reading flow. This closed the discovery problem: the user doesn't have to find zoom because the system zooms first. The user can then read the translation on the in-lens display with D-pad swipes. Each swipe auto-zooms to a legible level with minimal image movement, so the user doesn't get dizzy or overwhelmed by continuous zooming in and out.
  • Pan: EMG pinch hold + move
    If the user wants to navigate to other parts of the translated document/image, they can pinch and drag to the exact spot quickly with free panning.
  • Zoom: wrist roll
    A small, intentional rotation of the wrist while wearing the Neural Band scales the translated image up or down freely. The gesture is small enough to do without arm fatigue and precise enough to land at a legible zoom level on the first try.
Free panning + zooming (optimized for long-form text consumption)

Reducing manual manipulation and increasing automation through the smart auto-zoom snap selection was the answer, both visually for the user and technically for the system. At least for the most part: we still enable free panning and zooming, which is more manual, because nothing is worse than a user feeling stuck in the experience.

Smart auto-zoom snap (optimized for short-form text consumption)

Engineering needed many detailed considerations and guidelines to implement a seamless smart auto-zoom snap experience, from defining the max zoom level to requesting a suitable image resolution. Some of the tradeoffs were real, including the tension between legibility and latency and the question of how to mask that latency with transition animation. Significant iteration also went into having the system read documents the way a human would, so that when the user swipes, the next selection matches what they intend and expect.
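
The selection logic described above can be pictured with a minimal sketch. It is an illustration only, not the shipped implementation: the `TextBlock` fields, the `MAX_ZOOM` value, the row bucket size, and the top-to-bottom, left-to-right reading order are all assumptions made for the example. Only the 600x600px display size and the idea of starting on the largest text come from the project itself.

```python
from dataclasses import dataclass

DISPLAY_PX = 600    # in-lens display is 600x600px (from the case study)
MAX_ZOOM = 4.0      # hypothetical cap; the real max zoom level is not stated
ROW_BUCKET_PX = 20  # hypothetical: y positions are bucketed into 20px rows


@dataclass(frozen=True)
class TextBlock:
    """One OCR text group in image pixel coordinates (hypothetical shape)."""
    text: str
    x: float
    y: float
    width: float
    height: float
    font_px: float  # estimated glyph height


def initial_selection(blocks):
    """Start on the block with the largest font, assumed to be the title."""
    return max(blocks, key=lambda b: b.font_px)


def zoom_for(block):
    """Largest zoom that still fits the whole block, capped at MAX_ZOOM."""
    fit = min(DISPLAY_PX / block.width, DISPLAY_PX / block.height)
    return min(fit, MAX_ZOOM)


def reading_order(blocks):
    """Sort by row bucket (top to bottom), then left to right within a row."""
    return sorted(blocks, key=lambda b: (round(b.y / ROW_BUCKET_PX), b.x))


def next_selection(blocks, current):
    """Block a forward swipe moves to; stays on the last block at the end."""
    ordered = reading_order(blocks)
    i = ordered.index(current)
    return ordered[min(i + 1, len(ordered) - 1)]
```

In this sketch a swipe never pans freely. It jumps to the next block in reading order and recomputes the zoom for that block, which is one way to keep image movement to a minimum between selections.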

However, the text blocks are constrained by the OCR (optical character recognition) grouping and by the 600x600px display, which makes auto-zoom snap a poor fit for long-form text. So I designed a way for users to read paragraphs on the in-lens display when they need to.

  • Reader mode for long-form translated text
    When the user zooms into a long-form translation, such as paragraphs on a poster, the rendering shifts into the standard system text template UI, where text is already optimized for the 600x600px display in sizing and scrolling. This reduced the legibility issues raised in UXR feedback at launch.
Reader mode on-device prototype

I redesigned and iterated on this launch experience based on executive, UXR, internal employee and design system feedback, refining the interaction logic in lock-step with Jiaqian.

This proposal didn't just live inside Visual Translation. Working with the Interaction Model team — in collaboration with Alex Gerrese — I shaped the broader proposal to unify wrist-roll zoom and pan gestures across other features, such as Map and Gallery. The interaction language I designed for translation became platform language.

Visual Translation, along with the other multimodal AI features whose interaction model I standardized, launched with the announcement of Meta's first consumer display glasses, Meta Ray-Ban Display with Neural Band, by Mark Zuckerberg at Meta Connect 2025. Jiaqian and I also filed the patent for it on the same day!

Post-Launch Refinement

Closing the gap between what I designed and how users actually used it

After Visual Translation shipped on Day 0 alongside the other multimodal AI experiences I designed for the Meta Ray-Ban Display, the first round of post-launch user research and dogfooding surfaced three issues:

  • Gesture difficulty.
    Even with the wrist-roll model — designed to be more intuitive than the previous Captouch and Head IMU pairing — users were having trouble with zoom and swipe-to-select. The gestures were new enough that discovery and confidence weren't where they needed to be.
  • Accidental exits.
    The "back" gesture was too close to the gestures used inside Visual Translation. Users were swiping in ways that exited the feature when they meant to navigate within it, losing the translation mid-read.
  • Gesture conflict.
    Volume and zoom were both mapped to gestures that overlapped — a single intentional motion was producing two unintended outcomes.

I designed two solutions in partnership with the Interaction Model and User Education teams:

  • Contextual EDU.
    I designed educational tooltips that surface at the right moment — when the user enters Visual Translation for the first time and would otherwise have to discover the gesture vocabulary on their own. I followed the contextual EDU playbook for display behavior: show the tooltip on first use, and show it a second time if the user dismisses it prematurely. The trade-off is real — too much EDU is intrusive, too little leaves users stranded. Anchoring to the playbook kept the behavior consistent across features and respected the user's attention.
  • Revised zoom mode spec.
    Working with the Input team, I drove a revised spec for how zoom mode behaved at the system level, resolving the volume/zoom conflict and giving the gesture vocabulary cleaner boundaries. The fix wasn't just inside Visual Translation — it changed how zoom worked across the device, so other features couldn't accidentally collide with the same input.
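
The contextual EDU rule above (show on first use, show once more after a premature dismissal) is small enough to sketch. This is a hedged illustration, not the playbook's actual logic: `TooltipState`, `MIN_READ_SECONDS`, and the idea of defining "premature" by how long the tooltip stayed visible are assumptions made for the example.

```python
from dataclasses import dataclass

MIN_READ_SECONDS = 2.0  # hypothetical: a faster dismissal counts as premature
MAX_SHOWS = 2           # first use, plus one repeat after a premature dismissal


@dataclass
class TooltipState:
    """Per-user tooltip history for one feature (hypothetical shape)."""
    times_shown: int = 0
    needs_repeat: bool = False


def should_show(state):
    """Show on first entry, and once more if the last dismissal was premature."""
    if state.times_shown == 0:
        return True
    return state.needs_repeat and state.times_shown < MAX_SHOWS


def record_dismissal(state, visible_seconds):
    """Update the history after the tooltip has been shown and dismissed."""
    state.times_shown += 1
    state.needs_repeat = visible_seconds < MIN_READ_SECONDS
```

Capping the count at two is what keeps the tooltip from becoming intrusive: a user who dismisses it quickly twice is not shown it a third time.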

I also refined the animation timing based on feedback from internal demo events and UXR, and I split the Follow Up Actions timing into two paths — what happens when the user takes an action before the summarization TTS has finished versus after it finishes. Day 0 had committed to a single behavior; Day 90 let the user act when they were ready instead of waiting on TTS.

The Day 90 release closed loops that Day 0 had left open. It also gave me a sharper picture of the gesture-discoverability problem on glasses generally — a problem I'd seen coming at the launch but hadn't fully solved.
