AI-powered multimodal interfaces will redefine customer experience. And the best ones won’t just respond—they’ll intuit: when to speak, when to listen, and when to get out of the way.
But we’re not there yet—and technology alone won’t get us there. We need a fundamental recalibration of how voice, touch, type, and sound choreograph a truly seamless experience.
Below I explore why this shift could be as revolutionary as the iPhone’s multi-touch breakthrough in 2007—and why, as automotive UX reminds us, it’s far harder than it looks.
Like any technology, AI isn’t the answer—it’s the enabler. But what it will enable is nothing short of a revolution in digital interface and customer experience. That starts with the multimodal ways we’ll interface with it.
The promise is huge, easily as big as the revolution the iPhone started.
Critical to all of this will be voice. But let’s be honest: we’re all still recovering from the collective hangover of really bad voice interfaces. Tech that has overpromised, underdelivered, and lacked the visual or auditory cues that make human interaction feel natural.
We’ve all been there: asking Alexa to play the “right” song, getting absurdly off-base responses from Siri, or grappling with in-car voice commands so bad that most of us have forgotten they exist. For many of us, these voice-driven assistants have become little more than a way to set timers and check whether we need a jacket.
But things are changing. The underlying AI tech has grown leaps and bounds in the last several years. Now it’s time for the interface to catch up—and I believe multimodality is the unlock.
We’re talking about interfaces that move fluidly between voice, sound, text, and visuals. Experiences that feel human, respond to nuance, and span channels without friction. Not just intuitive—inevitable.
So, how do we get there? Let’s first look at a few early examples of AI-powered experiences and multimodal UI in the world of automotive.
No question: the early days were bumpy, but things are getting better. In this use case, voice is proving to be a massive unlock, and companies like SoundHound, Cerence, Amazon, and Google are making meaningful progress. Multimodal interfaces in cars, with voice at their center, are becoming more reliable and are seeing increased, albeit limited, usage. I say “limited” because widespread adoption across the full spectrum of potential functionality is still not there. So while I may have confidence in asking my car to “get directions to Wrigley Field,” I don’t even know where to start when it comes to asking if my left front tire pressure is low.
While AI-powered natural-language interfaces will unlock some of this, these experiences still lack foundational interface patterns that are intuitive and easy to grasp.
Voice interaction needs more than a brief pause or a glowing blue orb. Users need to intuitively understand the basic functions: how to go back, move forward, make a selection, or revise a command. Today’s voice-first experiences are still patchy and often frustrating. What’s missing are the consistent audio and visual cues that teach users the grammar of interaction.
That will become table stakes for the next generation of interfaces, and right now we’re still at the starting line. Apps like ChatGPT and Google Maps hint at what’s possible. Immersive platforms like Apple Vision Pro go further, pointing to a future that’s richer and more responsive, but that’s only possible because high-end hardware is doing the heavy lifting.
If that future is going to scale, it has to be accessible and designed for everyone. Right now, we don’t have universal interface patterns, and consumer expectations are still coalescing. But we do have a blueprint—and ironically, it starts with our own everyday frustrations.
To create the next generation of AI-powered UI, I believe we need to solve four key challenges:
1. Clear Cues: Make Engagement Intuitive
We’ve all had that moment of uncertainty—wondering when it’s your turn to speak during a voice interaction or how to build on what’s just been said. The solution? Subtle, intentional sound cues that signal when to respond—and how long the window lasts. Think of it as an auditory interface language, as natural as tapping a button or swiping a screen.
When there is a screen and the ability to be multimodal, visual cues should reinforce the rhythm. Audio and visual feedback must harmonize. That’s the heart of good multimodal design.
2. Complementarity: Let Modalities Amplify, Not Echo
Each mode—voice, touch, gesture, visual—has its strengths. They shouldn’t simply mirror each other; they should specialize. Let visuals structure information. Let voice convey tone or urgency. When done right, they don’t overlap—they amplify.
Imagine a smart display guiding you through a recipe: narrating steps aloud while showing the ingredients and techniques. That’s not redundancy—it’s clarity with dimension.
3. Context Awareness: Adapt to the Moment
The voice command that works perfectly at home might fail on a noisy subway. Environmental awareness isn’t just a nice-to-have—it’s a real unlock. AI-powered interfaces—especially multimodal ones—can and should adapt to your environment, your device, and even the task at hand.
In quiet spaces, voice might take the lead. In a crowded café, text should take over as the default. The best systems don’t just respond—they pivot in real time.
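To make that pivot concrete, here is a minimal sketch of a default-modality chooser. Everything in it is an assumption for illustration: the `Context` fields, the `NOISY_DB` threshold, and the function name are hypothetical and not taken from any real product or library.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    VOICE = "voice"
    TEXT = "text"


@dataclass
class Context:
    """Hypothetical snapshot of the user's situation."""
    ambient_noise_db: float  # assumed to come from the device microphone
    has_screen: bool         # whether a display is available at all


# Illustrative threshold only; a real system would tune this empirically.
NOISY_DB = 65.0


def default_modality(ctx: Context) -> Modality:
    """Pick the modality that leads; the user can always override it."""
    if not ctx.has_screen:
        return Modality.VOICE  # no display, so voice is the only option
    if ctx.ambient_noise_db >= NOISY_DB:
        return Modality.TEXT   # crowded café: text takes over as the default
    return Modality.VOICE      # quiet space: voice takes the lead
```

The point of the sketch is that the environment sets the default, not the limit: the chosen modality is where the interaction starts, and the user stays free to switch.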
4. Fluid Modality Switching: Flow, Not Friction
This is where most experiences break down. You should be able to start with voice, shift to typing, then tap to confirm—without starting over or losing your place.
True multimodality means the system meets you where you are. It keeps up as you move between modes—seamlessly, effortlessly. No resets, no friction. Just fluid interaction on your terms.
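One way to picture "no resets" is a single session that every modality writes into. The sketch below is hypothetical: the `Session` and `Turn` names, the slot keys, and the restaurant-booking example are assumptions chosen for illustration, not a description of how any existing assistant works.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    modality: str  # e.g. "voice", "text", "touch"
    content: str


@dataclass
class Session:
    """Hypothetical shared state that outlives any single input mode."""
    turns: list[Turn] = field(default_factory=list)
    draft: dict[str, str] = field(default_factory=dict)  # task details gathered so far

    def handle(self, modality: str, content: str, **slots: str) -> None:
        # Every input lands in the same history and the same draft,
        # so changing modality never resets progress.
        self.turns.append(Turn(modality, content))
        self.draft.update(slots)


# Start with voice, shift to typing, then tap to confirm.
session = Session()
session.handle("voice", "Book a table for two", party_size="2")
session.handle("text", "Friday at 7pm", when="Friday 19:00")
session.handle("touch", "confirm", confirmed="yes")
```

Because the state lives in the session rather than in the voice, text, or touch handler, the user's place in the task survives each switch.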
AI-powered UIs, and the customer experiences they’ll deliver, promise to spark a revolution every bit as profound as multi-touch and the iPhone. And like that seismic shift, patterns will emerge, new ideas will be born, and a transformation will unfold.
The future of interfaces will know when to speak, when to listen, and when to get out of the way. If we get the cues, context, and choreography right, we won’t just improve experiences—we’ll make them feel like the inevitable next step in our digital evolution. Because the best tech doesn’t just work. It disappears.
Designers, it’s time to get wildly creative again.

