Google Introduces Gemini Omni: How To Turn Image, Text, Video And Audio Into Single Output

The first version of Gemini Omni has been named Gemini Omni Flash, and it is available on the Gemini app, Google Flow and YouTube Shorts.

At Google I/O 2026, the tech giant unveiled Gemini Omni, a new multimodal AI model that can create and edit videos using text, images, audio and video prompts. Google says the new model is a major step toward making Gemini a fully creative AI platform capable of understanding and generating different forms of media. 

Gemini Omni is designed to turn different types of content into one video output. Users can upload photos, drawings, existing videos or voice references, or simply type prompts. The AI then combines all the inputs to create a single video. A standout feature of Gemini Omni is conversational editing, where users describe the changes they want in plain language.

The first version in the Omni family has been named Gemini Omni Flash and is now rolling out through the Gemini app, Google Flow and YouTube Shorts. 

Step-By-Step Guide To Use Gemini Omni

First, log in to the Gemini app or Google Flow and open the latest Gemini interface, which features Google's new, fluid "Neural Expressive" design layout.

Next, upload images, scenes or artwork as visual references. You can also add existing video clips, or use text prompts to describe a scene, camera movement, effects or animation style. Add voice references to guide audio and tone. Then ask Gemini Omni to blend all the inputs into a single video.

Users can then continue editing the video through conversation rather than with conventional editing tools. For example, you can give Gemini Omni commands such as “turn a mirror into liquid,” “change the background to a futuristic city,” or “add cinematic lighting,” and it will update the video while keeping the characters and movement consistent.

Once the video is done, export it. Google says every AI-generated video includes an invisible SynthID watermark for verification and transparency across Google platforms. 

For now, only voice references are supported for audio inputs, but Google plans to add more audio features later. 
