
May 29, 2023

Google Taught AI Model Flamingo to Write Descriptions for YouTube Videos

by Damir Yalalov

Published: May 29, 2023 at 2:00 pm Updated: May 29, 2023 at 10:55 am

by Karolina Gaszcz

Edited and fact-checked: May 29, 2023 at 2:00 pm

In Brief

Flamingo solves the problem of short videos being difficult to find through search by automatically creating descriptions.

Google DeepMind, the AI research laboratory, has developed a visual language model called Flamingo that can write descriptions for short videos on YouTube. Short videos are often hard to find through search because their descriptions lack the necessary information. Flamingo addresses this by automatically generating text for millions of short video clips; the text is stored as metadata and used “behind the scenes” to power search. Although video authors won’t see the metadata, it helps viewers find and navigate the shorts. Flamingo is already describing newly uploaded clips and is also working through older videos that have been on YouTube for a long time.


In the past, Google introduced an algorithm that lets people search for information inside videos from the search bar. Recently, TwelveLabs raised $12 million from investors for a similar product. These tools give video creators new ways to increase their reach and visibility. By using AI to simplify the search and discovery of short-form content, DeepMind and startups working on similar problems are changing how video streaming services work. They are contributing to more intelligent and efficient search technologies, making it simpler for viewers to find content that interests them.

Artificial intelligence is playing a significant role in upgrading search technologies. Flamingo analyzes a clip’s content and generates text that summarizes it to help users navigate. The model uses deep neural networks to produce a textual description of a video clip based on its audio and visual content, turning the auditory and visual components of short-form content into a summary that is easy to search for and access.

AI can help surface important information that creators might miss when writing descriptions by hand. Manually capturing every detail is time-consuming and not always practical, especially given the constant flow of short-form video uploaded to platforms like YouTube. Sparse descriptions can leave users confused and frustrated when they search for specific short-form content. With visual language models such as Flamingo, the metadata can be generated automatically to provide a searchable summary, saving time and making search more efficient and accurate.
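To make the search benefit concrete, here is a minimal sketch of how machine-generated descriptions could feed a keyword index. It is not YouTube's or DeepMind's implementation; the video IDs, the descriptions, and the function names (`tokenize`, `build_index`, `search`) are all invented for illustration, and a production search system would be far more sophisticated.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(descriptions):
    """Map each token to the set of video IDs whose description contains it."""
    index = defaultdict(set)
    for video_id, text in descriptions.items():
        for token in tokenize(text):
            index[token].add(video_id)
    return index


def search(index, query):
    """Return the sorted video IDs whose descriptions contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Hypothetical generated descriptions, standing in for model output.
generated = {
    "short_001": "A dog catches a frisbee on a sandy beach",
    "short_002": "Time-lapse of a chef plating sushi",
    "short_003": "A dog learns to skateboard in a parking lot",
}
index = build_index(generated)
print(search(index, "dog"))
print(search(index, "dog beach"))
```

A clip uploaded with no creator-written description at all would still be reachable here, because the index is built from the generated text rather than from what the uploader typed.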

Flamingo Sets New State-of-the-Art Visual Language Models For Open-ended Tasks

Flamingo is a single visual language model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended multimodal tasks. It receives a prompt consisting of interleaved images, videos, and text as input and outputs the associated language. As with large language models (LLMs), Flamingo’s prompt interface can steer the model toward accomplishing a multimodal task. Given a few example pairs of visual inputs and expected text responses composed in Flamingo’s prompt, the model can be asked a question about a new image or video and then generate an answer.
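The few-shot setup described above can be sketched as a small data structure. Flamingo has no public API that this example calls; the `Example` class, the `build_few_shot_prompt` function, the file paths, and the use of an `<image>` placeholder to mark where each visual input sits in the text are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Example:
    media_path: str  # hypothetical path to an image or video clip
    text: str        # expected text response for that visual input


def build_few_shot_prompt(
    examples: List[Example], query_media: str, query_prefix: str = ""
) -> Tuple[List[str], str]:
    """Interleave support examples with a final, unanswered query.

    Returns the ordered list of media paths and the text prompt, in which
    each "<image>" placeholder (an assumption of this sketch) lines up with
    the media item at the same position.
    """
    media = [ex.media_path for ex in examples]
    parts = [f"<image>{ex.text}" for ex in examples]
    # The query comes last and has no answer; the model would complete it.
    media.append(query_media)
    parts.append(f"<image>{query_prefix}")
    return media, "\n".join(parts)


support = [
    Example("clips/chinchilla.jpg", "This is a chinchilla."),
    Example("clips/shiba.jpg", "This is a shiba."),
]
media, prompt = build_few_shot_prompt(support, "clips/flamingo.jpg", "This is")
print(media)
print(prompt)
```

The point of the structure is that nothing about the model changes between tasks: swapping the support examples is enough to switch from captioning to, say, question answering.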

Flamingo fuses large language models with powerful visual representations and is trained on a mixture of complementary large-scale multimodal data that comes only from the web, without using any data annotated for machine learning purposes. Given as few as four examples per task, it beats all previous few-shot learning approaches, and in several cases it outperforms methods that are fine-tuned and optimized for each task independently and use multiple orders of magnitude more task-specific data. DeepMind also tested the model’s qualitative capabilities beyond its current benchmarks, for example by captioning images related to gender and skin color and running the generated captions through Google’s Perspective API, which evaluates the toxicity of text. Flamingo makes it possible to efficiently adapt to these examples and other tasks on-the-fly without modifying the model, and it demonstrates out-of-the-box multimodal dialogue capabilities.
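The toxicity check mentioned above amounts to scoring each generated caption and flagging the ones above a threshold. The sketch below shows only that filtering step. It does not call the Perspective API: `toy_score` is a deliberately crude stand-in scorer, and the function names, the word list, and the 0.3 threshold are assumptions chosen for this example.

```python
from typing import Callable, Dict, List


def flag_captions(
    captions: List[str], score_fn: Callable[[str], float], threshold: float = 0.5
) -> Dict[str, float]:
    """Return the captions whose score meets or exceeds the threshold."""
    flagged = {}
    for caption in captions:
        score = score_fn(caption)
        if score >= threshold:
            flagged[caption] = score
    return flagged


def toy_score(text: str) -> float:
    """Stand-in scorer: the fraction of words found in a tiny blocklist.

    A real evaluation would replace this with a call to a toxicity
    classifier; this function exists only so the sketch runs on its own.
    """
    blocklist = {"stupid", "ugly"}
    words = [w.strip(".,!?") for w in text.lower().split()]
    if not words:
        return 0.0
    return sum(w in blocklist for w in words) / len(words)


captions = [
    "A person smiling at the camera",
    "What an ugly stupid photo",
]
flagged = flag_captions(captions, toy_score, threshold=0.3)
print(flagged)
```

Passing the scorer in as a function keeps the filtering logic independent of whichever classifier is actually used.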

Flamingo is an effective and efficient general-purpose family of models that can be applied to image and video understanding tasks with minimal task-specific examples. Its abilities pave the way towards rich interactions with learned visual language models that can enable better interpretability and exciting new applications, like a visual assistant.



About The Author

Damir is the team leader, product manager, and editor at Metaverse Post, covering topics such as AI/ML, AGI, LLMs, Metaverse, and Web3-related fields. His articles attract an audience of over a million users every month. He has 10 years of experience in SEO and digital marketing. Damir has been mentioned in Mashable, Wired, Cointelegraph, The New Yorker, Inside.com, Entrepreneur, BeInCrypto, and other publications. He travels between the UAE, Turkey, Russia, and the CIS as a digital nomad. Damir earned a bachelor's degree in physics, which he believes has given him the critical thinking skills needed to be successful in the ever-changing landscape of the internet.

