Mumbai: In 2022, when general-purpose large language models and image synthesis models were released, their ability to generate human-like text and strikingly creative images captured our imagination. Since then, Generative AI, the technology buzzword of 2023, has empowered creators, disrupted industry workflows, and sharply reduced the effort and cost of content creation across a range of formats. Through the year that followed, we witnessed the rapid emergence of newer and more sophisticated models, driven by innovations in model architectures, the availability of large datasets, and access to the massive compute that advanced cloud GPUs offer innovators.
As we step into 2024, the technology is reaching a point where it has myriad real-world use cases for production studios in the film, music and video domains. Reports also predict that AI will play an increasingly important role in the M&E domain: Generative AI in this sector is projected to scale from USD 1,463.91 million in 2023 to USD 14,779.10 million by 2032, growing at a CAGR of 29.3%.
To understand the transformative impact, one just has to look at traditional media production workflows. In a traditional production setup, one needs a team of writers to construct the story, a photography and cinematography department to create and edit visuals, a visual effects (VFX) team to enhance them, and sound production for music and background scoring, while elements like costume and set design also come into play. Beyond the effort that goes into producing the central narrative, studios also spend a disproportionate amount of time acquiring stock footage, supplemental footage or b-roll, sound effects and visual effects, and on storyboarding and audio, image and video editing.
Generative AI technologies, however, can reduce the cost and effort of content production in a number of key ways. A select set of Generative AI models, carefully fine-tuned on the right content and deployed on advanced cloud GPUs, can empower media production studios to build a media creation pipeline that augments their existing workflow, improves efficiency, drives down costs, and opens up new revenue streams.
One such technology is Stable Diffusion, a deep learning AI model that synthesizes new images from text prompts. It is primarily used to generate detailed creative images, though it can also take on tasks like inpainting, outpainting, and image-to-image translation guided by a text prompt. The most recent Stable Diffusion model, known as SDXL, produces high-quality images from simple text prompts.
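To make this concrete, the snippet below is a minimal sketch of generating an image with SDXL through Hugging Face's diffusers library; the prompt and output filename are illustrative, and a CUDA-capable cloud GPU is assumed.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base checkpoint published on the Hugging Face Hub
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")  # assumes a CUDA-capable cloud GPU

# A simple text prompt is enough to get a detailed image
prompt = "cinematic still of a rain-soaked neon street market, 35mm film look"
image = pipe(prompt).images[0]
image.save("concept_frame.png")
```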
A key aspect of Stable Diffusion is that it can be fine-tuned to create more ‘on-brand’ images. This allows one to drastically cut down the cost of commercial photo shoots without sacrificing creativity. For example, by training Stable Diffusion on product images, one can generate thousands of high-quality product variations, complete with diverse styles, colors, and features. This not only facilitates rapid prototyping and design exploration but also enhances the creative potential for marketing materials and presentations. Stable Diffusion models can also be harnessed to create storyboards, so that one can pre-visualize a script before it goes into production.
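One common route to such fine-tuning is to train a lightweight LoRA adapter (for instance with DreamBooth) on the brand's images and load it at inference time. In the sketch below, the adapter path, product name and prompts are all hypothetical.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical LoRA adapter, fine-tuned on the studio's own product shots
pipe.load_lora_weights("./loras/acme-bottle-style")

# Generate on-brand variations by varying the prompt
styles = ["studio white background", "outdoor lifestyle shot", "flat lay with props"]
for i, style in enumerate(styles):
    image = pipe(f"photo of the acme water bottle, {style}").images[0]
    image.save(f"variation_{i}.png")
```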
Alongside text-to-image AI models, a number of image-to-video models have also started emerging. Stable Video Diffusion (or SVD) is an AI model that takes a single image as input and generates a video clip from it. In the near future, this would allow studios to drastically cut down the cost of acquiring stock footage and b-roll, and help democratize media creation further. Then there is LaVie, an open-source text-to-video model that generates video clips from simple text prompts.
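As a rough sketch, SVD can also be run through the diffusers library; the input image path and output filename below are placeholders.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the published image-to-video SVD checkpoint
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# A single still image is the only input (placeholder path)
image = load_image("establishing_shot.png").resize((1024, 576))

# Generate a short burst of frames and write them out as a clip
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "b_roll_clip.mp4", fps=7)
```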
Images and visuals are only one of the many areas that Generative AI unlocks in the M&E industry. Audio is another domain where a major shift is underway, with powerful Generative AI technologies simplifying music and voice synthesis.
AudioCraft, for instance, is a suite of AI models developed by Meta that generates high-quality audio and music from simple text inputs. It consists of three models designed to make sound creation accessible: MusicGen, which generates musical compositions and melodies from text prompts; AudioGen, which produces sound effects and ambient audio from text descriptions, letting users create custom soundscapes and atmospheres; and EnCodec, a neural audio codec that compresses and reconstructs the audio the other two models work with. This has enormous potential in music production.
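As an example, MusicGen can be driven from a few lines of Python using Meta's audiocraft library; the checkpoint choice, prompt and output name below are illustrative.

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pretrained MusicGen checkpoint (the small variant is quick to iterate with)
model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=15)  # seconds of audio to generate

# Describe the desired background score in plain text
descriptions = ["gentle ambient score with soft piano, documentary mood"]
wav = model.generate(descriptions)  # returns a batch of waveforms

# Write the first waveform to disk with loudness normalisation
audio_write("bg_score", wav[0].cpu(), model.sample_rate, strategy="loudness")
```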
For voice synthesis, text-to-speech (TTS) models like xTTS, Bark, Tortoise and FastSpeech can turn written scripts into natural-sounding speech. When combined with Wav2Lip, a lip-syncing model, it is possible to create a full workflow for dubbing voices easily.
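As a sketch of one such workflow, xTTS can be invoked through the Coqui TTS Python package; the reference clip and file paths below are placeholders, and the Wav2Lip step would run separately on the resulting audio and video.

```python
from TTS.api import TTS

# Load the multilingual xTTS v2 model from Coqui's model zoo
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Synthesize a dubbed line in the voice of a short reference clip (placeholder paths)
tts.tts_to_file(
    text="Welcome back. Tonight, we travel to the coast.",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="dubbed_line.wav",
)
# The resulting dubbed_line.wav can then be fed to Wav2Lip to sync the
# on-screen speaker's lip movements to the new audio.
```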
Generative AI has even simplified audio transcription. OpenAI’s open-source Whisper is an automatic speech recognition model that is highly capable at multilingual transcription. This has applications in streamlining subtitle generation for films and video.
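A minimal sketch using the whisper Python package is shown below; the audio filename is a placeholder, and the timestamped segments Whisper returns map directly onto subtitle cues in the SRT format.

```python
import whisper

model = whisper.load_model("medium")  # larger checkpoints trade speed for accuracy
result = model.transcribe("episode_audio.mp3")  # placeholder filename

def to_srt_time(t: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamps SRT expects."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02},{int((t % 1) * 1000):03}"

# Each segment carries start/end timestamps and text, i.e. one subtitle cue
with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n")
        f.write(f"{seg['text'].strip()}\n\n")
```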
There are numerous examples already of Generative AI being used in media production workflows. Al Jazeera, for instance, recently launched a ‘History Illustrated’ series in which stories from history are depicted through AI-generated graphics. A Detroit-based video creation company recently showcased the potential of purely Generative AI powered media creation through a 12-minute short film called The Frost. To create the film, they first generated images with AI, then turned those images into video clips, which were stitched together into the finished piece. The result is uncanny and strange, but it points to interesting possibilities.
Future film, media and music production studios will embed this technology deep into their workflows, not only to drive efficiency, but to unlock new creative formats we have yet to see. We are only beginning to witness the creative possibilities of this potent technology.
The author of this article is E2E Networks Ltd chief revenue officer Kesava Reddy.