Generative AI’s Leap into Video in 2024 and its Ethical Horizon
Exploring the Boundaries of Creativity and Responsibility in the Age of AI-Driven Media
Introduction
In 2024, the field of Generative AI takes a big step forward with the launch of models that convert text into dynamic video, altering the landscape of digital content creation. Two notable examples are EMU VIDEO and Genie, each pioneering a different aspect of video generation. EMU VIDEO, with its factorized approach, creates colourful, coherent clips from text prompts. Genie pushes the boundary further, leveraging deep learning to synthesise interactive video content that blurs the line between the virtual and the real. These achievements mark a significant shift toward more immersive, creative AI applications, while also drawing us into complicated ethical debates about authenticity, copyright, and ownership of AI-generated content in the digital era.
The Leap into Video: Breakthrough in 2024
Genie: Generative Interactive Environments
Genie, a Generative Interactive Environments model developed by the Google DeepMind team, is trained in an unsupervised manner from unlabelled Internet videos. The model enables the creation of action-controllable virtual worlds from text, synthetic images, photographs, and sketches. Genie can be considered a foundation world model because it offers a versatile and scalable platform for generating and interacting with a diverse array of virtual environments. It consists of three main components: a video tokenizer, an autoregressive dynamics model, and a latent action model. Genie lets users interact with generated environments on a frame-by-frame basis, and it is trained without any ground-truth action labels or domain-specific requirements.
The Genie system takes a video clip containing T frames as input and converts them into tokens using the video tokenizer. These tokens represent the essential visual information in each frame, much as words represent meaning in a sentence. The system then uses a latent action model to infer what actions might be happening between each pair of consecutive frames. These actions are latent, meaning they are not directly observed but inferred from the visual information. The inferred actions, along with the tokens, are fed into the dynamics model to generate future frame predictions iteratively.
- VQ-VAE (Vector Quantized-Variational AutoEncoder) takes sequences of video frames as input and compresses them into a set of discrete tokens. These tokens represent the essence of the video data in a compact form.
- ST Transformer (spatiotemporal transformer) is an extension of the transformer architecture designed to handle spatiotemporal data, i.e. data involving both space and time, such as video. Videos depict moving objects and changing scenes, which inherently involve both spatial information (what is in the frame) and temporal information (how things change over time). It is used in both the encoder and decoder of the VQ-VAE to capture the dynamics and complexity of video sequences, enhancing the model's ability to generate high-quality video representations.
The video tokenizer in Genie uses the VQ-VAE equipped with an ST transformer to efficiently process and encode video data. By combining these two components, the tokenizer can capture the intricate spatial (e.g., objects within a frame) and temporal (e.g., movement across frames) patterns in videos and compress them into a series of discrete tokens, each of which is a unique code representing a specific pattern or piece of information in a video frame.
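To make the tokenizer concrete, here is a minimal sketch in PyTorch-style Python. The class names, patch size, codebook size, and the linear layer standing in for the ST-transformer encoder are all illustrative assumptions, not Genie's actual implementation.

```python
# Minimal, illustrative sketch of a VQ-VAE-style video tokenizer.
# All names, sizes, and the linear "encoder" are assumptions for illustration only;
# the real Genie tokenizer uses an ST-transformer encoder/decoder and a learned codebook.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous latents to the index of the nearest entry in a learned codebook."""
    def __init__(self, num_codes: int = 1024, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        flat = z.reshape(-1, z.shape[-1])                    # (N, code_dim)
        distances = torch.cdist(flat, self.codebook.weight)  # distance to every code
        token_ids = distances.argmin(dim=-1)                 # nearest-code index per latent
        return token_ids.reshape(z.shape[:-1])               # (batch, time, patches)

class VideoTokenizer(nn.Module):
    """Encodes T frames into discrete token ids, one token per 16x16 patch."""
    def __init__(self, code_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(3 * 16 * 16, code_dim)  # placeholder for the ST-transformer
        self.quantizer = VectorQuantizer(code_dim=code_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        patches = frames.unfold(3, 16, 16).unfold(4, 16, 16)           # cut frames into 16x16 patches
        patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(b, t, -1, c * 16 * 16)
        latents = self.encoder(patches)                                 # continuous patch latents
        return self.quantizer(latents)                                  # discrete token ids

# Two clips of eight 64x64 RGB frames -> 16 tokens per frame.
tokens = VideoTokenizer()(torch.randn(2, 8, 3, 64, 64))
print(tokens.shape)  # torch.Size([2, 8, 16])
```

The latent action model, described next, works on the same principle of compressing information into a small discrete vocabulary, but for actions rather than visual patches.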
- Encoder Input: The latent action model's encoder takes all previous video frames (x1 to xt) together with the next frame (xt+1).
- Encoder Output: It then outputs a series of continuous latent actions (a1 to at) that represent the actions occurring between these frames.
- Decoder Input: The decoder receives both the previous frames and these latent actions as its input.
- Decoder Output: Based on this information, the decoder predicts the next frame (x̂t+1). A minimal sketch of this encoder/decoder interface follows below.
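The sketch below illustrates that interface in the same PyTorch-style Python. Feeding only the most recent frame pair to the encoder, the tiny vocabulary of 8 actions, and the plain MLPs are simplifying assumptions; the real model attends over all previous frames and learns the discrete actions with a VQ objective and a straight-through estimator rather than a plain argmax.

```python
# Illustrative sketch of the latent action model's interface (names, sizes, and the
# frame-pair simplification are assumptions; the real encoder sees x_1..x_{t+1} and
# uses a VQ objective instead of a plain argmax).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Infers a small discrete action between consecutive frames with no action labels."""
    def __init__(self, frame_dim: int = 3 * 64 * 64, num_actions: int = 8, hidden: int = 256):
        super().__init__()
        self.num_actions = num_actions
        # Encoder: looks at a frame and its successor and scores the candidate actions.
        self.encoder = nn.Sequential(nn.Linear(2 * frame_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, num_actions))
        # Decoder: reconstructs the next frame from the current frame plus the chosen action.
        self.decoder = nn.Sequential(nn.Linear(frame_dim + num_actions, hidden), nn.ReLU(),
                                     nn.Linear(hidden, frame_dim))

    def forward(self, x_t: torch.Tensor, x_next: torch.Tensor):
        logits = self.encoder(torch.cat([x_t, x_next], dim=-1))
        action_id = logits.argmax(dim=-1)                              # latent action a_t
        action = F.one_hot(action_id, self.num_actions).float()
        x_next_pred = self.decoder(torch.cat([x_t, action], dim=-1))   # predicted frame
        return action_id, x_next_pred

# The reconstruction loss between x_next_pred and x_next is what forces the latent
# action to capture "what changed between the two frames".
model = LatentActionModel()
a, pred = model(torch.randn(4, 3 * 64 * 64), torch.randn(4, 3 * 64 * 64))
print(a.shape, pred.shape)  # torch.Size([4]) torch.Size([4, 12288])
```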
The dynamics model combines these video tokens and latent actions to predict what the next set of video tokens should be, effectively guessing the appearance of future frames in the video. In simple words, the dynamics model is like a storyteller that, given a summary of the story so far (video tokens) and a description of the characters’ actions (latent actions), tries to predict the next part of the story.
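Continuing the sketch, the dynamics model can be pictured as a sequence model over frame tokens that is additionally conditioned on the latent action chosen for each frame. The sizes below and the vanilla TransformerEncoder (standing in for the paper's ST-transformer) are assumptions for illustration.

```python
# Illustrative dynamics-model sketch: predict the next frame's tokens from past frame
# tokens plus per-frame latent actions. Sizes and the vanilla TransformerEncoder
# (a stand-in for the paper's ST-transformer) are assumptions.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, num_codes: int = 1024, num_actions: int = 8,
                 tokens_per_frame: int = 16, dim: int = 128):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        self.token_emb = nn.Embedding(num_codes, dim)
        self.action_emb = nn.Embedding(num_actions, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_codes)                  # logits over the token codebook

    def forward(self, frame_tokens: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, time, tokens_per_frame) ids; actions: (batch, time) ids
        x = self.token_emb(frame_tokens) + self.action_emb(actions)[:, :, None, :]
        b, t, n, d = x.shape
        h = self.backbone(x.reshape(b, t * n, d))              # attend over the whole token history
        # Read the upcoming frame's prediction off the last frame's positions (a simplification).
        return self.head(h[:, -self.tokens_per_frame:, :])     # (batch, tokens_per_frame, num_codes)

logits = DynamicsModel()(torch.randint(0, 1024, (2, 8, 16)), torch.randint(0, 8, (2, 8)))
print(logits.shape)  # torch.Size([2, 16, 1024])
```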
Genie Inference: The prompt frame is tokenized and paired with the user's latent action, then fed to the dynamics model for iterative generation. The predicted frame tokens are decoded back to image space using the tokenizer's decoder.
This process repeats for each new frame: the model takes the most recent frame (in token form), applies a new latent action, and generates the next frame. The loop continues for as long a video as desired.
Finally, the tokens for each frame are decoded back into images, creating a sequence of frames that form a video.
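Putting the pieces together, the frame-by-frame inference loop can be sketched as follows. Here `tokenizer`, `dynamics`, and `detokenize` stand for the components sketched above (the last being the tokenizer's decoder, not shown), and greedy argmax decoding is a simplification of however the real model samples tokens.

```python
# Hedged sketch of Genie-style frame-by-frame inference. `tokenizer`, `dynamics`, and
# `detokenize` are stand-ins for the components sketched above; greedy decoding is a
# simplification.
import torch

def generate(prompt_frame, user_actions, tokenizer, dynamics, detokenize):
    """Roll the world model forward, one frame per user-chosen latent action."""
    tokens = tokenizer(prompt_frame[:, None])            # tokenize the prompt frame: (b, 1, n)
    frames = []
    for action in user_actions:                          # one latent action id per step
        a = torch.full(tokens.shape[:2], action, dtype=torch.long)
        logits = dynamics(tokens, a)                     # logits for the next frame's tokens
        next_tokens = logits.argmax(dim=-1)[:, None]     # (b, 1, n) greedily decoded tokens
        tokens = torch.cat([tokens, next_tokens], dim=1) # append to the token history
        frames.append(detokenize(next_tokens))           # map tokens back to image space
    return frames
```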
Genie accepts various types of images as prompts: outputs of text-to-image models, hand-drawn sketches, or real-world photos. In the reported results, each initial image is shown alongside the frame obtained after applying a specific latent action four times, and despite these prompts differing from the training data, noticeable character movement is visible in each case.
Furthermore, Genie’s model demonstrates the potential for agent training by generating diverse trajectories in new environments and mimicking expert behaviors using latent actions, paving the way for learning from internet videos without direct action labels.
- Input Prompt: Genie begins with an image of an RL environment it hasn’t seen before.
- Diverse Trajectories: By specifying different latent actions, Genie can simulate various future possibilities or trajectories within that environment. These actions guide the model in predicting what happens next.
- Playability: the resulting environments can effectively be played, with behaviour driven by the initial prompt and the latent actions applied.
EMU VIDEO: Factorizing Text-to-Video Generation by Explicit Image Conditioning
This model, introduced by Meta, is designed to generate videos directly from textual prompts. The work addresses the challenge of text-to-video generation with a two-step factorized approach. First, it generates a starting image from the text prompt by leveraging a latent text-to-image diffusion model such as SDXL (Podell et al., 2023). Second, it uses this image, together with the original text prompt, to generate the subsequent video frames. This factorization significantly strengthens the conditioning signal, yielding more coherent and contextually relevant video outputs.
It involves first generating an image I conditioned on the text p, and then using stronger conditioning, the generated image together with the text, to generate a video V. Videos consist of multiple frames displayed in sequence to create the illusion of motion, so the single generated image has to be adapted for video generation. To condition the video model F on the image, the image is zero-padded temporally, which essentially creates a "pseudo-video" in which only one frame carries the image and the remaining frames are empty. This pseudo-video is then concatenated with a binary mask (containing 0s and 1s) of the same dimensions, indicating which frames hold the actual image content (marked as 1) and which are the empty padded frames (marked as 0).
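A hedged sketch of that conditioning construction is shown below: the function builds the zero-padded pseudo-video and the matching binary mask and stacks them along the channel dimension. The tensor layout, the first-frame placement of the image, and working in pixel space are assumptions for illustration rather than EMU VIDEO's exact implementation.

```python
# Sketch of the image-conditioning input described above: a zero-padded "pseudo-video"
# plus a binary mask marking which frame holds real content. Shapes, the first-frame
# placement, and working in pixel space are illustrative assumptions.
import torch

def build_image_conditioning(image: torch.Tensor, num_frames: int) -> torch.Tensor:
    # image: (batch, channels, height, width), the frame generated from the text prompt
    b, c, h, w = image.shape
    pseudo_video = torch.zeros(b, num_frames, c, h, w)   # temporally zero-padded video
    pseudo_video[:, 0] = image                           # only one frame carries the image
    mask = torch.zeros(b, num_frames, 1, h, w)           # binary mask, same spatio-temporal size
    mask[:, 0] = 1.0                                     # 1 = real image content, 0 = padding
    return torch.cat([pseudo_video, mask], dim=2)        # concatenate along the channel axis

cond = build_image_conditioning(torch.rand(2, 3, 64, 64), num_frames=8)
print(cond.shape)  # torch.Size([2, 8, 4, 64, 64])
```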
Furthermore, key design decisions are made to improve how the model works. These include adjusting the diffusion noise schedule (how "noise" is added and removed) so that generated videos are clearer, and training the model in stages so that it can better handle high-resolution, high-quality videos.
Example:
In the top row, direct text-to-video generation results in low-quality and inconsistent videos. In the second row, the factorized text-to-video method improves video quality and consistency. The third row shows the impact of not employing a zero terminal-SNR noise schedule (which controls the level of noise introduced during the generation process) during 512px generation, leading to notable inconsistencies. Finally, the bottom row shows the effect of fine-tuning the model from the second row on high-quality data, resulting in increased motion in the generated videos.
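For readers curious about the zero terminal-SNR schedule mentioned above, the sketch below shows one common way to obtain it: rescaling an existing noise schedule so that the final timestep is pure noise (signal-to-noise ratio exactly zero). This is a generic recipe from the diffusion literature, not necessarily EMU VIDEO's exact implementation.

```python
# One common way to enforce a zero terminal SNR: rescale sqrt(alpha_bar) so that the
# last timestep becomes pure noise. A generic recipe from the diffusion literature,
# not necessarily EMU VIDEO's exact code.
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    alphas_bar_sqrt = (1.0 - betas).cumprod(dim=0).sqrt()
    first, last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt -= last                                   # shift: terminal SNR becomes zero
    alphas_bar_sqrt *= first / (first - last)                 # rescale: first step is unchanged
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

betas = rescale_zero_terminal_snr(torch.linspace(1e-4, 2e-2, 1000))
print(betas[-1])  # tensor(1.) -> the final step destroys all signal, i.e. SNR = 0
```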
Extending video length with longer text: The model can take a short video and a text description and create a longer video, since it is extended to generate future frames conditioned on both past frames and a text prompt. This makes it possible to add new events, such as spilling beer or catching fire, so that they appear to happen after the original clip ends.
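A rough sketch of that extension idea is shown below, assuming a hypothetical `video_model(context_frames, prompt)` callable that returns the next chunk of frames; the name and signature are inventions for illustration, not EMU VIDEO's actual API.

```python
# Hypothetical sketch of extending a clip: repeatedly condition on the most recent
# frames plus a new text prompt and append the generated continuation. `video_model`
# and its signature are assumptions, not EMU VIDEO's actual API.
import torch

def extend_video(clip: torch.Tensor, new_prompt: str, video_model,
                 context_frames: int = 8, steps: int = 2) -> torch.Tensor:
    # clip: (batch, time, channels, height, width)
    for _ in range(steps):
        context = clip[:, -context_frames:]              # condition on the latest frames
        continuation = video_model(context, new_prompt)  # generate the next segment
        clip = torch.cat([clip, continuation], dim=1)    # append along the time axis
    return clip
```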
Ethical Horizon: Navigating the Challenges
Generating inappropriate content
We already know how this plays out: before there were text-to-video generators, there were text-to-image generators. With advancements in generative AI, there is a risk of creating inappropriate content, such as deepfakes or non-consensual imagery. For example, the AI avatar app Lensa generated sexualised and nude avatars of a woman based on her selfies, disproportionately affecting her as a woman of Asian heritage. This highlights the biases and stereotypes ingrained in AI models trained on internet-scraped data, such as the LAION-400M dataset, which is known to contain pornographic imagery, racist slurs, and harmful social stereotypes, underscoring an industry-wide problem of AI perpetuating sexism and racism (MIT Technology Review).
For more details, please refer to this.
AI-generated disinformation infiltrates global politics
AI-generated disinformation is increasingly influencing elections. In Argentina, candidates used fake media to attack their opponents; Slovakia saw deepfakes deployed around its elections; and in the US, Trump endorsed AI-generated memes perpetuating racist and sexist stereotypes. This trend, powered by how easy it has become to create realistic deepfakes, raises concerns about distinguishing reality online, especially in a politically charged atmosphere. Efforts to combat such disinformation, including watermarking techniques, face significant challenges, signalling a critical period ahead for democracy and information integrity.
For more details, please refer to this.
AI-generated copyright issues
Businesses using AI for content creation encounter legal issues regarding copyright and intellectual property due to AI’s reliance on copyrighted materials. Defining ownership and fair use of AI-generated content raises questions about legal responsibility. For instance, Getty Images sued Stability AI for using their copyrighted images without permission, highlighting the complex relationship between AI innovation and copyright law.
For more details, please refer to this.
Societal bias: racial and gender issues
Generative AI video systems, when trained on existing data, can adopt biases and prejudices from their training material. Consequently, these models may produce videos that perpetuate existing biases, stereotypes, or discriminatory content when converting text to images and later rendering them into videos.
Research shows that AI-generated images reflect significant gender and racial biases: men are roughly three times more likely than women to appear across job categories and dominate images of high-paying jobs, while women are more often associated with lower-wage roles. Racial biases are also evident, with lighter skin tones over-represented in depictions of high-paying professions and darker skin tones more common in images of lower-wage jobs. These patterns reveal the tendency of AI to perpetuate societal biases in occupational representation, highlighting the need for ethical AI practices.
For more details, please refer to this.
Conclusion
The advancements in Generative AI for video production in 2024, exemplified by EMU VIDEO and Genie, have revolutionised digital storytelling, blending creativity with technology. However, this leap forward brings ethical challenges, including concerns over copyright infringement, societal biases, and the potential for generating disinformation. Addressing these issues is imperative to ensure that the benefits of these innovations are realised without compromising ethical standards and societal values.
References
- Heikkilä, M., & Heaven, W. D. (2024, January 4). What’s next for AI in 2024. MIT Technology Review. https://www.technologyreview.com/2024/01/04/1086046/whats-next-for-ai-in-2024/
- Girdhar, R., Singh, M., Brown, A., et al. (2023). Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning. arXiv. https://doi.org/10.48550/arXiv.2311.10709
- Bruce, J., Dennis, M. D., Edwards, A., et al. (2024). Genie: Generative Interactive Environments. arXiv. https://doi.org/10.48550/arXiv.2402.15391
- Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv. https://doi.org/10.48550/arXiv.2307.01952
- Heaven, W. D. (2023, December 19). These six questions will dictate the future of generative AI. MIT Technology Review. https://www.technologyreview.com/2023/12/19/1084505/generative-ai-artificial-intelligence-bias-jobs-copyright-misinformation/
- Metz, R. (2022, May 25). The dark secret behind those cute AI-generated animal images. MIT Technology Review. https://www.technologyreview.com/2022/05/25/1052695/dark-secret-cute-ai-animal-images-dalle-openai-imagen-google/
- Talvola, E. (2023, July 3). The Ethics of AI Video Generators: Pros and cons. Animoto Blog. https://animoto.com/blog/video-tips/ai-video-generator-ethics
- Nicoletti, L., & Bass, D. (2023). Humans are biased. Generative AI is even worse. Bloomberg. https://www.bloomberg.com/graphics/2023-generative-ai-bias/
- Heikkilä, M. (2022, December 14). The viral AI avatar app Lensa undressed me — without my consent. MIT Technology Review. https://www.technologyreview.com/2022/12/12/1064751/the-viral-ai-avatar-app-lensa-undressed-me-without-my-consent/
- Bleyleben, M. (2024, January 25). Does AI really have a copyright problem? https://www.linkedin.com/pulse/does-ai-really-have-copyright-problem-maximilian-bleyleben-o0npf/