Knowledge Augmented FMs

RAG

In this session, our readings cover:

Required Readings:

Retrieval-Augmented Generation for AI-Generated Content: A Survey

Retrieval-Augmented Generation for Large Language Models: A Survey

More Readings:

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

A Comprehensive Study of Knowledge Editing for Large Language Models

Even More

A Survey of Table Reasoning with Large Language Models

Blog: Retrieval-Augmented Generation for AI-Generated Content: A Survey

Motivation and the RAG Process

Artificial Intelligence Generated Content (AIGC) refers to content such as text and code generated by large language models, images generated by DALL-E and Stable Diffusion, and videos generated by Sora. Despite its recent success, AIGC continues to face a number of challenges. For example, it is difficult to keep these models' knowledge up to date, because retraining is required before a model can answer based on new knowledge. In addition, these models struggle to provide long-tail knowledge, and they risk leaking private training data. Retrieval-Augmented Generation (RAG) serves as a mitigation to these problems because it relies on an adaptable data repository: new or long-tail knowledge can simply be added to the repository, and sensitive private data can be kept there rather than encoded into model parameters, so the above challenges can be straightforwardly alleviated.

The figure below shows the standard Retrieval-Augmented Generation process. The user's prompt (in any modality) is taken as input by both the retriever and the generator. The retriever has access to a data store and retrieves the data relevant to the prompt. The generator then takes both the user prompt and the retrieved data as input and generates the final result.
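As a concrete illustration, the sketch below wires a retriever and a generator together in the way the figure describes. The bag-of-words embedding, the in-memory index, and the `llm_generate` call are placeholders standing in for whatever embedding model, vector store, and foundation model a real system would use.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: hash words into a fixed-size bag-of-words vector.
    vec = np.zeros(256)
    for tok in text.lower().split():
        vec[hash(tok) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

class Retriever:
    """Stores documents with their embeddings and returns the top-k most similar."""
    def __init__(self, documents):
        self.docs = documents
        self.index = np.stack([embed(d) for d in documents])

    def retrieve(self, query: str, k: int = 3):
        scores = self.index @ embed(query)        # cosine similarity (unit vectors)
        return [self.docs[i] for i in np.argsort(-scores)[:k]]

def llm_generate(prompt: str) -> str:
    # Placeholder for a call to any foundation model (text, image, code, ...).
    return f"<generated answer conditioned on: {prompt[:60]}...>"

def rag_answer(query: str, retriever: Retriever) -> str:
    context = "\n".join(retriever.retrieve(query))
    # The generator sees both the user prompt and the retrieved data.
    return llm_generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

retriever = Retriever(["Sora was released by OpenAI in February 2024.",
                       "RAG augments a generator with an external data store."])
print(rag_answer("When was Sora released?", retriever))
```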

Taxonomy of RAG Foundations

The figure below shows the four major categories of RAG foundations: query-based RAG, latent representation-based RAG, logit-based RAG, and speculative RAG.

Taxonomy of RAG Enhancements

The performance of RAG can be further enhanced by the techniques shown in the figure below.

Taxonomy of RAG Applications

RAG is a general-purpose method that can be effectively applied in different domains. The figure below shows its application areas, ranging from question answering and code generation to text-to-3D and drug discovery.

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

What is Sora?

Sora is a text-to-video generative AI model released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and shows potential in simulating the physical world. The figure below is an example of Sora's input and output.

What can Sora do?

The implications of Sora extend far beyond mere video creation, offering transformative potential for tasks ranging from automated content generation to complex decision-making processes. The figure below gives an overview of practical deployment scenarios.

History of Generative Video

Overview

Sora is a diffusion transformer with flexible sampling dimensions, as shown in the figure below. It has three parts (a schematic sketch follows the list):

  1. A time-space compressor first maps the original video into latent space.
  2. A ViT then processes the tokenized latent representation and outputs the denoised latent representation.
  3. A CLIP-like conditioning mechanism receives LLM-augmented user instructions and potentially visual prompts to guide the diffusion model to generate styled or themed videos.
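The sketch below mirrors this three-part data flow. The function names, array shapes, and the pooling/random stand-ins are illustrative assumptions; they follow the review's description, not Sora's actual implementation.

```python
import numpy as np

def spacetime_compress(video: np.ndarray) -> np.ndarray:
    """Step 1: map (T, H, W, C) pixels to a smaller (t, h, w, d) latent (toy average pooling)."""
    T, H, W, C = video.shape
    return video.reshape(T // 2, 2, H // 8, 8, W // 8, 8, C).mean(axis=(1, 3, 5))

def encode_condition(prompt: str) -> np.ndarray:
    """Step 3: stand-in for a CLIP-like encoder over the (LLM-augmented) instruction."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.normal(size=(512,))

def denoise(latent: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Step 2: stand-in for the diffusion transformer; a real DiT removes noise iteratively."""
    return latent * 0.9

video = np.random.rand(16, 256, 256, 3)            # (frames, height, width, channels)
latent = spacetime_compress(video)                 # 1. time-space compression
cond = encode_condition("a corgi surfing a wave")  # 3. conditioning signal
denoised = denoise(latent, cond)                   # 2. DiT denoising
print(latent.shape, denoised.shape)
```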

Data Pre-processing

Variable Durations, Resolutions, Aspect Ratios

Sora can generate videos in flexible sizes and resolutions, ranging from 1920x1080 (widescreen) to 1080x1920 (vertical) and everything in between.

Sora is trained on data in its native sizes, which significantly improves composition and framing in the generated videos. A comparison between Sora and a model trained on uniformly cropped square videos demonstrates a clear advantage, as shown in the figure below: videos produced by Sora exhibit better framing, ensuring subjects are fully captured in the scene.

Unified Visual Representation

To effectively process diverse visual inputs, including images and videos with varying durations, resolutions, and aspect ratios, Sora patchifies videos: it first compresses them into a lower-dimensional latent space and then decomposes the representation into spacetime patches, as shown in the figure below.
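A minimal sketch of the second step, cutting a compressed latent into flattened spacetime patches (tokens), might look as follows; the latent shape and patch sizes here are illustrative assumptions.

```python
import numpy as np

def to_spacetime_patches(latent: np.ndarray, pt: int = 2, ph: int = 4, pw: int = 4):
    """Cut a (t, h, w, d) latent into flattened spacetime patches, one token per patch."""
    t, h, w, d = latent.shape
    return (latent
            .reshape(t // pt, pt, h // ph, ph, w // pw, pw, d)
            .transpose(0, 2, 4, 1, 3, 5, 6)
            .reshape(-1, pt * ph * pw * d))

latent = np.random.rand(8, 32, 32, 4)    # output of the video compression network
tokens = to_spacetime_patches(latent)
print(tokens.shape)                      # (8/2 * 32/4 * 32/4, 2*4*4*4) = (256, 128)
```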

Video Compression Network

Sora's video compression network (or visual encoder) reduces the dimensionality of the input data. It is typically built upon a VAE or a Vector Quantised VAE (VQ-VAE). Because it is challenging for a vanilla VAE to map visual data of any size into a unified, fixed-size latent space, two implementations have been proposed.

Spacetime Latent Patches

A remaining concern in the compression network is how to handle the variability in latent space dimensions (i.e., the number of latent feature chunks or patches from different video types) before feeding patches into the input layers of the diffusion transformer.

Patch n' Pack (PNP) is a possible solution. PNP packs multiple patches from different images into a single sequence, as shown in the figure below.
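Below is a minimal sketch of the packing idea, assuming variable-length patch sequences are concatenated into one fixed-length sequence with padding, plus an attention mask so tokens from different examples do not attend to each other. The token dropping and exact masking scheme of the real Patch n' Pack implementation may differ.

```python
import numpy as np

def pack_sequences(patch_seqs, max_len: int):
    """Pack variable-length patch sequences into one padded sequence plus an attention mask."""
    d = patch_seqs[0].shape[1]
    packed = np.zeros((max_len, d))
    example_id = np.full(max_len, -1)             # which example each token belongs to (-1 = padding)
    pos = 0
    for i, seq in enumerate(patch_seqs):
        n = len(seq)
        if pos + n > max_len:                     # drop sequences that no longer fit
            break
        packed[pos:pos + n] = seq
        example_id[pos:pos + n] = i
        pos += n
    # Tokens may only attend to tokens from the same example, and never to padding.
    attn_mask = (example_id[:, None] == example_id[None, :]) & (example_id[:, None] >= 0)
    return packed, attn_mask

seqs = [np.random.rand(60, 128), np.random.rand(200, 128), np.random.rand(90, 128)]
packed, mask = pack_sequences(seqs, max_len=512)
print(packed.shape, mask.shape)                   # (512, 128) (512, 512)
```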

Modeling

Image Diffusion Transformer

DiT and U-ViT are among the first works to employ vision transformers for latent diffusion models. DiT employs a multi-head self-attention layer and a pointwise feed-forward network, interlaced with layer norm and scaling layers. DiT incorporates conditioning via adaptive layer norm (AdaLN) with an additional zero-initialized MLP layer, which initializes each residual block as an identity function and thus greatly stabilizes the training process.
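The sketch below illustrates this AdaLN-Zero conditioning in a simplified DiT block: a conditioning MLP produces shift, scale, and gate parameters, and its final layer is zero-initialized so every residual branch starts as the identity. Layer sizes and the attention/MLP internals are simplified assumptions, not DiT's exact architecture.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """A simplified DiT block: self-attention + MLP, conditioned via AdaLN-Zero."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Conditioning MLP: produces shift/scale/gate for both sub-blocks (6 * dim values).
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)       # zero-init => gates start at 0,
        nn.init.zeros_(self.ada[-1].bias)         # so each residual branch begins as an identity

    def forward(self, x, cond):
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)

block = AdaLNZeroBlock(dim=256)
tokens = torch.randn(2, 64, 256)                  # (batch, patches, dim)
cond = torch.randn(2, 256)                        # timestep/text conditioning embedding
print(block(tokens, cond).shape)                  # torch.Size([2, 64, 256])
```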

Video Diffusion Transformer

Imagen Video, developed by Google Research, utilizes a cascade of diffusion models, consisting of 7 sub-models that perform text-conditional video generation, spatial super-resolution, and temporal super-resolution, to transform textual prompts into high-definition videos, as shown in the figure below.

Some points worth noting:

Language Instruction Following

Another question is: How does Sora follow user instructions?

Prompt Engineering

Text Prompt

Prompt engineering can leverage a model's natural language understanding ability to decode complex instructions and render them into cohesive, lively, and high-quality video narratives. The figure below is an example.

Image Prompt

An image prompt serves as a visual anchor for the to-be-generated video's content. Using image prompts allows Sora to convert static images into dynamic, narrative-driven videos by leveraging both visual and textual information. The figure below is an example.

Video Prompt

Works like Moonshot and Fast-Vid2Vid demonstrate that a good video prompt needs to be specific yet flexible, so that the model gets a clear direction and objectives.

Limitations

A Comprehensive Study of Knowledge Editing for Large Language Models

Large Language Models (LLMs) are the maestros of modern text generation, strikingly mimicking the nuances of human communication. Yet, their brilliance comes with a challenge – the heavyweight computational cost of their expansive learning capacity. As our world shifts, so must our models; their knowledge is a race against time, continuously needing updates to stay accurate and relevant. Enter the realm of knowledge editing – a promising avenue where the agility of model modifications is not just a desire but a necessity for applications demanding precision post-training. This paper journeys through the emerging landscape of knowledge editing techniques, offers a fresh benchmark for evaluating their efficacy, and invites us to peer deeper into the cognitive framework of LLMs, setting the stage for innovations with the groundbreaking EasyEdit framework. We stand on the cusp of an era where the adaptability of AI could redefine its role across industries.

Knowledge Editing

Knowledge editing aims to efficiently modify an LLM's behavior within a specific domain while preserving overall performance across other inputs. For an original model θ, knowledge k, and knowledge editing function F, the post-edited model θ′ is defined as θ′ = F(θ, k). Knowledge editing operations fall into three types (a toy sketch follows the list below):

  1. Knowledge Insertion

  2. Knowledge Modification

  3. Knowledge Erasure
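To make the definition θ′ = F(θ, k) concrete, the toy sketch below treats the model's knowledge as an explicit fact store and implements F for the three operation types listed above. Real knowledge-editing methods modify model parameters rather than a lookup table; this is purely illustrative.

```python
from copy import deepcopy

# Toy illustration of theta' = F(theta, k): the "model" is a dict of (subject, relation) -> object.
# Real methods edit LLM weights; this only mirrors the three operation types.

def edit(theta: dict, k: tuple, op: str) -> dict:
    subject, relation, obj = k
    theta_prime = deepcopy(theta)                 # the original model is left untouched
    if op == "insert":                            # add knowledge the model never had
        theta_prime[(subject, relation)] = obj
    elif op == "modify":                          # overwrite outdated or wrong knowledge
        assert (subject, relation) in theta_prime
        theta_prime[(subject, relation)] = obj
    elif op == "erase":                           # remove undesired or private knowledge
        theta_prime.pop((subject, relation), None)
    return theta_prime

theta = {("UK", "prime_minister"): "Boris Johnson"}
theta_prime = edit(theta, ("UK", "prime_minister", "Rishi Sunak"), "modify")
print(theta_prime)
```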

Benchmark Data: KnowEdit

Six datasets on knowledge editing are curated, encompassing a range of editing types, i.e., fact manipulation, sentiment manipulation, and hallucination generation.

Knowledge Editing Evaluation

Edit Success: also termed reliability; the average accuracy of the model on the edit cases themselves.

Portability: whether the edited model can correctly address the downstream effects of an edit.

Locality: the edited model should not change its behavior on irrelevant, out-of-scope examples.

Generative Capacity: the generalization and generation ability of the model after editing; also termed 'fluency'.
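A rough sketch of how these metrics reduce to averaged accuracies over different probe sets (the edit cases, portability probes, and out-of-scope examples); the exact probe construction in KnowEdit is more involved than this.

```python
def accuracy(model, probes):
    """Fraction of (prompt, expected) pairs the model answers as expected."""
    return sum(model(p) == expected for p, expected in probes) / max(len(probes), 1)

def evaluate_edit(edited_model, edit_cases, portability_probes, locality_probes):
    return {
        "edit_success": accuracy(edited_model, edit_cases),          # reliability on the edits themselves
        "portability":  accuracy(edited_model, portability_probes),  # downstream effects of the edit
        "locality":     accuracy(edited_model, locality_probes),     # unrelated answers stay unchanged
    }

# Usage with a trivial stand-in "model" (a function from prompt to answer):
model = lambda prompt: {"Who is the UK PM?": "Rishi Sunak"}.get(prompt, "unknown")
print(evaluate_edit(model,
                    edit_cases=[("Who is the UK PM?", "Rishi Sunak")],
                    portability_probes=[("Where does the UK PM live?", "10 Downing Street")],
                    locality_probes=[("What is 2+2?", "unknown")]))
```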

Error and Case Analysis

Limitations of Knowledge Editing

A Survey of Table Reasoning with Large Language Models

Introduction to Table Reasoning

Table reasoning aims to generate accurate answers from tables based on users' requirements. The table reasoning task improves the efficiency of obtaining and processing data from massive numbers of tables.
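For example, a table-reasoning instance pairs a table with a user requirement and expects an answer grounded in that table. Below is an illustrative instance; the field names and values are made up for illustration, not taken from any benchmark.

```python
# An illustrative table-reasoning instance: the model must answer from the table.
instance = {
    "table": {
        "header": ["City", "Population (millions)", "Country"],
        "rows": [["Tokyo", 37.4, "Japan"],
                 ["Delhi", 31.0, "India"],
                 ["Shanghai", 27.8, "China"]],
    },
    "question": "Which city in the table has the largest population?",
    "answer": "Tokyo",
}
```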

The Rise of LLMs and their Advantages

Traditional methods relied on rule-based systems or neural networks. With their vast knowledge and language understanding capabilities, LLMs excel at table reasoning.

There are some key advantages of LLMs in table reasoning:

Techniques for Improving Performance in the LLM Era

The authors present several techniques for improving performance in the LLM era:

For Supervised Fine-tuning:

For Result Ensemble:

For In-context Learning:

One Example of In-context Learning: ODIS

The figure above shows an example prompt for 2-shot in-domain text-to-SQL.

Two in-domain demonstrations are presented prior to the test question.
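A sketch of how such a 2-shot in-domain prompt could be assembled is below; the schema, demonstrations, and comment-based formatting are illustrative assumptions, not the exact ODIS prompt template.

```python
def build_text_to_sql_prompt(schema: str, demos: list, question: str) -> str:
    """Assemble a few-shot text-to-SQL prompt: schema, in-domain demos, then the test question."""
    parts = [f"-- Database schema:\n{schema}\n"]
    for q, sql in demos:                               # the two in-domain demonstrations come first
        parts.append(f"-- Question: {q}\n{sql}\n")
    parts.append(f"-- Question: {question}\nSELECT")   # the model completes the SQL
    return "\n".join(parts)

schema = "CREATE TABLE singer(singer_id INT, name TEXT, country TEXT, age INT);"
demos = [
    ("How many singers are there?", "SELECT COUNT(*) FROM singer;"),
    ("List the names of singers from France.", "SELECT name FROM singer WHERE country = 'France';"),
]
print(build_text_to_sql_prompt(schema, demos, "What is the average age of singers?"))
```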

For Instruction Design:

One Example of Instruction Design: DATER

(Decompose evidence And questions for effective Table-basEd Reasoning)

For Step-by-step Reasoning:

One Example of Step-by-step Reasoning: Chain-of-Table

Future Research Directions