Domain Centered FMs

DomainAdapt

In this session, our readings cover:

Required Readings:

Large Language Models for Software Engineering: A Systematic Literature Review

More Readings:

Large language models generate functional protein sequences across diverse families

Large Language Models in Law: A Survey

ChemLLM: A Chemical Large Language Model

FunSearch: Making new discoveries in mathematical sciences using Large Language Models

Transforming the future of music creation

Segment Anything

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

BloombergGPT: A Large Language Model for Finance

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

Blog: In this session, our blog covers:

Large Language Models for Software Engineering: A Systematic Literature Review

1     Overview

1.1   Software Engineering

  1. SE is a discipline focused on the development, implementation, and maintenance of software systems.
  2. The utilization of LLMs in SE stems from the observation that many SE challenges can be effectively reframed as data, code, or text analysis tasks.

1.2   Main Contributions

  1. It covers 229 papers published between 2017 and 2023.
  2. It summarizes usage and trends of different LLM categories within the SE domain.
  3. It describes the data processing stages.
  4. It discusses the optimizers and evaluation metrics used.
  5. It analyzes key applications of LLMs in SE encompassing a diverse range of 55 specific SE tasks, grouped into six core SE activities.
  6. It presents key challenges and potential research directions.

2     What LLMs have been employed?

2.1   Models Distribution

  1. The collected papers employ more than 50 different LLMs for SE tasks.
  2. These models fall into three categories based on their underlying architecture: encoder-only, encoder-decoder, and decoder-only LLMs.
  3. Encoder-only models: BERT is referenced in 41 of the papers, and its variants are also widely employed.
  4. Encoder-decoder models: there are fewer models and applications; CodeT5 is the most popular.
  5. Decoder-only models: Codex is used most frequently.
  6. Models specialized for code-related tasks are the most popular overall, because they have shown efficacy in tasks requiring a nuanced understanding of entire code snippets, which is central to software engineering. The sketch below contrasts the two dominant architecture styles.
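To make the architectural split concrete, here is a minimal sketch using Hugging Face `transformers` pipelines. The checkpoints `roberta-base` and `gpt2` are small, publicly available stand-ins for BERT-style and Codex-style models (an assumption made so the example runs out of the box), not the models surveyed.

```python
# Contrast of the two dominant architecture styles via transformers pipelines.
from transformers import pipeline

# Encoder-only (BERT-style): bidirectional context, suited to understanding
# tasks such as defect classification or clone detection.
fill_mask = pipeline("fill-mask", model="roberta-base")
print(fill_mask("A unit test should <mask> exactly one behavior.")[0]["token_str"])

# Decoder-only (Codex-style): left-to-right generation, suited to code
# completion and synthesis.
generate = pipeline("text-generation", model="gpt2")
print(generate("def fibonacci(n):", max_new_tokens=40)[0]["generated_text"])
```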
2.2   Architecture Trends Over Time

  1. Emergence of new architectures in 2021: decoder-only and encoder-decoder models first appear in the collected studies in 2021.
  2. Diversity of LLM architectures in 2022: 2022 saw a significant increase in diversity, with more varied LLM architectures represented.
  3. Dominance of the decoder-only architecture in 2023: 2023 signaled a strong shift toward decoder-only LLMs.
  4. The number of studies applying LLMs to software engineering continues to grow.
  5. Focus and resources are shifting toward the decoder-only architecture as the primary approach.

3     What types of SE datasets have been used in existing LLM4SE studies?

  1. There are 5 categories based on data types: code-based, text-based, graph-based, software repository-based, and combined data types.
  2. Most studies use text-based datasets, which account for 104 of the collected papers.
  3. Prompt datasets are the most common among text-based datasets, as prompt engineering is widely utilized.
  4. Source code is the most abundant data type in code-based datasets, since source code serves as the foundation of any software project.
  5. There is a noticeable scarcity of graph-based datasets. Exploring them could be important for addressing complex code scenarios, since graphs better capture the structural relationships and dependencies in code (see the sketch below).
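As a small illustration of the structural information a graph-based dataset can capture, the sketch below derives parent-child AST edges from a snippet using Python's standard `ast` module; real graph datasets typically add control-flow and data-flow edges on top of this.

```python
# Turn source code into a simple graph of parent-child AST edges.
import ast

def ast_edges(source: str):
    """Return (parent_type, child_type) pairs for every edge in the AST."""
    tree = ast.parse(source)
    return [
        (type(parent).__name__, type(child).__name__)
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
    ]

code = "def add(a, b):\n    return a + b\n"
for edge in ast_edges(code):
    print(edge)  # e.g., ('Module', 'FunctionDef'), ('Return', 'BinOp'), ...
```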

4     What techniques are used to optimize and evaluate LLM4SE?

  1. Fine-tuning is the most widely used optimization technique in the collected studies, appearing in 87 research works; this signifies its dominance in adapting pre-trained models to specific downstream tasks.
  2. Among learning-rate optimization algorithms, Adam stands out with 25 occurrences. It combines adaptive per-parameter learning rates with momentum, facilitating faster convergence and reducing the risk of getting stuck in local minima during training; a minimal fine-tuning sketch follows this list.
  3. Prompt engineering has proven particularly advantageous for providing task-relevant knowledge and enhancing LLMs' versatility and efficacy across code intelligence tasks.
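Below is a minimal sketch of an Adam-based fine-tuning loop in PyTorch; the tiny classifier head and random tensors are placeholders for a pre-trained LLM and a real labeled SE dataset.

```python
# Skeleton of fine-tuning with the Adam optimizer in PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # adaptive LR + momentum
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(32, 768)      # stand-in for encoder outputs
labels = torch.randint(0, 2, (32,))  # stand-in for labels (e.g., buggy / clean)

for epoch in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()  # per-parameter adaptive update
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```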

5     What SE tasks have been efficiently addressed by LLMs?

  1. Based on the six phases of the Software Development Life Cycle (SDLC), the tasks are grouped into requirements engineering, software design, software development, software quality assurance, software maintenance, and software management.
  2. Software development has the highest number of studies, underscoring the primary focus on using LLMs to enhance coding and development processes.
  3. Software maintenance tasks account for 24.89% of the research share, highlighting the significance of LLMs in aiding software updates and improvements.
  4. Based on the type of problem addressed, the studies are classified into generation, classification, recommendation, and regression tasks.
  5. The majority of studies (64.34%) center on generation tasks, showing the significance of LLMs in producing code or text.
  6. Another 24.48% fall under classification tasks, indicating the relevance of LLMs in categorizing software elements.

The subsections below detail the distribution of SE tasks over the six SE activities.

5.1   SE Activity 1: Requirements Engineering

5.1.1 Anaphoric Ambiguity

5.2   SE Activity 2: Software Design

5.2.1 Rapid Prototyping

5.2.2 Traceability Automation

5.2.3 Software Specification Synthesis

Natural Language Specification: Users can upload photos to their profile, but only JPG and PNG files are allowed. Each photo must be less than 5MB in size.

Formal Specification:
∀Photo (upload(Photo) → (fileType(Photo, JPG) ∨ fileType(Photo, PNG)))
∀Photo ∀Size ((upload(Photo) ∧ fileSize(Photo, Size)) → Size < 5MB)
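The same specification can also be rendered as an executable check, a useful sanity test for LLM-synthesized specifications. This is a minimal sketch; the `Photo` fields are illustrative names, not from the survey.

```python
# Executable version of the formal specification above.
from dataclasses import dataclass

ALLOWED_TYPES = {"JPG", "PNG"}
MAX_SIZE_MB = 5

@dataclass
class Photo:
    file_type: str
    size_mb: float

def may_upload(photo: Photo) -> bool:
    """upload(Photo) requires an allowed file type and size below 5 MB."""
    return photo.file_type in ALLOWED_TYPES and photo.size_mb < MAX_SIZE_MB

assert may_upload(Photo("JPG", 4.2))
assert not may_upload(Photo("GIF", 1.0))  # wrong file type
assert not may_upload(Photo("PNG", 6.0))  # too large
```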

5.3   SE Activity 3: Software Development

5.3.1 Code Generation with LLMs

5.3.2 Control Flow Graph Generation with LLMs

5.4   SE Activity 4: Software Testing

5.4.1 Test Generation

5.4.2 Failure-Inducing Test Identification

5.5   SE Activity 5: Software Maintenance

5.5.1 Program Repair with LLMs

5.6   SE Activity 6: Software Management

5.6.1 Effort Estimation

6     Summary

7     Challenges

Exploring the Impact of Large Language Models (LLMs) on Bioengineering

1     Motivation

Understanding biological processes has applications in medicine, biotechnology, bioinformatics, and the environmental sciences.

2     Basic Terms

3     AlphaFold

Principles that Govern the Folding of Protein Chains: a protein's amino acid sequence should fully determine its structure.

4     Pre-training process

The control tag can be a (partial) 3D structure of the protein, a protein family, or a specific function of the target protein.
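As a toy illustration of such conditioning (the tag name and format below are assumptions for illustration, not the paper's actual encoding), a training example can simply prepend the control tag to the sequence so the model learns p(sequence | tag):

```python
# Build a control-tag-conditioned training example (illustrative format).
def build_training_example(tag: str, sequence: str) -> str:
    return f"<{tag}> {sequence}"

# Hypothetical tag and a short amino acid fragment, purely for illustration.
print(build_training_example("lysozyme", "KVFGRCELAAAMKRHGLDNYRG"))
# -> "<lysozyme> KVFGRCELAAAMKRHGLDNYRG"
```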

5     Protein Objective

Protein generation objectives can be categorized into three types, according to whether the target protein already exists and what its target function is. Each objective may call for a different model structure.

6     Topology of Protein Design

Protein generation models can be categorized into three types (a toy sketch of the first follows):
  1. Sequence-based models: generate the amino acid sequence directly.
  2. Sequence-label models: additionally incorporate labels (e.g., a target function).
  3. Structure-based models: generate the 3D structure.
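The toy sketch below illustrates the sequence-based category: autoregressive sampling over the 20 standard amino acids, with a uniform distribution standing in for a trained model's learned conditionals.

```python
# Toy "sequence-based" generation: sample amino acids one at a time.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def sample_sequence(length: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    seq = []
    for _ in range(length):
        # A real model would condition on the prefix: p(next | seq so far).
        seq.append(rng.choice(AMINO_ACIDS))
    return "".join(seq)

print(sample_sequence(30))
```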

7     A Genomic Foundation Model

This model is trained at the nucleotide level (a smaller granularity than proteins).
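Because the nucleotide vocabulary is just {A, C, G, T}, inputs to such models are often tokenized as overlapping k-mers to pack more context into each token. A minimal sketch (the choice k = 3 is an assumption):

```python
# Overlapping k-mer tokenization of a DNA string.
def kmer_tokenize(dna: str, k: int = 3):
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]

print(kmer_tokenize("ATGCGTAC"))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```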

8     A Single-Cell Foundation Model

This model is trained at the single-cell level (a larger granularity than proteins).

9     Design of Full-atom Ligand-binding Protein Pockets

Drug design can benefit from protein generation.

10     Protein Structure Generation

This model is based on a diffusion framework.
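As a reminder of the machinery involved, here is a minimal sketch of the diffusion forward (noising) process that such models learn to invert at generation time; the 1-D coordinates and noise levels are toy stand-ins for protein backbone frames and a real noise schedule.

```python
# Forward diffusion step: x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*noise.
import math
import random

def forward_diffuse(x0, alpha_bar, seed=0):
    rng = random.Random(seed)
    return [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * rng.gauss(0, 1)
            for x in x0]

coords = [0.5, -1.2, 3.0]                       # toy 1-D "structure"
print(forward_diffuse(coords, alpha_bar=0.9))   # mostly signal
print(forward_diffuse(coords, alpha_bar=0.1))   # mostly noise
```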

11     Molecular to Genome

This model is based on a dilated CNN architecture.
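The appeal of dilation is that stacking convolutions with doubling dilation rates grows the receptive field exponentially while keeping depth and parameters modest, which is how such models span long genomic sequences. A minimal PyTorch sketch (channel sizes are arbitrary):

```python
# Stack of dilated 1-D convolutions with exponentially growing receptive field.
import torch
import torch.nn as nn

layers = []
for d in (1, 2, 4, 8):  # dilation doubles at each layer
    layers += [nn.Conv1d(16, 16, kernel_size=3, dilation=d, padding=d), nn.ReLU()]
model = nn.Sequential(*layers)

x = torch.randn(1, 16, 1024)  # (batch, channels, sequence length)
print(model(x).shape)         # torch.Size([1, 16, 1024]); length preserved
# Receptive field per output: 1 + 2*(1 + 2 + 4 + 8) = 31 positions.
```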

12     ChemLLM

LLMs can be applied to specific domains in bioengineering given a well-curated training process.

Large Language Models in Law: A Survey

1     Overview

The following figure gives an overview of the survey.

2     Contributions

Main contributions of this survey:

3     Evolution of Judicial Technology

3.1   Characteristics of Traditional Judicial System

Looking at the traditional judicial system, in use since long before AI, we see a number of characteristics:

To use AI effectively in legal judgment, a large amount of legal big data is essential. However, the legal data that is available has several characteristics that make the task difficult:

The following figure shows the main characteristics of LLMs in Judiciary:

Some important use cases include:

4     Recent Applications

The following are ten popular legal LLMs that are examined by the survey. They are fine-tuned, mainly on question-answer legal data.

In August 2023, several institutions and universities developed a comprehensive evaluation system for legal AI systems. The evaluation system combines subjective and objective measures. There are four primary indicators:

There are also further subindicators for each category, which can be seen in the following figure:

5     Challenges

5.1   Defects in Datasets

Legal LLMs still face a variety of challenges when it comes to widespread and accurate use. Some important challenges to consider are:

1) Inadequate Data Acquisition

2) Inaccurate Interpretation of Legal Concepts

3) Dataset Characteristics

5.2   Shortcomings in Algorithms

1) Interpretability

2) Ethics, bias, and fairness

5.3.1 Neglecting Judicial Independence

  a) In terms of legal enforcement: it includes

  b) In terms of fact-finding: use of discretion, subjective judgment, experiential

Legal LLMs can lead judges to a) rely overly on AI and b) form preconceived notions.

For example, in assessing the compensation amount in civil litigation, judges can comprehensively consider factors such as the extent of the victim's financial loss and the defendant's ability to compensate. In contrast, the algorithms of legal LLMs struggle to measure the extent of such loss.

Legal LLMs can assist judges. However, they do not possess professional judicial experience and cannot independently decide cases.

5.3.2 Impact on Judicial System

Legal LLMs can restrain the subjective initiative of judges and the development of traditional trial systems, as reflected in:

1) Court idleness:

2) Crisis in the hierarchy of trial: legal AI systems affect the judicial process within the hierarchical court system.

For example, any party dissatisfied with a lower court's judgment can appeal to a higher court; but if both levels rely on the same legal AI system, the outcome is likely to remain the same, weakening the purpose of the appeal.

5.4   Issues Arising from Specific Judicial Practice

5.4.1   The lack of universality in applications

Legal LLMs often extract feature values from cases and search existing multidimensional datasets for similar cases in order to find an “optimal solution.” However, legal regulations vary across countries and regions, leading to inconsistent decision outcomes for the same case under different legal rules; the “optimal solution” proposed by the large model may therefore not apply to a particular case. A retrieval sketch follows.
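Here is a minimal sketch of that “search for similar cases” step, assuming a plain TF-IDF + cosine-similarity retriever (production legal systems use far richer case features and embeddings); the case texts are invented examples.

```python
# Retrieve the most similar past case for a query description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_cases = [
    "tenant sued landlord for withholding the security deposit",
    "driver found liable for negligence in a rear-end collision",
    "employee dismissed without notice claims severance pay",
]
query = "landlord refused to return the security deposit"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(past_cases + [query])
query_vec = matrix[len(past_cases)]                       # last row is the query
scores = cosine_similarity(query_vec, matrix[:len(past_cases)])[0]
best = scores.argmax()
print(f"Most similar case: {past_cases[best]!r} (score={scores[best]:.2f})")
```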

5.4.2   The lack of subjective thinking, emotions, and experience

Legal LLMs lack autonomous thinking abilities and professional experience, among other things. The judicial decision-making process is not merely single-layer logical reasoning; it also involves moral, ethical, and practical considerations within the legal system.

5.4.3   Contradiction with the presumption of innocence principle

Various systems already predict the probability of crimes before they occur: the COMPAS system for crime prediction and risk assessment, PredPol for iteratively calculating potential crime locations, and the PRECOBS system in Germany for burglary prevention and violent crime prediction.

Figure: a futuristic system that apprehends people based on their probability of committing a crime.

5.5   Ethical Views Impacting Human Society

5.5.1   Disregard for human subjectivity:

Human subjectivity is susceptible to algorithmic bullying.

5.5.2   Misleading user comments:

In testing certain LLMs, such as ChatGPT, AI has displayed behaviors such as inducing users to divorce, making inappropriate comments, and even encouraging users to disclose personal privacy or engage in illegal activities.

5.5.3   Ethical value consistency:

There may be situations where AI misleads or harms human interests.

6   Future Directions

6.1   Data and Infrastructure

6.2   Algorithm Level

6.3   Dealing with Traditional Judiciary

6.4   Judicial Practice

7   Conclusions

This paper synthesized various technologies and ideas regarding the opportunities, challenges, and recommendations for the application of AI in the judicial field.

  REFERENCES

https://arxiv.org/abs/2308.10620
https://arxiv.org/abs/2312.03718
https://arxiv.org/abs/2306.15794
https://arxiv.org/abs/2402.06852
https://www.nature.com/articles/s41586-021-03819-2
https://www.nature.com/articles/s41587-023-02115-w
https://www.nature.com/articles/s41587-022-01618-2
https://www.nature.com/articles/s41587-024-02127-0
https://www.biorxiv.org/content/10.1101/2024.02.10.579791v2
https://www.biorxiv.org/content/10.1101/2024.02.25.581968v1
https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1
https://www.biorxiv.org/content/10.1101/2023.01.11.523679v1
https://www.biorxiv.org/content/10.1101/2024.02.27.582234v1