Commit bb4d7285 authored by Lea Kyveli Chrysanthopoulou's avatar Lea Kyveli Chrysanthopoulou

Add full text project plan

parent aca57227
<h1 style="text-align: center"> 
SWP 2023/24 Research Plan 
</h1>

<div style="text-align: center">
Chris Pracht, Finn Hillengass, Lea Kyveli Chrysanthopoulou <br>
12.11.2023
</div>

## Objective


Our project aims to leverage different prompting architectures to extract Task-Oriented Dialogue (TOD) datasets from Large Language Models (LLMs) that are available free of charge, e.g., Llama-2 (7B, 13B or 70B). The aim is to build high-quality TOD datasets at low cost, which could then be used to train smaller chatbots, enhancing their performance. To achieve this, we intend to deploy a variety of automated metrics, such as GRUEN, DEAM, GRADE and FactScore. Based on the performance of the prompting architectures on these metrics, we then intend to optimize and refine them as needed.

## Flow Chart

<!-- TODO: Add Flow Chart -->

## Methods

We generate our TODs via variations of the prompt architecture introduced by Labruna et al. (2023), inspired by the insertion of commonsense knowledge through knowledge triplets as detailed by Kim et al. (2023).

We initially perform Dialogue Generation with the **One Shot Approach**, i.e., asking one LLM to generate both the conversational turns of the user and those of the system. In the initial prompt, the model is given a triplet of dialogue states that the generated dialogue is to include, e.g., (hotel, Italian, expensive). The dialogue is then generated as described above.
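As an illustration, a minimal prompt builder for this One Shot setup might look as follows; the template wording and the function name are our own placeholders, not a fixed part of the plan:

```python
# Hypothetical sketch of a One Shot generation prompt built from a
# dialogue-state triplet. The exact wording is illustrative only.

def build_one_shot_prompt(domain, food, price_range):
    """Assemble a prompt asking one LLM to play both user and system."""
    return (
        "Generate a complete task-oriented dialogue between a user and a "
        "booking system. Alternate 'User:' and 'System:' turns.\n"
        f"The dialogue must involve the domain '{domain}', "
        f"the cuisine '{food}', and the price range '{price_range}'."
    )

prompt = build_one_shot_prompt("hotel", "Italian", "expensive")
print(prompt)
```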

The **labelling of the dialogue states** of the generated dialogue can happen in two ways, the more suitable of which we will decide on during the course of our project:
    
1. The labels are generated together with the dialogue 
2. The dialogue is generated first and then annotated by the model

We furthermore want to implement **Retrieval Augmented Generation** (RAG): based on the triplet of requirements described above (hotel, Italian, expensive), the model should fetch an option fulfilling these parameters from a knowledge base.
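The retrieval step can be sketched as a simple lookup over an in-memory knowledge base; the entries and field names below are invented placeholders for the MultiWOZ-derived base we will actually build:

```python
# Minimal sketch of the RAG retrieval step: given the requirement triplet,
# fetch matching entries. All entries here are made up for illustration.

KNOWLEDGE_BASE = [
    {"domain": "restaurant", "food": "Italian", "pricerange": "expensive", "name": "Caffe Uno"},
    {"domain": "restaurant", "food": "Italian", "pricerange": "cheap", "name": "Pizza Hut City Centre"},
    {"domain": "hotel", "food": None, "pricerange": "expensive", "name": "University Arms"},
]

def retrieve(domain, food, pricerange):
    """Return all entries fulfilling the requested slot values."""
    return [
        entry for entry in KNOWLEDGE_BASE
        if entry["domain"] == domain
        and (food is None or entry["food"] == food)
        and entry["pricerange"] == pricerange
    ]

matches = retrieve("restaurant", "Italian", "expensive")
```

In the actual pipeline this lookup would be replaced by a vector-store query (e.g., via LangChain), but the contract stays the same: slot constraints in, candidate entities out.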

Finally, if there is time, we want to try a **Multi-Agent Approach**, in which two models simulate a dialogue with each other, one taking the role of the system and the other the role of the user in the Task-Oriented Dialogue.

## Models

While selecting the LLM for our project, one of our concerns was for it to be open source, i.e., available to download and host ourselves, so that we would not have to pay for an API or interact with it manually over its web interface, as would be the case with a GPT model.

Of the open-source models available, we decided on Llama-2 [(Touvron et al. 2023)](#8-touvron-hugo-louis-martin-kevin-stone-peter-albert-amjad-almahairi-yasmine-babaei-nikolay-bashlykov-et-al-2023-llama-2-open-foundation-and-fine-tuned-chat-models-arxiv-httparxivorgabs230709288) in its fine-tuned chat versions. We were able to run the 7B version on the CLuster and locally. By utilizing both GPU nodes of the CLuster, we should be able to run Llama-2-13B. Should we get access to the bwForCluster, even the 70B model would be usable for inference.

We chose Llama-2, as it performs very well compared to other open-source models on a number of benchmarks (see [figure 1](#figure-1-llama-2-overall-performance-on-grouped-academic-benchmarks-compared-to-open-source-base-models-from-the-paper-by-touvron-et-al-2023)).

![](llama_comparison_other_models.png) 
#### Figure 1: Llama-2: overall performance on grouped academic benchmarks compared to open-source base models (from the paper by [Touvron et al. 2023](#8-touvron-hugo-louis-martin-kevin-stone-peter-albert-amjad-almahairi-yasmine-babaei-nikolay-bashlykov-et-al-2023-llama-2-open-foundation-and-fine-tuned-chat-models-arxiv-httparxivorgabs230709288))

Additionally, a lot of emphasis was placed on respectful and non-discriminatory language use when fine-tuning the Llama-2 models into their chat versions through **Reinforcement Learning from Human Feedback** (RLHF). We consider this to be of great importance when generating data points as potential training data for other chatbots (our [objective](#objective)). This is why we preferred Llama-2 over **Mistral-7B** [(Jiang et al. 2023)](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825), even though Mistral-7B in its base version outperforms the Llama-2-7B base version, and sometimes even the Llama-2-13B base version, on a number of benchmarks (see [figure 2](#figure-2-comparison-of-mistral-7b-with-llama-from-the-paper-by-jiang-et-al-2023)).

![](comparison_mistral_llama.png)
#### Figure 2: Comparison of Mistral 7B with Llama (from the paper by [Jiang et al. 2023](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825))

If we have time, we might attempt a comparison of the performance on our metrics with Mistral-7B as well and explore the adaptability of each model to the dialogue generation tasks.



## Dataset

As far as datasets go, we will use the **MultiWOZ 2.2** [(Zang et al. 2020)](#9-zang-xiaoxue-abhinav-rastogi-srinivas-sunkara-raghav-gupta-jianguo-zhang-and-jindong-chen-2020-multiwoz-22--a-dialogue-dataset-with-additional-annotation-corrections-and-state-tracking-baselines-in-proceedings-of-the-2nd-workshop-on-natural-language-processing-for-conversational-ai-edited-by-tsung-hsien-wen-asli-celikyilmaz-zhou-yu-alexandros-papangelis-mihail-eric-anuj-kumar-iñigo-casanueva-and-rushin-shah-109–17-online-association-for-computational-linguistics-httpsdoiorg1018653v12020nlp4convai-113) dataset as a foundational resource. This entails using the provided dialogues as gold-standard reference dialogues, as well as extracting the knowledge-graph triplets with a script, both for the prompts and to build the knowledge base. We chose MultiWOZ 2.2 because it is a widely used TOD dataset with over 10,000 dialogues annotated for dialogue states as well as for the values the slots take. Additionally, version 2.2 significantly reduced the noise present in earlier versions by correcting erroneous annotations and user utterances.
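The triplet-extraction script could, for instance, flatten the `domain-slot` keys of a MultiWOZ-2.2-style dialogue state into (domain, slot, value) triplets. The sample state below is hand-written; in the real files such states are nested inside the per-turn frame annotations:

```python
# Sketch of extracting (domain, slot, value) triplets from a hand-written
# MultiWOZ-2.2-style slot_values mapping ("domain-slot" -> list of values).

sample_state = {
    "restaurant-food": ["italian"],
    "restaurant-pricerange": ["expensive"],
}

def state_to_triplets(slot_values):
    """Split 'domain-slot' keys and pair them with each annotated value."""
    triplets = []
    for key, values in slot_values.items():
        domain, slot = key.split("-", 1)
        for value in values:
            triplets.append((domain, slot, value))
    return triplets

triplets = state_to_triplets(sample_state)
```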

We will also leverage the prompting architecture described in the **paper by Labruna et al.** [(2023)](#4-labruna-tiziano-sofia-brenna-andrea-zaninello-and-bernardo-magnini-2023-unraveling-chatgpt-a-critical-analysis-of-ai-generated-goal-oriented-dialogues-and-annotations-arxiv-httparxivorgabs230514556), as well as use the dialogues they generated (if available) as preliminary data for the first implementation of our metrics, or potentially to compare to the dialogues generated by us at the end of our project.

## Evaluation Metrics

An essential part of our project is the set of **evaluation metrics** we intend to deploy. They fall into two main categories: **automated metrics** and **manual assessment**.

Our **automated metrics** are the following: We will utilise **GRUEN** [(Zhu and Suma 2020)](#10-zhu-wanzheng-and-suma-bhat-2020-gruen-for-evaluating-linguistic-quality-of-generated-text-arxiv-httparxivorgabs201002498) to assess the linguistic quality of the generated dialogues. Specifically, this metric is designed to provide separate scores for the **grammaticality**, **non-redundancy**, **focus**, as well as **structure and coherence** of the generated utterances. 

Further information regarding **coherence** is provided by **GRADE** [(Huang et al. 2020)](#2-huang-lishan-zheng-ye-jinghui-qin-liang-lin-and-xiaodan-liang-2020-grade-automatic-graph-enhanced-coherence-metric-for-evaluating-open-domain-dialogue-systems-arxiv-httpsdoiorg1048550arxiv201003994) and **DEAM** [(Ghazarian et al. 2022)](#1-ghazarian-sarik-nuan-wen-aram-galstyan-and-nanyun-peng-2022-deam-dialogue-coherence-evaluation-using-amr-based-semantic-manipulations-arxiv-httpsdoiorg1048550arxiv220309711). **GRADE** specifically evaluates topic coherence, i.e., whether the transitions between topics are sufficiently coherent and natural rather than too abrupt. **DEAM** tests for subtler forms of incoherence through AMR-based semantic manipulations. As both of these metrics were initially developed with open-domain dialogues in mind, we will attempt their implementation and then evaluate their relevance to our task throughout the course of our project.

In the section on our [methods](#methods) we outlined how we intend to label our generated dialogues with dialogue states. In order to evaluate the accuracy of these labels we will employ **Slot Accuracy** and **Joint-Goal Accuracy**.

<!-- potentially add formula for the accuracies -->
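As a sketch on toy data, the two accuracies could be computed as follows: slot accuracy scores each gold slot individually, while joint-goal accuracy requires the full predicted state of a turn to match the gold state exactly. The toy states below are invented for illustration:

```python
# Toy-data sketch of Slot Accuracy and Joint-Goal Accuracy over per-turn
# dialogue states (dicts mapping slot name to value).

def slot_accuracy(gold_states, pred_states):
    """Fraction of gold slots whose predicted value is correct."""
    correct = total = 0
    for gold, pred in zip(gold_states, pred_states):
        for slot, value in gold.items():
            total += 1
            correct += pred.get(slot) == value
    return correct / total

def joint_goal_accuracy(gold_states, pred_states):
    """Fraction of turns whose entire predicted state matches the gold state."""
    hits = sum(gold == pred for gold, pred in zip(gold_states, pred_states))
    return hits / len(gold_states)

gold = [{"food": "italian", "pricerange": "expensive"}, {"area": "centre"}]
pred = [{"food": "italian", "pricerange": "cheap"}, {"area": "centre"}]
# Slot accuracy: 2 of 3 gold slots correct; joint-goal: 1 of 2 turns correct.
```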

For the overall evaluation of **semantic similarity** of our generated dialogues to their reference dialogues from the MultiWOZ dataset, we will use **S3BERT** [(Opitz and Frank 2022)](#7-opitz-juri-and-anette-frank-2022-sbert-studies-meaning-representations-decomposing-sentence-embeddings-into-explainable-semantic-features-arxiv-httpsdoiorg1048550arxiv220607023). 
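Since the exact use of S3BERT's feature decomposition is still to be worked out, the sketch below shows only the generic core of such a comparison: the cosine similarity between two embedding vectors. The vectors are made up for illustration; in practice they would come from the sentence encoder:

```python
# Generic embedding comparison underlying the semantic-similarity evaluation:
# cosine similarity between a generated dialogue's embedding and its
# MultiWOZ reference's embedding. Vectors below are invented toy values.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

gen_emb = [0.2, 0.1, 0.7]     # embedding of a generated dialogue (toy)
ref_emb = [0.25, 0.05, 0.65]  # embedding of its reference dialogue (toy)
score = cosine_similarity(gen_emb, ref_emb)
```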

Our final automated metric is **FactScore** [(Min et al. 2023)](#6-min-sewon-kalpesh-krishna-xinxi-lyu-mike-lewis-wen-tau-yih-pang-wei-koh-mohit-iyyer-luke-zettlemoyer-and-hannaneh-hajishirzi-2023-factscore-fine-grained-atomic-evaluation-of-factual-precision-in-long-form-text-generation-arxiv-httpsdoiorg1048550arxiv230514251) with which we will evaluate our **Retrieval Augmented Generation**, i.e., the extent to which the model is able to retrieve the correct information to match with the requests posed by the (modelled) user. This metric is designed to evaluate the factuality of statements generated by LLMs against a knowledge-base.

Our **Manual Assessment** will be predominantly designed to evaluate the same linguistic categories as in **GRUEN** to provide good comparability. It will be divided into two categories:

1. Prompting an LLM (perhaps the same one we used for generating the data, or another, such as GPT-3.5 or GPT-4) to evaluate our generated data on the **GRUEN** categories, imitating an annotator. If the results of such LLM-based evaluation were comparable to those of a human annotator, that would be of great interest, as this kind of annotation is very costly when performed by humans.

1. Conducting a regular manual assessment by human annotators, for the **GRUEN** metrics as well.

We intend to measure inter-annotator agreement between the human annotators, the manual LLM annotation, and the automated GRUEN metric by calculating the Spearman and Pearson correlations between the different annotators. This lets us evaluate both the efficacy of the automated annotation and that of the manual LLM annotation, under the assumption that the human annotation is the gold standard.
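A sketch of this agreement analysis with `scipy` on made-up per-dialogue scores (all numbers below are purely illustrative, not real annotation results):

```python
# Correlating per-dialogue quality scores from three annotation sources.
# Human ratings are treated as the gold standard; the score lists are toy data.
from scipy.stats import pearsonr, spearmanr

human = [4.0, 3.0, 5.0, 2.0, 4.0]    # gold-standard human ratings (toy)
llm = [4.0, 3.5, 4.5, 2.0, 3.5]      # manual LLM annotation (toy)
gruen = [0.8, 0.6, 0.9, 0.4, 0.7]    # automated GRUEN scores (toy)

for name, scores in [("LLM", llm), ("GRUEN", gruen)]:
    r, _ = pearsonr(human, scores)
    rho, _ = spearmanr(human, scores)
    print(f"human vs {name}: Pearson={r:.2f}, Spearman={rho:.2f}")
```

Note that GRUEN scores live on a different scale than human ratings; correlation coefficients are scale-invariant, which is exactly why they suit this comparison.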
   
## Tools

The tools we will be using for our project are the following:

- **GitLab** for code management and collaboration.
- The **bwForCluster/Helix** and **CoLi-CLuster** as computational resources.
- **ChatGPT** potentially for manual LLM feedback.
- Potentially the [**Redis**](https://python.langchain.com/docs/integrations/vectorstores/redis) Database system for our knowledge base (this will be decided during the course of our project).
- **Docker** for the deployment of the database and the processing pipelines.
- [**LangChain**](https://python.langchain.com/docs/modules/agents/) as a LLM framework for RAG and Multi-Agent pipeline.

## Project Timeline

1. Preparation and Setup (TBD: Target Date): Parallel Processes
    - Set up knowledge database and API
        - Extract knowledge from WOZ slots into Common Knowledge (CK) for Prompt Injection (PI) and Specific Knowledge (SK) for Knowledge Graph Retrieval (KGR)
        - Import into $DB
        - Embed datapoints with SBERT
        - Implement LangChain Retrieval Augmented Generation (RAG) adapter
    - Select and prepare models and access to computing power (CLuster & bwForCluster)
        - Evaluate memory requirements and hardware availability
        - Select model
        - Test workflows
    - Implement automated metrics on preliminary data
2. Join Parallel Processes (TBD: Target Date)
    - Generate dialogues using varied methods.
        - One Shot approach
        - RAG approach
    - Apply evaluation metrics for quality assessment.
        - Develop processing pipeline for automated evaluation
        - Plotting of results
3. Comparison and Optimization (TBD: Target Date)
    - Analyze results from different methods based on the metrics.
    - Potentially adjust prompting architecture based on metric insights.
4. Given Spare Time:
    - Multi-Agent Approach

## References

#### 1. Ghazarian, Sarik, Nuan Wen, Aram Galstyan, and Nanyun Peng. 2022. ‘DEAM: Dialogue Coherence Evaluation Using AMR-Based Semantic Manipulations’. arXiv. https://doi.org/10.48550/arXiv.2203.09711.
#### 2. Huang, Lishan, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. ‘GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems’. arXiv. https://doi.org/10.48550/arXiv.2010.03994.
#### 3. Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, et al. 2023. ‘Mistral 7B’. arXiv. https://doi.org/10.48550/arXiv.2310.06825.
#### 4. Labruna, Tiziano, Sofia Brenna, Andrea Zaninello, and Bernardo Magnini. 2023. ‘Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations’. arXiv. http://arxiv.org/abs/2305.14556.
#### 5. Kim, Hyunwoo, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, et al. 2023. ‘SODA: Million-Scale Dialogue Distillation with Social Commonsense Contextualization’. arXiv. http://arxiv.org/abs/2212.10465.
#### 6. Min, Sewon, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. ‘FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long Form Text Generation’. arXiv. https://doi.org/10.48550/arXiv.2305.14251.
#### 7. Opitz, Juri, and Anette Frank. 2022. ‘SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features’. arXiv. https://doi.org/10.48550/arXiv.2206.07023.
#### 8. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. ‘Llama 2: Open Foundation and Fine-Tuned Chat Models’. arXiv. http://arxiv.org/abs/2307.09288.
#### 9. Zang, Xiaoxue, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. ‘MultiWOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines’. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, edited by Tsung-Hsien Wen, Asli Celikyilmaz, Zhou Yu, Alexandros Papangelis, Mihail Eric, Anuj Kumar, Iñigo Casanueva, and Rushin Shah, 109–17. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.nlp4convai-1.13.
#### 10. Zhu, Wanzheng, and Suma Bhat. 2020. ‘GRUEN for Evaluating Linguistic Quality of Generated Text’. arXiv. http://arxiv.org/abs/2010.02498.