Commit 9d9ac388 authored by pracht

Ladies first, Name fix, Auto format

parent 0087a914
@@ -5,13 +5,12 @@ SWP 2023/24 Research Plan
</div>

<div align="center">
Lea Kyveli Chrysanthopoulou, Christoph Pracht, Finn Hillengass<br>
12.11.2023
</div>

## Objective


Our project aims to leverage different prompting architectures to extract Task-Oriented Dialogue (TOD) datasets from Large Language Models (LLMs) that are available free of charge, e.g., Llama-2 (7B, 13B or 70B). The goal is to build high-quality TOD datasets at low cost, which could then be used to train smaller chatbots, enhancing their performance. To that end, we intend to deploy a variety of automated metrics, such as GRUEN, DEAM, GRADE and FactScore, and to optimize and refine the prompting architectures based on their performance on these metrics.
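As a rough illustration of what such a prompting setup could look like, here is a minimal sketch that assembles a zero-shot generation prompt from knowledge-base triplets. The template wording and the `build_tod_prompt` helper are our own illustrative assumptions, not a finalized design; in practice the prompt would be sent to a Llama-2 chat model.

```python
# Sketch only: the prompt template and helper name are illustrative
# assumptions, not part of the finalized prompting architecture.

def build_tod_prompt(domain: str, triplets: list[tuple[str, str, str]]) -> str:
    """Turn (entity, slot, value) triplets into a dialogue-generation prompt."""
    facts = "\n".join(f"- {e}: {s} = {v}" for e, s, v in triplets)
    return (
        f"You are simulating a task-oriented dialogue in the {domain} domain.\n"
        f"Ground every system answer in these facts:\n{facts}\n"
        "Write a dialogue between USER and SYSTEM in which the user's goal "
        "is fulfilled. Label each turn with its speaker."
    )

prompt = build_tod_prompt(
    "restaurant",
    [("Pizza Hut Fen Ditton", "area", "east"),
     ("Pizza Hut Fen Ditton", "pricerange", "moderate")],
)
print(prompt)
```

Keeping the knowledge-base facts explicit in the prompt is what the later factuality metrics (e.g., FactScore) would check the generated dialogue against.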

## Visualisation of Project Plan
@@ -42,17 +41,17 @@ Of the open-source models available, we decided on LLama-2 [(Touvron et al. 2023
We chose Llama-2, as it performs very well in comparison to other open-source models on a number of benchmarks (see [figure 1](#figure-1-llama-2-overall-performance-on-grouped-academic-benchmarks-compared-to-open-source-base-models-from-the-paper-by-touvron-et-al-2023)).

![](llama_comparison_other_models.png)

##### Figure 1: Llama-2: overall performance on grouped academic benchmarks compared to open-source base models (from the paper by [Touvron et al. 2023](#8-touvron-hugo-louis-martin-kevin-stone-peter-albert-amjad-almahairi-yasmine-babaei-nikolay-bashlykov-et-al-2023-llama-2-open-foundation-and-fine-tuned-chat-models-arxiv-httparxivorgabs230709288))

Additionally, a lot of emphasis has been placed on respectful and non-discriminatory language use in the fine-tuning of the Llama-2 base models into their chat versions through **Reinforcement Learning from Human Feedback** (RLHF). We consider this to be of great importance when generating data points as potential training data for other chatbots (our [objective](#objective)). This is why we preferred Llama-2 over **Mistral-7B** [(Jiang et al. 2023)](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825), even though Mistral-7B in its base version outperforms the Llama-2-7B base version, and sometimes even the Llama-2-13B base version, on a number of benchmarks (see [figure 2](#figure-2-comparison-of-mistral-7b-with-llama-from-the-paper-by-jiang-et-al-2023)).

![](comparison_mistral_llama.png)

##### Figure 2: Comparison of Mistral 7B with Llama (from the paper by [Jiang et al. 2023](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825))

Time permitting, we may also evaluate Mistral-7B on our metrics and compare how well each model adapts to the dialogue-generation task.



## Dataset

For data, we will use the **MultiWOZ 2.2** [(Zang et al. 2020)](#9-zang-xiaoxue-abhinav-rastogi-srinivas-sunkara-raghav-gupta-jianguo-zhang-and-jindong-chen-2020-multiwoz-22--a-dialogue-dataset-with-additional-annotation-corrections-and-state-tracking-baselines-in-proceedings-of-the-2nd-workshop-on-natural-language-processing-for-conversational-ai-edited-by-tsung-hsien-wen-asli-celikyilmaz-zhou-yu-alexandros-papangelis-mihail-eric-anuj-kumar-iñigo-casanueva-and-rushin-shah-109–17-online-association-for-computational-linguistics-httpsdoiorg1018653v12020nlp4convai-113) dataset as a foundational resource. This entails using the provided dialogues as gold-standard references, as well as extracting knowledge-graph triplets with a script, both for the prompts and to build the knowledge base. We chose MultiWOZ 2.2 because it is a widely used TOD dataset with over 10,000 dialogues annotated for dialogue states and slot values. Additionally, version 2.2 significantly reduces the noise present in earlier versions by correcting erroneous annotations and user utterances.
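The triplet-extraction step could be sketched roughly as follows. The turn structure below is a simplified stand-in for the real MultiWOZ 2.2 schema, and `extract_triplets` is an illustrative helper, not the actual script.

```python
# Sketch: flatten MultiWOZ-2.2-style dialogue states into
# (domain, slot, value) triplets. The turn dicts below are a
# simplified stand-in for the real dataset schema.

def extract_triplets(turns: list[dict]) -> list[tuple[str, str, str]]:
    triplets = []
    for turn in turns:
        for frame in turn.get("frames", []):
            state = frame.get("state", {})
            # MultiWOZ 2.2 slot names are "domain-slot" strings.
            for slot, values in state.get("slot_values", {}).items():
                domain, _, slot_name = slot.partition("-")
                for value in values:
                    triplets.append((domain, slot_name, value))
    return triplets

turns = [{
    "frames": [{
        "service": "restaurant",
        "state": {"slot_values": {"restaurant-area": ["east"],
                                  "restaurant-food": ["italian"]}},
    }],
}]
print(extract_triplets(turns))
```

The same triplets would serve double duty: inserted into generation prompts, and stored in the knowledge base for grounding checks.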
@@ -117,19 +116,27 @@ The tools we will be using for our project are the following:
3. Comparison and Optimization (Target Date: 23.01) (Chris & Finn & Lea)
   - Analyze results from different methods based on the metrics
   - Potentially adjust prompting architecture based on metric insights
4. Given Spare Time:
   - Multi-Agent Approach
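The comparison step in milestone 3 could, in its simplest form, look like the following sketch: rank the prompting architectures by their mean score across the automated metrics. The score values are made-up placeholders, and how the individual metrics would be weighted against each other is still an open question.

```python
# Sketch: rank prompting architectures by mean metric score.
# All numbers are placeholders; the metric names come from our plan,
# but the weighting scheme (here: a plain mean) is an assumption.

from statistics import mean

scores = {
    "zero_shot":        {"GRUEN": 0.71, "DEAM": 0.64, "GRADE": 0.58, "FactScore": 0.69},
    "chain_of_thought": {"GRUEN": 0.74, "DEAM": 0.70, "GRADE": 0.61, "FactScore": 0.72},
}

# Best-performing architecture first.
ranking = sorted(scores, key=lambda arch: mean(scores[arch].values()), reverse=True)
print(ranking)
```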

## References

##### 1. Ghazarian, Sarik, Nuan Wen, Aram Galstyan, and Nanyun Peng. 2022. ‘DEAM: Dialogue Coherence Evaluation Using AMR-Based Semantic Manipulations’. arXiv. https://doi.org/10.48550/arXiv.2203.09711.

##### 2. Huang, Lishan, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. ‘GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems’. arXiv. https://doi.org/10.48550/arXiv.2010.03994.

##### 3. Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, et al. 2023. ‘Mistral 7B’. arXiv. https://doi.org/10.48550/arXiv.2310.06825.

##### 4. Labruna, Tiziano, Sofia Brenna, Andrea Zaninello, and Bernardo Magnini. 2023. ‘Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations’. arXiv. http://arxiv.org/abs/2305.14556.

##### 5. Kim, Hyunwoo, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, et al. 2023. ‘SODA: Million-Scale Dialogue Distillation with Social Commonsense Contextualization’. arXiv. http://arxiv.org/abs/2212.10465.

##### 6. Min, Sewon, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. ‘FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long Form Text Generation’. arXiv. https://doi.org/10.48550/arXiv.2305.14251.

##### 7. Opitz, Juri, and Anette Frank. 2022. ‘SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features’. arXiv. https://doi.org/10.48550/arXiv.2206.07023.

##### 8. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. ‘Llama 2: Open Foundation and Fine-Tuned Chat Models’. arXiv. http://arxiv.org/abs/2307.09288.

##### 9. Zang, Xiaoxue, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. ‘MultiWOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines’. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, edited by Tsung-Hsien Wen, Asli Celikyilmaz, Zhou Yu, Alexandros Papangelis, Mihail Eric, Anuj Kumar, Iñigo Casanueva, and Rushin Shah, 109–17. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.nlp4convai-1.13.

##### 10. Zhu, Wanzheng, and Suma Bhat. 2020. ‘GRUEN for Evaluating Linguistic Quality of Generated Text’. arXiv. http://arxiv.org/abs/2010.02498.