
.gitignore (new file, mode 100644, +2 −0)

+.DS_Store
+.venv
\ No newline at end of file
+2 −2
@@ -83,7 +83,7 @@ Code: https://github.com/pluslabnlp/accent
- Coherence with focus on sentiment
- Event common-sense reasoning

-### GRADE [OPEN]
+### GRADE [BOTH]

Code: https://github.com/li3cmz/GRADE

@@ -103,7 +103,7 @@ Code: https://github.com/li3cmz/GRADE
- Similarity score
- ! Needs Gold-Standard?
- AF: By "gold standard" you mean a reference dialogue here, yes?
- LFC: Yes, only if we generate the "same" dialogues as in the WOZ dataset, with the same problem statements and the same information (KG triplets), can the generated dialogues be compared directly.

### FactScore

+32 −24
@@ -5,13 +5,12 @@ SWP 2023/24 Research Plan
</div>

<div align ="center">
-Chris Pracht, Finn Hillengass, Lea Kyveli Chrysanthopoulou <br>
+Lea Kyveli Chrysanthopoulou, Christoph Pracht, Finn Hillengass<br>
12.11.2023
</div>

## Objective


Our project aims to leverage different prompting architectures to extract Task-Oriented Dialogue (TOD) datasets from Large Language Models (LLMs) that are available free of charge, e.g., Llama-2 (7B, 13B, or 70B). The aim is to build high-quality TOD datasets at low cost, which could then be used as training data for smaller chatbots, enhancing their performance. To assess quality, we intend to deploy a variety of automated metrics, such as GRUEN, DEAM, GRADE, and FactScore. Based on how the prompting architectures perform on these metrics, we then intend to optimize and refine them as needed.
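The generate-evaluate-refine loop described above could be sketched as follows. Everything here is a placeholder stub: `generate_dialogue` stands in for an actual Llama-2 call, and `dummy_metric` for real metrics such as GRUEN, DEAM, GRADE, or FactScore; the function and template names are hypothetical.

```python
# Sketch of the planned generate -> evaluate -> refine loop.
# generate_dialogue and dummy_metric are hypothetical stubs, not the
# project's actual LLM call or metrics.

def generate_dialogue(prompt_template: str, slots: dict) -> list[str]:
    """Stand-in for an LLM call: returns a dialogue as a list of turns."""
    return [
        f"User: I need a {slots['domain']} {slots['intent']}.",
        "System: Sure, I can help with that.",
    ]

def dummy_metric(dialogue: list[str]) -> float:
    """Placeholder quality score (real project: GRUEN, DEAM, GRADE, FactScore)."""
    return sum(len(turn.split()) for turn in dialogue) / len(dialogue)

def best_architecture(templates: list[str], slots: dict) -> str:
    """Return the prompt template whose generated dialogue scores highest."""
    scored = {t: dummy_metric(generate_dialogue(t, slots)) for t in templates}
    return max(scored, key=scored.get)

slots = {"domain": "restaurant", "intent": "booking"}
winner = best_architecture(["template_a", "template_b"], slots)
```

In the real pipeline, the winning architecture would be refined and re-scored iteratively rather than selected in a single pass.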

## Methods
@@ -30,6 +29,7 @@ slots rein, sind die supported oder nicht

slots: constrained by the domain
the following relations, the following intents -->

2. The dialogue is generated first and then annotated by the model
<!-- try this one first, then the other; keep them separate, since we essentially prompt with the slots -->
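The two strategies could be prototyped with prompt templates along these lines; the wording, domain, and slot names are illustrative, not the project's final prompts:

```python
# Illustrative prompt templates for the two generation strategies.
# Domain and slot names are hypothetical examples, not final choices.

SLOTS = {"domain": "restaurant", "slots": ["area", "food", "pricerange"]}

# Strategy 1: slots are part of the prompt, and the model generates an
# annotated dialogue in a single pass.
prompt_annotated = (
    f"Generate a task-oriented dialogue in the {SLOTS['domain']} domain.\n"
    f"Annotate every user turn with values for these slots: "
    f"{', '.join(SLOTS['slots'])}."
)

# Strategy 2: generate the dialogue first, then annotate it in a second pass.
prompt_generate = (
    f"Generate a task-oriented dialogue in the {SLOTS['domain']} domain."
)
prompt_annotate = (
    "Annotate the following dialogue with values for the slots "
    f"{', '.join(SLOTS['slots'])}:\n{{dialogue}}"
)
```

Keeping the two strategies as separate templates makes it straightforward to score them against each other on the same metrics later.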

@@ -58,17 +58,17 @@ Of the open-source models available, we decided on LLama-2 [(Touvron et al. 2023
We chose Llama-2, as it performs very well in comparison to other open-source models on a number of benchmarks (see [figure 1](#figure-1-llama-2-overall-performance-on-grouped-academic-benchmarks-compared-to-open-source-base-models-from-the-paper-by-touvron-et-al-2023)).

![](llama_comparison_other_models.png)

##### Figure 1: Llama-2: overall performance on grouped academic benchmarks compared to open-source base models (from the paper by [Touvron et al. 2023](#8-touvron-hugo-louis-martin-kevin-stone-peter-albert-amjad-almahairi-yasmine-babaei-nikolay-bashlykov-et-al-2023-llama-2-open-foundation-and-fine-tuned-chat-models-arxiv-httparxivorgabs230709288))

Additionally, a lot of emphasis has been placed on respectful and non-discriminatory language use in the fine-tuning of the Llama-2 base models into their chat versions through **Reinforcement Learning from Human Feedback** (RLHF). We consider this to be of great importance when generating data points as potential training data for other chatbots (our [objective](#objective)). This is why we preferred Llama-2 over **Mistral-7B** [(Jiang et al. 2023)](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825), even though Mistral-7B in its base version outperforms the Llama-2-7B base version, and sometimes even the Llama-2-13B base version, on a number of benchmarks (see [figure 2](#figure-2-comparison-of-mistral-7b-with-llama-from-the-paper-by-jiang-et-al-2023)).

![](comparison_mistral_llama.png)

##### Figure 2: Comparison of Mistral 7B with Llama (from the paper by [Jiang et al. 2023](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825))

If time permits, we might also compare Mistral-7B on our metrics and explore each model's adaptability to the dialogue generation tasks.



## Dataset

<!-- Datenbank wäre interessant ob da was kohärentes rauskommt, wenn wir da einen knowledge graph dazu machen; konsistenz der einzelnen dialoge-->
@@ -121,7 +121,7 @@ The tools we will be using for our project are the following:
        - Embed datapoints with SBERT
        - Implement LangChain Retrieval Augmented Generation (RAG) adapter
        <!-- langchain is contested, but let's see what fits -->
-        <!-- which kinds of intents follow a previous intent; check whether these intent sequences improve performance when supplied, neoforjay, networkx-->
+        <!-- which kinds of intents follow a previous intent; check whether these intent sequences improve performance when supplied, neo4j, networkx-->
    - Select and prepare models and access to computing power (Cluster & bwForCluster) (Finn)
        - Evaluate memory requirements and hardware availability
        - Select model
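The intent-sequence idea from the notes above (which intents tend to follow a given intent, and whether feeding those sequences back into the prompt helps) can be sketched with plain dictionaries before committing to neo4j or networkx; the dialogues and intent labels below are invented examples:

```python
from collections import Counter, defaultdict

# Count which intent follows which across annotated dialogues: a plain-dict
# stand-in for the transition graph that neo4j or networkx would hold.
# The dialogues and intent labels are invented examples.
dialogues = [
    ["greet", "request_info", "book", "confirm"],
    ["greet", "book", "confirm"],
    ["greet", "request_info", "request_info", "book"],
]

transitions = defaultdict(Counter)
for dialogue in dialogues:
    for prev, nxt in zip(dialogue, dialogue[1:]):
        transitions[prev][nxt] += 1

def likely_next(intent: str) -> str:
    """Most frequent follow-up intent, usable as extra prompt context."""
    return transitions[intent].most_common(1)[0][0]
```

Once the counts prove useful, the same transitions could be loaded into networkx (or neo4j) for richer queries, e.g. multi-step intent paths.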
@@ -138,19 +138,27 @@ The tools we will be using for our project are the following:
3. Comparison and Optimization (Target Date: 23.01) (Chris & Finn & Lea)
   - Analyze results from different methods based on the metrics
   - Potentially adjust prompting architecture based on metric insights
4. Given Spare Time:
    - Multi-Agent Approach

## References

##### 1. Ghazarian, Sarik, Nuan Wen, Aram Galstyan, and Nanyun Peng. 2022. ‘DEAM: Dialogue Coherence Evaluation Using AMR-Based Semantic Manipulations’. arXiv. https://doi.org/10.48550/arXiv.2203.09711.

##### 2. Huang, Lishan, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. ‘GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems’. arXiv. https://doi.org/10.48550/arXiv.2010.03994.

##### 3. Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, et al. 2023. ‘Mistral 7B’. arXiv. https://doi.org/10.48550/arXiv.2310.06825.

##### 4. Labruna, Tiziano, Sofia Brenna, Andrea Zaninello, and Bernardo Magnini. 2023. ‘Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations’. arXiv. http://arxiv.org/abs/2305.14556.

##### 5. Kim, Hyunwoo, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, et al. 2023. ‘SODA: Million-Scale Dialogue Distillation with Social Commonsense Contextualization’. arXiv. http://arxiv.org/abs/2212.10465.

##### 6. Min, Sewon, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. ‘FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long Form Text Generation’. arXiv. https://doi.org/10.48550/arXiv.2305.14251.

##### 7. Opitz, Juri, and Anette Frank. 2022. ‘SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features’. arXiv. https://doi.org/10.48550/arXiv.2206.07023.

##### 8. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. ‘Llama 2: Open Foundation and Fine-Tuned Chat Models’. arXiv. http://arxiv.org/abs/2307.09288.

##### 9. Zang, Xiaoxue, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. ‘MultiWOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines’. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, edited by Tsung-Hsien Wen, Asli Celikyilmaz, Zhou Yu, Alexandros Papangelis, Mihail Eric, Anuj Kumar, Iñigo Casanueva, and Rushin Shah, 109–17. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.nlp4convai-1.13.

##### 10. Zhu, Wanzheng, and Suma Bhat. 2020. ‘GRUEN for Evaluating Linguistic Quality of Generated Text’. arXiv. http://arxiv.org/abs/2010.02498.