Commit 578fcafc authored by Lea Kyveli Chrysanthopoulou

Make some text smaller

parent 926a8326
Of the open-source models available, we decided on Llama-2 [(Touvron et al. 2023)](#8-touvron-hugo-louis-martin-kevin-stone-peter-albert-amjad-almahairi-yasmine-babaei-nikolay-bashlykov-et-al-2023-llama-2-open-foundation-and-fine-tuned-chat-models-arxiv-httparxivorgabs230709288).
We chose Llama-2 because it performs very well compared to other open-source models on a number of benchmarks (see [figure 1](#figure-1-llama-2-overall-performance-on-grouped-academic-benchmarks-compared-to-open-source-base-models-from-the-paper-by-touvron-et-al-2023)).

![](llama_comparison_other_models.png) 
##### Figure 1: Llama-2: overall performance on grouped academic benchmarks compared to open-source base models (from the paper by [Touvron et al. 2023](#8-touvron-hugo-louis-martin-kevin-stone-peter-albert-amjad-almahairi-yasmine-babaei-nikolay-bashlykov-et-al-2023-llama-2-open-foundation-and-fine-tuned-chat-models-arxiv-httparxivorgabs230709288))

Additionally, in the fine-tuning of the Llama-2 base models into their chat versions, a lot of emphasis was placed on respectful and non-discriminatory language use through **Reinforcement Learning from Human Feedback** (RLHF). We consider this to be of great importance when generating data points as potential training data for other chatbots (our [objective](#objective)). This is why we preferred Llama-2 over **Mistral-7B** [(Jiang et al. 2023)](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825), even though Mistral-7B in its base version outperforms the Llama-2-7B base version, and sometimes even the Llama-2-13B base version, on a number of benchmarks (see [figure 2](#figure-2-comparison-of-mistral-7b-with-llama-from-the-paper-by-jiang-et-al-2023)).
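As a practical aside, using the RLHF-tuned chat variants for dialogue generation means wrapping each turn in the instruction template introduced with the Llama-2 release (`[INST]` / `<<SYS>>` markers). The sketch below only illustrates that prompt formatting; the helper name is ours, and model loading (e.g. via Hugging Face `transformers`) is omitted:

```python
# Minimal sketch of the Llama-2 chat prompt template for a single user
# turn. The template strings ([INST], <<SYS>>) follow the format
# published with the Llama-2 chat models; the function name is our own.

def format_llama2_prompt(user_message: str, system_prompt: str) -> str:
    """Wrap one user turn (plus a system prompt) in the Llama-2 chat template."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = format_llama2_prompt(
    user_message="Suggest a vegetarian dinner.",
    system_prompt="You are a helpful, respectful assistant.",
)
print(prompt)
```

The model's generated answer would then follow the closing `[/INST]` tag; multi-turn dialogues repeat the `[INST] … [/INST]` pattern with prior turns prepended.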

![](comparison_mistral_llama.png)
##### Figure 2: Comparison of Mistral 7B with Llama (from the paper by [Jiang et al. 2023](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825))

If time permits, we may also compare performance on our metrics with Mistral-7B and explore each model's adaptability to the dialogue generation task.


## References

##### 1. Ghazarian, Sarik, Nuan Wen, Aram Galstyan, and Nanyun Peng. 2022. ‘DEAM: Dialogue Coherence Evaluation Using AMR-Based Semantic Manipulations’. arXiv. https://doi.org/10.48550/arXiv.2203.09711.
##### 2. Huang, Lishan, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. ‘GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems’. arXiv. https://doi.org/10.48550/arXiv.2010.03994.
##### 3. Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, et al. 2023. ‘Mistral 7B’. arXiv. https://doi.org/10.48550/arXiv.2310.06825.
##### 4. Labruna, Tiziano, Sofia Brenna, Andrea Zaninello, and Bernardo Magnini. 2023. ‘Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations’. arXiv. http://arxiv.org/abs/2305.14556.
##### 5. Kim, Hyunwoo, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, et al. 2023. ‘SODA: Million-Scale Dialogue Distillation with Social Commonsense Contextualization’. arXiv. http://arxiv.org/abs/2212.10465.
##### 6. Min, Sewon, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. ‘FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long Form Text Generation’. arXiv. https://doi.org/10.48550/arXiv.2305.14251.
##### 7. Opitz, Juri, and Anette Frank. 2022. ‘SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features’. arXiv. https://doi.org/10.48550/arXiv.2206.07023.
##### 8. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. ‘Llama 2: Open Foundation and Fine-Tuned Chat Models’. arXiv. http://arxiv.org/abs/2307.09288.
##### 9. Zang, Xiaoxue, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. ‘MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines’. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, edited by Tsung-Hsien Wen, Asli Celikyilmaz, Zhou Yu, Alexandros Papangelis, Mihail Eric, Anuj Kumar, Iñigo Casanueva, and Rushin Shah, 109–17. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.nlp4convai-1.13.
##### 10. Zhu, Wanzheng, and Suma Bhat. 2020. ‘GRUEN for Evaluating Linguistic Quality of Generated Text’. arXiv. http://arxiv.org/abs/2010.02498.