Commit 578fcafc authored by Lea Kyveli Chrysanthopoulou

Make some text smaller

parent 926a8326
Of the open-source models available, we decided on Llama-2 [(Touvron et al. 2023)](#8-touvron-hugo-louis-martin-kevin-stone-peter-albert-amjad-almahairi-yasmine-babaei-nikolay-bashlykov-et-al-2023-llama-2-open-foundation-and-fine-tuned-chat-models-arxiv-httparxivorgabs230709288).
We chose Llama-2 because it performs very well compared to other open-source models on a number of benchmarks (see [figure 1](#figure-1-llama-2-overall-performance-on-grouped-academic-benchmarks-compared-to-open-source-base-models-from-the-paper-by-touvron-et-al-2023)).

![](llama_comparison_other_models.png) 
##### Figure 1: Llama-2: overall performance on grouped academic benchmarks compared to open-source base models (from the paper by [Touvron et al. 2023](#8-touvron-hugo-louis-martin-kevin-stone-peter-albert-amjad-almahairi-yasmine-babaei-nikolay-bashlykov-et-al-2023-llama-2-open-foundation-and-fine-tuned-chat-models-arxiv-httparxivorgabs230709288))

Additionally, in the fine-tuning of the Llama-2 base models into their chat versions, a lot of emphasis was placed on respectful and non-discriminatory language use through **Reinforcement Learning from Human Feedback** (RLHF). We consider this to be of great importance when generating data points as potential training data for other chatbots (our [objective](#objective)). This is why we preferred Llama-2 over **Mistral-7B** [(Jiang et al. 2023)](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825), even though Mistral-7B in its base version outperforms the Llama-2-7B base version, and sometimes even the Llama-2-13B base version, on a number of benchmarks (see [figure 2](#figure-2-comparison-of-mistral-7b-with-llama-from-the-paper-by-jiang-et-al-2023)).
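As a practical aside, using the RLHF-tuned chat variants for dialogue generation means wrapping each turn in the instruction template introduced with the Llama-2 release (`[INST]` / `<<SYS>>` markers). The sketch below only illustrates that prompt formatting; the helper name is ours, and model loading (e.g. via Hugging Face `transformers`) is omitted:

```python
# Minimal sketch of the Llama-2 chat prompt template for a single user
# turn. The template strings ([INST], <<SYS>>) follow the format
# published with the Llama-2 chat models; the function name is our own.

def format_llama2_prompt(user_message: str, system_prompt: str) -> str:
    """Wrap one user turn (plus a system prompt) in the Llama-2 chat template."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = format_llama2_prompt(
    user_message="Suggest a vegetarian dinner.",
    system_prompt="You are a helpful, respectful assistant.",
)
print(prompt)
```

The model's generated answer would then follow the closing `[/INST]` tag; multi-turn dialogues repeat the `[INST] … [/INST]` pattern with prior turns prepended.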

![](comparison_mistral_llama.png)
##### Figure 2: Comparison of Mistral 7B with Llama (from the paper by [Jiang et al. 2023](#3-jiang-albert-q-alexandre-sablayrolles-arthur-mensch-chris-bamford-devendra-singh-chaplot-diego-de-las-casas-florian-bressand-et-al-2023-mistral-7b-arxiv-httpsdoiorg1048550arxiv231006825))

If time permits, we may also compare performance on our metrics with Mistral-7B and explore each model's adaptability to the dialogue generation task.


## References

##### 1. Ghazarian, Sarik, Nuan Wen, Aram Galstyan, and Nanyun Peng. 2022. ‘DEAM: Dialogue Coherence Evaluation Using AMR-Based Semantic Manipulations’. arXiv. https://doi.org/10.48550/arXiv.2203.09711.
##### 2. Huang, Lishan, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. ‘GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems’. arXiv. https://doi.org/10.48550/arXiv.2010.03994.
##### 3. Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, et al. 2023. ‘Mistral 7B’. arXiv. https://doi.org/10.48550/arXiv.2310.06825.
##### 4. Labruna, Tiziano, Sofia Brenna, Andrea Zaninello, and Bernardo Magnini. 2023. ‘Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations’. arXiv. http://arxiv.org/abs/2305.14556.
##### 5. Kim, Hyunwoo, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, et al. 2023. ‘SODA: Million-Scale Dialogue Distillation with Social Commonsense Contextualization’. arXiv. http://arxiv.org/abs/2212.10465.
##### 6. Min, Sewon, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. ‘FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long Form Text Generation’. arXiv. https://doi.org/10.48550/arXiv.2305.14251.
##### 7. Opitz, Juri, and Anette Frank. 2022. ‘SBERT Studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features’. arXiv. https://doi.org/10.48550/arXiv.2206.07023.
##### 8. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. ‘Llama 2: Open Foundation and Fine-Tuned Chat Models’. arXiv. http://arxiv.org/abs/2307.09288.
##### 9. Zang, Xiaoxue, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. ‘MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines’. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, edited by Tsung-Hsien Wen, Asli Celikyilmaz, Zhou Yu, Alexandros Papangelis, Mihail Eric, Anuj Kumar, Iñigo Casanueva, and Rushin Shah, 109–17. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.nlp4convai-1.13.
##### 10. Zhu, Wanzheng, and Suma Bhat. 2020. ‘GRUEN for Evaluating Linguistic Quality of Generated Text’. arXiv. http://arxiv.org/abs/2010.02498.