Chris Pracht, Finn Hillengass, Lea Kyveli Chrysanthopoulou <br>
Our project aims to leverage different prompting architectures to extract Task Oriented Dialogue (TOD) datasets from Large Language Models (LLMs) that are available free of charge, e.g., Llama-2 (7B, 13B or 70B). The goal is to build high-quality TOD datasets at low cost, which could then be used to train smaller chatbots, enhancing their performance. To achieve this, we intend to deploy a variety of automated metrics, such as GRUEN, DEAM, GRADE and FactScore. Based on how the prompting architectures perform on these metrics, we then intend to optimize and refine them as needed.

## Methods
We generate our TODs via variations of the prompt architecture introduced by Labruna et al. (2022), inspired by the insertion of commonsense knowledge through knowledge triplets as detailed by Kim et al. (2023).
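To make the triplet-based prompting concrete, here is a minimal sketch of how knowledge triplets could be rendered into a generation prompt. The triplet values, slot names, and prompt wording are illustrative assumptions, not the exact prompts we will use.

```python
# Hypothetical sketch: embed (subject, relation, value) knowledge
# triplets into a dialogue-generation prompt. All concrete values
# and the prompt wording are invented placeholders.

def build_prompt(triplets, domain="restaurant"):
    """Render knowledge triplets into a TOD generation prompt."""
    facts = "\n".join(f"- ({s}, {r}, {v})" for s, r, v in triplets)
    return (
        f"You are generating a task-oriented dialogue in the {domain} domain.\n"
        f"The user's requirements are given as knowledge triplets:\n{facts}\n"
        "Write a dialogue between USER and SYSTEM that satisfies every triplet."
    )

triplets = [("restaurant", "food", "Italian"),
            ("restaurant", "pricerange", "expensive")]
prompt = build_prompt(triplets)
```

The same function could be reused across prompting variants, with only the instruction text and the triplet source changing.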
We initially perform Dialogue Generation with the **One Shot Approach**, i.e., a single example dialogue is provided in the prompt.
The **labelling of the dialogue states** on the generated dialogue can happen in two manners, the more suitable of which we will decide on during the course of our project:
1. The labels are generated together with the dialogue
<!-- generate the labels by hand once first, then already try out the metric;
check whether the dialogue actually contains the information;
slots go in: are they supported or not;
slots: constrained by the domain;
the following relations, the following intents -->
2. The dialogue is generated first and then annotated by the model
<!-- try this first, then the other; keep them separate, because we essentially prompt with the slots anyway -->
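Either way, the output should end up in a MultiWOZ-style format where each turn carries a dialogue state. The sketch below shows the target shape for option 2, where the state is attached after generation; the dialogue text, slot names, and the `label_turn` helper are illustrative assumptions.

```python
# Sketch of the annotation format we target (MultiWOZ-style dialogue
# states). `label_turn` is an invented stand-in for the step where the
# model would be prompted to annotate an already generated dialogue.

dialogue = [
    {"speaker": "USER", "text": "I need an expensive Italian restaurant."},
    {"speaker": "SYSTEM", "text": "Caffe Uno matches your criteria."},
]

def label_turn(turn, state):
    """Attach a copy of the cumulative dialogue state to a single turn."""
    return {**turn, "state": dict(state)}

state = {"restaurant-food": "Italian", "restaurant-pricerange": "expensive"}
labelled = [label_turn(turn, state) for turn in dialogue]
```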
We furthermore want to implement **Retrieval Augmented Generation** (RAG), i.e., based on the triplet of requirements described above (hotel, Italian, expensive), the model should fetch an option fulfilling these parameters from a knowledge base.
<!-- maybe use it for annotation first; the system accesses the database; when planning, make sure we have dialogues that involve such specific values; a system with a dynamic database -->
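A minimal sketch of that retrieval step, assuming the knowledge base is a list of attribute dictionaries (in the project it would be built from the MultiWOZ database files; the entries below are invented placeholders):

```python
# Toy knowledge base and constraint-based lookup. Entity names and
# attributes are invented; only the lookup pattern is the point.

KNOWLEDGE_BASE = [
    {"type": "hotel", "pricerange": "expensive", "name": "University Arms"},
    {"type": "restaurant", "food": "Italian", "pricerange": "expensive",
     "name": "Caffe Uno"},
]

def retrieve(constraints):
    """Return every entry whose attributes satisfy all constraints."""
    return [
        entry for entry in KNOWLEDGE_BASE
        if all(entry.get(slot) == value for slot, value in constraints.items())
    ]

matches = retrieve({"type": "restaurant", "food": "Italian",
                    "pricerange": "expensive"})
```

The retrieved entry would then be inserted into the system-side prompt so the model grounds its answer in the knowledge base rather than hallucinating a venue.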
Finally, if there is time, we want to try a **MultiAgent Approach**, in which two models simulate a dialogue with each other, one taking the part of the system and the other the part of the user in the Task Oriented Dialogue (TOD).
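The control loop for that setup could look roughly as follows. `generate` is a placeholder for a real LLM call (e.g., to a locally hosted Llama-2), and the role prompts are invented examples:

```python
# Skeleton of the multi-agent loop: two separately prompted "agents"
# alternate turns. `generate` is a stand-in for an actual model call.

def generate(role, history):
    # Placeholder: in practice, send ROLE_PROMPTS[role] plus the
    # dialogue history to the LLM and return its next utterance.
    return f"{role} utterance {len(history) + 1}"

ROLE_PROMPTS = {
    "USER": "You play only the USER in a task-oriented dialogue.",
    "SYSTEM": "You play only the SYSTEM in a task-oriented dialogue.",
}

def simulate(turns=4):
    history = []
    for i in range(turns):
        role = "USER" if i % 2 == 0 else "SYSTEM"
        history.append((role, generate(role, history)))
    return history
```

Keeping the two role prompts separate (rather than one model playing both sides) is what makes this a genuine two-agent simulation.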
## Visualisation of Project Plan

<!-- in the multi-agent prompts (the dialogue prompt), state that the model holds only one of the two roles;
two separate architectures;
and include a prompt example -->
## Models
While selecting the LLM for our project, one of our concerns was for it to be open source, i.e., available for download so that we can host it ourselves. That way, we neither have to pay for an API nor interact with the model manually through its web interface, as would be the case with a ChatGPT model.
If we have time, we might attempt a comparison of the performance on our metrics.
## Dataset
<!-- the database would be interesting: does anything coherent come out if we add a knowledge graph; consistency of the individual dialogues -->
As far as datasets go, we will use the **MultiWOZ 2.2** [(Zang et al. 2020)](#9-zang-xiaoxue-abhinav-rastogi-srinivas-sunkara-raghav-gupta-jianguo-zhang-and-jindong-chen-2020-multiwoz-22--a-dialogue-dataset-with-additional-annotation-corrections-and-state-tracking-baselines-in-proceedings-of-the-2nd-workshop-on-natural-language-processing-for-conversational-ai-edited-by-tsung-hsien-wen-asli-celikyilmaz-zhou-yu-alexandros-papangelis-mihail-eric-anuj-kumar-iñigo-casanueva-and-rushin-shah-109–17-online-association-for-computational-linguistics-httpsdoiorg1018653v12020nlp4convai-113) dataset as a foundational resource. This entails both using the provided dialogues as gold-standard reference dialogues and extracting the knowledge-graph triplets with a script, both for the prompts and to build the knowledge base. We chose MultiWOZ 2.2 because it is a widely used TOD dataset with over 10,000 dialogues annotated for dialogue states, as well as the values the slots take. Additionally, version 2.2 significantly reduced the noise present in earlier versions by correcting erroneous annotations and user utterances.
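The extraction script could walk each turn's annotated dialogue state and emit (domain, slot, value) triplets. The nesting below follows the MultiWOZ 2.2 schema-guided format as we understand it; the sample turn is a hand-made stand-in for a real data file:

```python
# Sketch: pull (domain, slot, value) triplets out of a MultiWOZ 2.2
# style turn. `sample_turn` is a minimal invented example of the
# nested frames/state/slot_values structure.

sample_turn = {
    "frames": [
        {"state": {"slot_values": {
            "restaurant-food": ["italian"],
            "restaurant-pricerange": ["expensive"],
        }}}
    ]
}

def extract_triplets(turn):
    """Collect (domain, slot, value) triplets from one annotated turn."""
    triplets = set()
    for frame in turn.get("frames", []):
        slot_values = frame.get("state", {}).get("slot_values", {})
        for slot, values in slot_values.items():
            # Slot names are "domain-slot", e.g. "restaurant-food".
            domain, slot_name = slot.split("-", 1)
            for value in values:
                triplets.add((domain, slot_name, value))
    return triplets
```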
We will also leverage the prompting architecture described by **Labruna et al.** [(2023)](#4-labruna-tiziano-sofia-brenna-andrea-zaninello-and-bernardo-magnini-2023-unraveling-chatgpt-a-critical-analysis-of-ai-generated-goal-oriented-dialogues-and-annotations-arxiv-httparxivorgabs230514556), as well as use the dialogues they generated (if available) as preliminary data for the first implementation of our metrics, or potentially to compare with the dialogues we generate at the end of our project.
The tools we will be using for our project are the following:
<!-- LangChain is controversial, but let's see what fits -->
<!-- which kinds of intents follow a previous intent; check whether providing these intent sequences improves performance; Neo4j, NetworkX -->
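The intent-sequence idea from the note above can be sketched with stdlib counters (a graph library such as NetworkX could hold the same transitions as a directed graph). The intent names and sequences are invented examples:

```python
# Count which intents follow which across annotated dialogues; these
# transition counts could later be fed into the prompts. Intent names
# below are illustrative placeholders.
from collections import Counter

def intent_transitions(intent_sequences):
    """Count (previous_intent, next_intent) pairs over all dialogues."""
    transitions = Counter()
    for sequence in intent_sequences:
        transitions.update(zip(sequence, sequence[1:]))
    return transitions

sequences = [["find_restaurant", "book_restaurant"],
             ["find_restaurant", "find_hotel", "book_hotel"]]
counts = intent_transitions(sequences)
```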
- Select and prepare models and access to computing power (Cluster & bwForCluster) (Finn)
- Evaluate memory requirements and hardware availability
- Select model
- Test workflows
- Implement automated metrics on preliminary data (Lea)
<!-- how well do the metrics already work on the reference dialogues -->
2. Join Parallel Processes (Target Date: 02.01)
- Generate dialogues using varied methods (Chris & Finn)