Commit 39e70f3b authored by finn's avatar finn

Add project plan

parent f8a4f22d
# Project Plan

## Objective

- Develop advanced dialogue datasets to enhance chatbot capabilities.
- Conduct automated quality assessments of generated dialogues.
- Compare various prompting architectures using specific metrics to optimize responses.

## Flow Chart

<!-- TODO: Add Flow Chart -->

## Methods

- **Dialogue Generation with One-Shot Approach**:
    - This involves asking one model to generate both the conversational turns of the user and the system and labeling the dialogue states.
    - *Decision Point*: During the project, we will determine whether to generate the labels with the dialogue or annotate post-generation.
- **Retrieval Augmented Generation (RAG)**:
    - Fetching correct information for the system from a knowledge base.
- **MultiAgent Approach (If Time Allows)**:
    - Employing two systems; one simulates the user and the other, the system in Task-Oriented Dialogue (TOD).


_Further notes on this: explain why these approaches will help generate better dialogues_
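The one-shot approach can be sketched as a single prompt template that asks the model for both sides of the conversation plus state labels. The instruction wording, slot names, and the in-context example below are hypothetical placeholders, not the final prompt design:

```python
# Sketch of a one-shot prompt for generating a labelled TOD dialogue.
# The wording, slot names, and the single in-context example are
# placeholders; the real prompt will be tuned during the project.

ONE_SHOT_TEMPLATE = """You are generating a task-oriented dialogue.
Write both the USER and SYSTEM turns for the domain: {domain}.
After each user turn, emit the dialogue state as JSON, e.g. {{"restaurant-area": "centre"}}.

Example:
USER: I need a cheap restaurant in the centre.
STATE: {{"restaurant-pricerange": "cheap", "restaurant-area": "centre"}}
SYSTEM: How about the Dojo Noodle Bar?

Now generate a new dialogue:
"""

def build_prompt(domain: str) -> str:
    """Fill the one-shot template for a given domain."""
    return ONE_SHOT_TEMPLATE.format(domain=domain)
```

Whether the `STATE:` lines are generated inline (as here) or in a second annotation pass is exactly the decision point noted above.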

## Models

- Llama-2
    - We chose Llama-2 for dialogue generation. We were able to run the 7B version both locally and on the CLuster. By utilizing both GPU nodes of the CLuster, we should be able to run Llama-2-13B, and should we get access to the bwForCluster, even the 70B model would be feasible for inference.
- If there is time, try a range of models including GPT-3.5, GPT-4 (if available), Falcon, and Mistral.
- Explore the adaptability of each model to dialogue generation tasks.
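The feasibility claims above follow from simple memory arithmetic. A rough sketch, assuming fp16 weights (2 bytes per parameter) and a ballpark 20% overhead for activations and KV cache (both assumptions, not measured values):

```python
def vram_estimate_gb(n_params_billion: float,
                     bytes_per_param: int = 2,
                     overhead: float = 1.2) -> float:
    """Rough inference-memory estimate in GB: weights in fp16
    (2 bytes/param) plus ~20% overhead for activations and KV cache.
    The overhead factor is an assumed ballpark, not a measurement."""
    return n_params_billion * bytes_per_param * overhead

# Llama-2-13B in fp16 comes out around 31 GB, which is why it needs
# both GPU nodes of the CLuster rather than a single one.
```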

_Further notes: write something about the model and why it is a good fit for our task, referring to its properties and the advantages we will gain from them_

## Dataset

- Use the Wizard of Oz (WOZ) dataset as a foundational resource. This entails using the provided dialogues as gold-standard reference dialogues, as well as extracting Knowledge-Graph triplets with a script, both for the prompts and for building the Knowledge Base.
- Labruna data, if available, as inspiration for the prompt architecture
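The triplet-extraction step can be sketched as follows; the nested `{domain: {slot: value}}` layout is an assumed MultiWOZ-style structure, and the exact field names will depend on the dataset release we use:

```python
def slots_to_triplets(state: dict) -> list[tuple[str, str, str]]:
    """Convert a WOZ-style dialogue state, e.g.
    {"restaurant": {"area": "centre", "food": "thai"}},
    into (entity, relation, value) triplets for the knowledge base.
    The nesting shown here is an assumed MultiWOZ-like layout."""
    triplets = []
    for domain, slots in state.items():
        for slot, value in slots.items():
            if value:  # skip unfilled slots
                triplets.append((domain, slot, value))
    return triplets
```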

## Evaluation Metrics

- Automated Metrics:
    - GRUEN for linguistic quality assessment.
        - Grammaticality
        - Non-redundancy
        - Focus
        - Structure and Coherence
    - GRADE (considered) for testing topic coherence, i.e., whether transitions between topics are abrupt or sufficiently coherent and natural
    - Slot Accuracy and Joint-Goal Accuracy to evaluate the quality of the dialogue state labels created by the model
    - S3BERT for semantic similarity.
    - FactScore for factual accuracy.
    - DEAM (considered) testing for subtler forms of incoherence through AMR-based semantic manipulations
- Manual Assessment:
    - Use a Large Language Model (LLM) to carry out the same quality checks as a human annotator would.
    - Conduct a separate manual assessment by human annotators.
    - Compute inter-annotator agreement between the automated metrics, the LLM assessments, and the human assessments.
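Slot Accuracy and Joint-Goal Accuracy have simple definitions, sketched below under the assumption that each turn's dialogue state is a flat slot-to-value dict:

```python
def slot_accuracy(pred: dict, gold: dict) -> float:
    """Fraction of gold slots whose predicted value matches exactly."""
    if not gold:
        return 1.0
    correct = sum(pred.get(slot) == value for slot, value in gold.items())
    return correct / len(gold)

def joint_goal_accuracy(pred_states: list[dict], gold_states: list[dict]) -> float:
    """Fraction of turns where the entire predicted state
    equals the gold state (all slots jointly correct)."""
    exact = sum(p == g for p, g in zip(pred_states, gold_states))
    return exact / len(gold_states)
```

Joint-Goal Accuracy is the stricter metric: one wrong slot makes the whole turn count as incorrect.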

## Tools

- Utilize GitLab for code management and collaboration.
- Use bwForCluster/Helix and CLuster as computational resources.
- Leverage ChatGPT for dialogue interaction simulations and potentially for manual LLM feedback
- Redis database system for the knowledge base
- Docker for deployment of the database and processing pipelines
- LangChain, an LLM framework, for the RAG and Multi-Agent pipelines
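Before wiring up Redis and LangChain, the retrieval step can be prototyped as plain cosine similarity over embedded datapoints. The toy vectors below stand in for SBERT embeddings, and the in-memory list stands in for the Redis store:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec: list[float], kb: list[tuple[str, list[float]]]) -> str:
    """Return the KB entry whose embedding is most similar to the query.
    In the real pipeline the vectors come from SBERT and live in Redis;
    here kb is just a list of (text, vector) pairs."""
    return max(kb, key=lambda item: cosine(query_vec, item[1]))[0]
```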

## Project Timeline

1. Preparation and Setup (TBD: Target Date): Parallel Processes
    - Set up knowledge database and API
        - Extract knowledge from WOZ slots into Common Knowledge (CK) for Prompt Injection (PI) and Specific Knowledge (SK) for Knowledge Graph Retrieval (KGR)
        - Import into $DB
        - Embed datapoints with SBERT
        - Implement LangChain Retrieval Augmented Generation (RAG) adapter
    - Select and prepare models and access to computing power (CLuster & bwForCluster)
        - Evaluate memory requirements and hardware availability
        - Select model
        - Test workflows
    - Implement automated metrics on preliminary data
2. Join Parallel Processes (TBD: Target Date)
    - Generate dialogues using varied methods.
        - One Shot approach
        - RAG approach
    - Apply evaluation metrics for quality assessment.
        - Develop a processing pipeline for automated evaluation
        - Plot the results
3. Comparison and Optimization (TBD: Target Date)
    - Analyze results from different methods based on the metrics.
    - Potentially adjust prompting architecture based on metric insights.
4. Given Spare Time:
    - Multi-Agent Approach
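The Multi-Agent approach reduces to alternating two agents over a shared history. A minimal sketch, assuming each agent is a callable from history to utterance (in the real setup both would be LLM calls):

```python
def run_dialogue(user_agent, system_agent, max_turns: int = 4):
    """Alternate a user agent and a system agent over a shared history.
    Both agents are callables history -> utterance; real agents would
    wrap LLM calls, but the loop itself is the multi-agent scaffold."""
    history = []
    for _ in range(max_turns):
        history.append(("USER", user_agent(history)))
        history.append(("SYSTEM", system_agent(history)))
    return history
```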