Commit 8ed673d5 authored by finn's avatar finn

Merge branch 'master' of gitlab.cl.uni-heidelberg.de:chrysanthopoulou/provocate

parents 4a38a6b1 69c33b57
+6 −0
@@ -8,6 +8,12 @@
.local
.npm

# Ignore virtual environments
grade_venv/

# Ignore older python / other program versions
python=3.6

# Ignore models-folder except readme
models/*
!models/README.md
+60 −0
# TODO List

The project is divided into three stages. This lets us pursue ambitious goals without the risk of a failed project.

Our goal is to reach the third stage. If we run out of time, we complete only the first two stages. If stage 2 runs into too many unforeseen problems, we will at least have finished the first stage and can go into more detail there, to be sure of delivering a good, successful project.

**Scenarios**

- Stage [#1]: Generate synthetic dialogues with a single model
- Stage [#2]: Generate synthetic dialogues with a database-backed model
- Stage [#3]: Generate synthetic dialogues with interacting, database-backed models

## TODOs

**Overview**

| Lea                           | Finn                            | Chris                           | Target Dates |
| ----------------------------- | ------------------------------- | ------------------------------- | ------------ |
| Annotation of the HD data     | Annotation of the HD data       | Annotation of the HD data       |              |
|                               | Prompt tuning + testing (llama) | Create and populate database    | 12.12.2023   |
| Prepare + test metrics        | Set up cluster for one-shot     |                                 |              |
|                               | Run one-shot on cluster         | Map WOZ schema                  | 19.12.2023   |
|                               | Extract dialogue profiles       |                                 |              |
| Apply metric (one-shot)       | Apply metric (one-shot)         | Prepare database embeddings     | 22.12.2023   |


| TODOs (after Christmas, roughly until 23 Jan)                                                                 | Target Dates |
| ------------------------------------------------------------------------------------------------------------- | ------------ |
| FActScore for decomposing the questions                                                                       |              |
| SBERT, 2 variants                                                                                             |              |
| Map questions from the dialogue onto database facts with SBERT. Are relevant data found? (slot similarity)    |              |
| - SBERT similarity                                                                                            |              |
| - SBERT QA                                                                                                    |              |
| - Is the retrieved fact coherent in KB/dialogue                                                               |              |
| Annotate 5-10 dialogues manually and compare with Labruna                                                     |              |
| Presentation                                                                                                  | 06.02.2024   |





**Task List**

| TaskID | Task Description                                 | Status | Deadline | Dependency | Person |
| ------ | ------------------------------------------------ | ------ | -------- | ---------- | ------ |
| 1      | Annotation of the HD data                        | TODO   | 12.12.   |            | LCF    |
| 2      | Prepare metrics                                  | TODO   | 31.12.   |            | L      |
| 3      | Apply metrics to one-shot                        | TODO   | 10.01.   | 2,9        | LF     |
| 4      | Finish the database                              | TODO   | 12.12.   | 1          | C      |
| 5      | Map WOZ schema                                   | TODO   | 31.12.   | 4          | C      |
| 6      | Embed the database                               | TODO   | 31.12.   | 4          | C      |
| 7      | Prompt tuning + testing                          | TODO   | 12.12.   |            | F      |
| 8      | Set up cluster                                   | TODO   | 12.12.   |            | F      |
| 9      | Run one-shot on cluster                          | TODO   | 31.12.   | 7,8        | LCF    |
| 10     | FActScore for decomposing the questions          | TODO   | 23.01.   |            |        |
| 11     | SBERT (CoSim, QA)                                | TODO   | 23.01.   |            |        |
| 12     | Evaluate embeddings for retrieval                | TODO   | 23.01.   | 11         |        |
| 13     | Annotate 5 dialogues and compare with Labruna    | TODO   | 06.02.   |            |        |
| 14     | Final presentation                               | TODO   | 06.02.   |            |        |
| 15     | Extract dialogue profiles                        | TODO   | 22.12.   | 4          |        |
+5 −0
output/
*.pyc
__pycache__
tools/
.vscode
 No newline at end of file
+108 −0
# **GRADE**: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems


This repository contains the source code for the following paper:


[GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems](https://arxiv.org/abs/2010.03994)   
Lishan Huang, Zheng Ye, Jinghui Qin, Xiaodan Liang; EMNLP 2020

## Model Overview
![GRADE](images/GRADE.png)

## Prerequisites
Create a virtual environment (recommended):
```
conda create -n GRADE python=3.6
source activate GRADE
```
Install the required packages:
```
pip install -r requirements.txt
```

Install Texar locally:
```
cd texar-pytorch
pip install .
```

Note: Make sure that your environment has installed **cuda 10.1**.

## Data Preparation
GRADE is trained on the DailyDialog dataset ([Li et al., 2017](https://arxiv.org/abs/1710.03957)).

For convenience, we provide the [processed data](https://drive.google.com/file/d/1sj3Z_GZfYzrhmleWazA-QawhUEhlNmJd/view?usp=sharing) of DailyDialog; download it and unzip it into the `data` directory. You should also download [tools](https://drive.google.com/file/d/1CaRhHnO0YsQHOnJsmMUJuL4w9HXJZQYw/view?usp=sharing) and unzip it into the root directory of this repo.

If you want to prepare the training data from scratch, follow these steps:
1. Install [Lucene](https://lucene.apache.org/);
2. Run the preprocessing script:
```
cd ./script
bash preprocess_training_dataset.sh
```


## Training
To train GRADE, please run the following script:
```
cd ./script
bash train.sh
```

Note that the [checkpoint](https://drive.google.com/file/d/1v9o-fSohFDegicakrSEnKNcKliOqhYfH/view?usp=sharing) of our final GRADE is provided. You can download it and unzip it into the root directory.

## Evaluation
We evaluate GRADE and other baseline metrics on three chit-chat datasets (DailyDialog, ConvAI2 and EmpatheticDialogues). The corresponding evaluation data in the `evaluation` directory has the following file structure:
```
.
└── evaluation
    └── eval_data
    |   └── DIALOG_DATASET_NAME
    |       └── DIALOG_MODEL_NAME
    |           └── human_ctx.txt
    |           └── human_hyp.txt
    └── human_score
        └── DIALOG_DATASET_NAME
        |   └── DIALOG_MODEL_NAME
        |       └── human_score.txt
        └── human_judgement.json
```
Note: the entire human judgement data we proposed for metric evaluation is in `human_judgement.json`.
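
As a minimal sketch of how this layout can be consumed (the helper and the dataset/model names here are hypothetical, not part of the repo), one can walk to a dataset/model pair and average its per-example human scores, assuming `human_score.txt` holds one float per line:

```python
import os
import tempfile

def mean_human_score(eval_root, dataset, model):
    """Average the per-example scores in human_score.txt (one float per line)."""
    path = os.path.join(eval_root, "human_score", dataset, model, "human_score.txt")
    with open(path) as f:
        scores = [float(line) for line in f if line.strip()]
    return sum(scores) / len(scores)

# Demo against a throwaway copy of the layout:
root = tempfile.mkdtemp()
d = os.path.join(root, "human_score", "dailydialog", "transformer_ranker")
os.makedirs(d)
with open(os.path.join(d, "human_score.txt"), "w") as f:
    f.write("4.0\n2.0\n3.0\n")
print(mean_human_score(root, "dailydialog", "transformer_ranker"))  # 3.0
```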


To evaluate GRADE, please run the following script:
```
cd ./script
bash eval.sh
```

## Using GRADE
To use GRADE on your own dialog dataset:
1. Put the whole dataset (raw data) into `./preprocess/dataset`;
2. Update the function **load_dataset** in `./preprocess/extract_keywords.py` to load your dataset;
3. Prepare the context-response data that you want to evaluate and convert it into the following format:
```
.
└── evaluation
    └── eval_data
        └── YOUR_DIALOG_DATASET_NAME
            └── YOUR_DIALOG_MODEL_NAME
                ├── human_ctx.txt
                └── human_hyp.txt
```
4. Run the following script to evaluate the context-response data with GRADE:
```
cd ./script
bash inference.sh
```
5. Lastly, the scores given by GRADE can be found as follows:
```
.
└── evaluation
    └── infer_result
        └── YOUR_DIALOG_DATASET_NAME
            └── YOUR_DIALOG_MODEL_NAME
                ├── non_reduced_results.json
                └── reduced_results.json
```
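
The steps above can be sketched end to end (the dataset/model names, the `|||` turn separator, and the shape of `reduced_results.json` are assumptions for illustration; check the provided DailyDialog eval files for the exact line format):

```python
import json
import os
import tempfile

# Hypothetical context-response pairs; how multi-turn contexts are joined on
# one line is an assumption -- check the provided eval_data files.
pairs = [
    ("hi , how are you ? ||| i am fine , thanks .", "glad to hear that !"),
    ("what do you do ?", "i am a teacher ."),
]

root = tempfile.mkdtemp()
target = os.path.join(root, "evaluation", "eval_data", "my_dataset", "my_model")
os.makedirs(target)
with open(os.path.join(target, "human_ctx.txt"), "w") as f:
    f.writelines(ctx + "\n" for ctx, _ in pairs)
with open(os.path.join(target, "human_hyp.txt"), "w") as f:
    f.writelines(hyp + "\n" for _, hyp in pairs)

# After `bash inference.sh`, the reduced scores could be read back like this
# (assuming reduced_results.json holds one score per example):
infer = os.path.join(root, "evaluation", "infer_result", "my_dataset", "my_model")
os.makedirs(infer)
with open(os.path.join(infer, "reduced_results.json"), "w") as f:
    json.dump([0.61, 0.48], f)  # stand-in scores for the demo
with open(os.path.join(infer, "reduced_results.json")) as f:
    scores = json.load(f)
print(len(scores))  # 2
```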
+80 −0
import copy
init_embd_file = './tools/numberbatch-en-19.08.txt'
pickle_data_dir = './data/convai2'
max_keyword_length = 16
max_seq_length = 128
num_classes = 2
num_test_data = 150

vocab_file = './data/DailyDialog/keyword.vocab'
train_batch_size = 8
max_train_epoch = 20
pretrained_epoch = -1
display_steps = 50  # Print training loss every display_steps; -1 to disable


eval_steps = 100  # Eval on the dev set every eval_steps; -1 to disable
# Proportion of training to perform linear learning rate warmup for.
# E.g., 0.1 = 10% of training.
warmup_proportion = 0.1
eval_batch_size = 32
test_batch_size = 32


feature_types = {
    # Reading features from pickled data file.
    # E.g., Reading feature "input_ids" as dtype `int64`;
    # "FixedLenFeature" indicates its length is fixed for all data instances;
    # and the sequence length is limited by `max_seq_length`.
    "input_ids_raw_text": ["int64", "stacked_tensor", max_seq_length],
    "input_mask_raw_text": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids_raw_text": ["int64", "stacked_tensor", max_seq_length],

    "input_ids_raw_context": ["int64", "stacked_tensor", max_seq_length],
    "input_mask_raw_context": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids_raw_context": ["int64", "stacked_tensor", max_seq_length],

    "input_ids_raw_response": ["int64", "stacked_tensor", max_seq_length],
    "input_mask_raw_response": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids_raw_response": ["int64", "stacked_tensor", max_seq_length],
}


test_hparam = {
    "allow_smaller_final_batch": True,
    "batch_size": test_batch_size,
    "datasets": [
        {
            "files": "{}/test/pair-1/test_text.pkl".format(pickle_data_dir),
            "data_name": "pair_1",
            "data_type": "record",
            "feature_types": feature_types,
        },
        {
            "files": "{}/test/pair-1/original_dialog_merge.keyword".format(pickle_data_dir),
            "data_name": "keyword_pair_1",
            "vocab_file": vocab_file,
            "embedding_init": {
                "file": init_embd_file,
                "dim": 300,
                "read_fn": "load_glove",
            },
            # The length does not include any added "bos_token" or "eos_token".
            "max_seq_length": max_keyword_length,
        },
        {
            "files": "{}/test/pair-1/original_dialog_merge.ctx_keyword".format(pickle_data_dir),
            "data_name": "ctx_keyword_pair_1",
            "vocab_share_with": 1,
            "embedding_init_share_with": 1,
            # The length does not include any added "bos_token" or "eos_token".
            "max_seq_length": max_keyword_length,
        },
        {
            "files": "{}/test/pair-1/original_dialog_merge.rep_keyword".format(pickle_data_dir),
            "data_name": "rep_keyword_pair_1",
            "vocab_share_with": 1,
            "embedding_init_share_with": 1,
            # The length does not include any added "bos_token" or "eos_token".
            "max_seq_length": max_keyword_length,
        },
    ],
    "shuffle": False,
}