Commit 8ed673d5 authored by finn's avatar finn

Merge branch 'master' of gitlab.cl.uni-heidelberg.de:chrysanthopoulou/provocate

parents 4a38a6b1 69c33b57
+6 −0
@@ -8,6 +8,12 @@
.local
.npm

# Ignore virtual environments
grade_venv/

# Ignore older python / other program versions
python=3.6

# Ignore models-folder except readme
models/*
!models/README.md
+60 −0
# TODO List

The project is divided into three stages. This lets us pursue ambitious goals without the risk of a failed project.

Our goal is to reach the third stage. If we run out of time, we complete only the first two stages. If stage 2 runs into too many unforeseen problems, we will at least have finished the first stage and can go into more detail there, to be sure of delivering a good, successful project.

**Scenarios**

- Stage [#1]: Generate synthetic dialogues with a single model
- Stage [#2]: Generate synthetic dialogues with a database-backed model
- Stage [#3]: Generate synthetic dialogues with interacting, database-backed models

## TODOs

**Overview**

| Lea                           | Finn                            | Chris                           | Target Dates |
| ----------------------------- | ------------------------------- | ------------------------------- | ------------ |
| Annotation of the HD data     | Annotation of the HD data       | Annotation of the HD data       |              |
|                               | Prompt tuning + testing (llama) | Create and populate database    | 12.12.2023   |
| Prepare + test metrics        | Set up cluster for one-shot     |                                 |              |
|                               | Run one-shot on cluster         | Map WOZ schema                  | 19.12.2023   |
|                               | Extract dialogue profiles       |                                 |              |
| Apply metric (one-shot)       | Apply metric (one-shot)         | Prepare database embeddings     | 22.12.2023   |


| TODOs (after Christmas, roughly until 23 Jan)                                                                 | Target Dates |
| ------------------------------------------------------------------------------------------------------------- | ------------ |
| FActScore for decomposing the questions                                                                       |              |
| SBERT, 2 variants                                                                                             |              |
| Map questions from the dialogue onto database facts with SBERT. Are relevant data found? (slot similarity)    |              |
| - SBERT similarity                                                                                            |              |
| - SBERT QA                                                                                                    |              |
| - Is the retrieved fact coherent in KB/dialogue                                                               |              |
| Annotate 5-10 dialogues manually and compare with Labruna                                                     |              |
| Presentation                                                                                                  | 06.02.2024   |





**Task List**

| TaskID | Task Description                                 | Status | Deadline | Dependency | Person |
| ------ | ------------------------------------------------ | ------ | -------- | ---------- | ------ |
| 1      | Annotation of the HD data                        | TODO   | 12.12.   |            | LCF    |
| 2      | Prepare metrics                                  | TODO   | 31.12.   |            | L      |
| 3      | Apply metrics to one-shot                        | TODO   | 10.01.   | 2,9        | LF     |
| 4      | Finish the database                              | TODO   | 12.12.   | 1          | C      |
| 5      | Map WOZ schema                                   | TODO   | 31.12.   | 4          | C      |
| 6      | Embed the database                               | TODO   | 31.12.   | 4          | C      |
| 7      | Prompt tuning + testing                          | TODO   | 12.12.   |            | F      |
| 8      | Set up cluster                                   | TODO   | 12.12.   |            | F      |
| 9      | Run one-shot on cluster                          | TODO   | 31.12.   | 7,8        | LCF    |
| 10     | FActScore for decomposing the questions          | TODO   | 23.01.   |            |        |
| 11     | SBERT (CoSim, QA)                                | TODO   | 23.01.   |            |        |
| 12     | Evaluate embeddings for retrieval                | TODO   | 23.01.   | 11         |        |
| 13     | Annotate 5 dialogues and compare with Labruna    | TODO   | 06.02.   |            |        |
| 14     | Final presentation                               | TODO   | 06.02.   |            |        |
| 15     | Extract dialogue profiles                        | TODO   | 22.12.   | 4          |        |
+5 −0
output/
*.pyc
__pycache__
tools/
.vscode
 No newline at end of file
+108 −0
# **GRADE**: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems


This repository contains the source code for the following paper:


[GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems](https://arxiv.org/abs/2010.03994)   
Lishan Huang, Zheng Ye, Jinghui Qin, Xiaodan Liang; EMNLP 2020

## Model Overview
![GRADE](images/GRADE.png)

## Prerequisites
Create a virtual environment (recommended):
```
conda create -n GRADE python=3.6
source activate GRADE
```
Install the required packages:
```
pip install -r requirements.txt
```

Install Texar locally:
```
cd texar-pytorch
pip install .
```

Note: Make sure that your environment has installed **cuda 10.1**.

## Data Preparation
GRADE is trained on the DailyDialog dataset ([Li et al., 2017](https://arxiv.org/abs/1710.03957)).

For convenience, we provide the [processed data](https://drive.google.com/file/d/1sj3Z_GZfYzrhmleWazA-QawhUEhlNmJd/view?usp=sharing) of DailyDialog; download it and unzip it into the `data` directory. You should also download [tools](https://drive.google.com/file/d/1CaRhHnO0YsQHOnJsmMUJuL4w9HXJZQYw/view?usp=sharing) and unzip it into the root directory of this repo.

If you want to prepare the training data from scratch, follow these steps:
1. Install [Lucene](https://lucene.apache.org/);
2. Run the preprocessing script:
```
cd ./script
bash preprocess_training_dataset.sh
```


## Training
To train GRADE, please run the following script:
```
cd ./script
bash train.sh
```

Note that the [checkpoint](https://drive.google.com/file/d/1v9o-fSohFDegicakrSEnKNcKliOqhYfH/view?usp=sharing) of our final GRADE is provided. You can download it and unzip it into the root directory.

## Evaluation
We evaluate GRADE and other baseline metrics on three chit-chat datasets (DailyDialog, ConvAI2 and EmpatheticDialogues). The corresponding evaluation data in the `evaluation` directory has the following file structure:
```
.
└── evaluation
    └── eval_data
    |   └── DIALOG_DATASET_NAME
    |       └── DIALOG_MODEL_NAME
    |           └── human_ctx.txt
    |           └── human_hyp.txt
    └── human_score
        └── DIALOG_DATASET_NAME
        |   └── DIALOG_MODEL_NAME
        |       └── human_score.txt
        └── human_judgement.json
```
Note: the entire human judgement data we proposed for metric evaluation is in `human_judgement.json`.
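
As a minimal sketch of how this layout can be consumed (the helper and the dataset/model names here are hypothetical, not part of the repo), one can walk to a dataset/model pair and average its per-example human scores, assuming `human_score.txt` holds one float per line:

```python
import os
import tempfile

def mean_human_score(eval_root, dataset, model):
    """Average the per-example scores in human_score.txt (one float per line)."""
    path = os.path.join(eval_root, "human_score", dataset, model, "human_score.txt")
    with open(path) as f:
        scores = [float(line) for line in f if line.strip()]
    return sum(scores) / len(scores)

# Demo against a throwaway copy of the layout:
root = tempfile.mkdtemp()
d = os.path.join(root, "human_score", "dailydialog", "transformer_ranker")
os.makedirs(d)
with open(os.path.join(d, "human_score.txt"), "w") as f:
    f.write("4.0\n2.0\n3.0\n")
print(mean_human_score(root, "dailydialog", "transformer_ranker"))  # 3.0
```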


To evaluate GRADE, please run the following script:
```
cd ./script
bash eval.sh
```

## Using GRADE
To use GRADE on your own dialog dataset:
1. Put the whole dataset (raw data) into `./preprocess/dataset`;
2. Update the function **load_dataset** in `./preprocess/extract_keywords.py` to load your dataset;
3. Prepare the context-response data that you want to evaluate and convert it into the following format:
```
.
└── evaluation
    └── eval_data
        └── YOUR_DIALOG_DATASET_NAME
            └── YOUR_DIALOG_MODEL_NAME
                ├── human_ctx.txt
                └── human_hyp.txt
```
4. Run the following script to evaluate the context-response data with GRADE:
```
cd ./script
bash inference.sh
```
5. Lastly, the scores given by GRADE can be found as follows:
```
.
└── evaluation
    └── infer_result
        └── YOUR_DIALOG_DATASET_NAME
            └── YOUR_DIALOG_MODEL_NAME
                ├── non_reduced_results.json
                └── reduced_results.json
```
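
The steps above can be sketched end to end (the dataset/model names, the `|||` turn separator, and the shape of `reduced_results.json` are assumptions for illustration; check the provided DailyDialog eval files for the exact line format):

```python
import json
import os
import tempfile

# Hypothetical context-response pairs; how multi-turn contexts are joined on
# one line is an assumption -- check the provided eval_data files.
pairs = [
    ("hi , how are you ? ||| i am fine , thanks .", "glad to hear that !"),
    ("what do you do ?", "i am a teacher ."),
]

root = tempfile.mkdtemp()
target = os.path.join(root, "evaluation", "eval_data", "my_dataset", "my_model")
os.makedirs(target)
with open(os.path.join(target, "human_ctx.txt"), "w") as f:
    f.writelines(ctx + "\n" for ctx, _ in pairs)
with open(os.path.join(target, "human_hyp.txt"), "w") as f:
    f.writelines(hyp + "\n" for _, hyp in pairs)

# After `bash inference.sh`, the reduced scores could be read back like this
# (assuming reduced_results.json holds one score per example):
infer = os.path.join(root, "evaluation", "infer_result", "my_dataset", "my_model")
os.makedirs(infer)
with open(os.path.join(infer, "reduced_results.json"), "w") as f:
    json.dump([0.61, 0.48], f)  # stand-in scores for the demo
with open(os.path.join(infer, "reduced_results.json")) as f:
    scores = json.load(f)
print(len(scores))  # 2
```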
+80 −0
import copy
init_embd_file = './tools/numberbatch-en-19.08.txt'
pickle_data_dir = './data/convai2'
max_keyword_length = 16
max_seq_length = 128
num_classes = 2
num_test_data = 150

vocab_file = './data/DailyDialog/keyword.vocab'
train_batch_size = 8
max_train_epoch = 20
pretrained_epoch = -1
display_steps = 50  # Print training loss every display_steps; -1 to disable


eval_steps = 100  # Eval on the dev set every eval_steps; -1 to disable
# Proportion of training to perform linear learning rate warmup for.
# E.g., 0.1 = 10% of training.
warmup_proportion = 0.1
eval_batch_size = 32
test_batch_size = 32


feature_types = {
    # Reading features from pickled data file.
    # E.g., Reading feature "input_ids" as dtype `int64`;
    # "FixedLenFeature" indicates its length is fixed for all data instances;
    # and the sequence length is limited by `max_seq_length`.
    "input_ids_raw_text": ["int64", "stacked_tensor", max_seq_length],
    "input_mask_raw_text": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids_raw_text": ["int64", "stacked_tensor", max_seq_length],

    "input_ids_raw_context": ["int64", "stacked_tensor", max_seq_length],
    "input_mask_raw_context": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids_raw_context": ["int64", "stacked_tensor", max_seq_length],

    "input_ids_raw_response": ["int64", "stacked_tensor", max_seq_length],
    "input_mask_raw_response": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids_raw_response": ["int64", "stacked_tensor", max_seq_length],
}


test_hparam = {
    "allow_smaller_final_batch": True,
    "batch_size": test_batch_size,
    "datasets": [
        {
            "files": "{}/test/pair-1/test_text.pkl".format(pickle_data_dir),
            "data_name": "pair_1",
            "data_type": "record",
            "feature_types": feature_types,
        },
        {
            "files": "{}/test/pair-1/original_dialog_merge.keyword".format(pickle_data_dir),
            "data_name": "keyword_pair_1",
            "vocab_file": vocab_file,
            "embedding_init": {
                "file": init_embd_file,
                "dim": 300,
                "read_fn": "load_glove",
            },
            # The length does not include any added "bos_token" or "eos_token".
            "max_seq_length": max_keyword_length,
        },
        {
            "files": "{}/test/pair-1/original_dialog_merge.ctx_keyword".format(pickle_data_dir),
            "data_name": "ctx_keyword_pair_1",
            "vocab_share_with": 1,
            "embedding_init_share_with": 1,
            # The length does not include any added "bos_token" or "eos_token".
            "max_seq_length": max_keyword_length,
        },
        {
            "files": "{}/test/pair-1/original_dialog_merge.rep_keyword".format(pickle_data_dir),
            "data_name": "rep_keyword_pair_1",
            "vocab_share_with": 1,
            "embedding_init_share_with": 1,
            # The length does not include any added "bos_token" or "eos_token".
            "max_seq_length": max_keyword_length,
        },
    ],
    "shuffle": False,
}