Merge branch 'master' of gitlab.cl.uni-heidelberg.de:chrysanthopoulou/provocate (4dd7e875) · Commits · hillengass / SynDRA

.gitignore

+6 −0

Original line number	Diff line number	Diff line
		@@ -8,6 +8,12 @@
		.local
		.npm

		# Ignore virtual environments
		grade_venv/

		# Ignore older python / other program versions
		python=3.6

		# Ignore models-folder except readme
		models/*
		!models/README.md

formatted_dialogues.json→data/own_data/dialogue_domain_knowledge.json

+2034 −2034

File changed and moved.

Preview size limit exceeded, changes collapsed.

documentation/meeting_notes/05.12.2023.md

+8 −22

		# Notes

		- Send the presentation, the schedule, and the notes from the whiteboard
		- Update the schedule
		- Appoint a project manager, or someone who questions things
		- Send things in advance so that we get feedback
		- Present the schedule more clearly
		- Work out a concept for randomly generating dialogue scripts
		- Model alternatives as part of this
		- Send the presentation and the schedule
		- Update the schedule (with what was on the whiteboard)
		- Appoint a project manager, or someone who questions things (Lea)
		- In general, send things in advance so that we get feedback
		- Are our slots the same as in the Multi dataset, or do we need to adapt things? We want to be able to compare the slot values more easily
		- Map to the data from MultiWOZ so that we fill the same slots and can compare slot accuracies more easily
		-


		## Weekly plan

		- Complete annotation
		- Continue work on the metrics
		- Prompt tuning
		- Set up a model for embedding
		- Turn the information into natural-language sentences and embed them (SBERT, because it is pretrained to judge how similar two sentences are, here a given query and a piece of information; embed the whole database once with SBERT and then use a matrix multiplication to find the closest entry). For this it matters what the queries look like and that they are broken down into the same format as our embeddings. S3BERT would also be an option, because we need to create more complex statements about our objects, i.e. more detailed descriptions; then we can compare questions and see which pieces of information overlap or are similar
		- Natural language to slots and vice versa, otherwise no slot metrics
		- Randomized generation from data
		- How do we query the database?
		- Lists of how and what we do, then a further list with to-dos and exact times, and assign roles to them
		- Ideally many meetings, including short ones, and a logbook, i.e. tick off to-dos
		- To-do lists (also per week)
		No newline at end of file
		- Natural language to slots and vice versa, otherwise no slot metrics ??
		- Assign people to to-dos, always tick items off when done, and push everything
		- Derive to-dos from the whiteboard and enter them in a file
		- Set up a model for embedding
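
The retrieval idea in the weekly plan (embed the whole database once, then find the closest entry to a query with a matrix multiplication) can be sketched as follows. This is a minimal illustration only: `embed` is a toy bag-of-words stand-in for an SBERT encoder, and the entries, the query, and all function names are hypothetical and not taken from the project.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy stand-in for an SBERT encoder: L2-normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def nearest_entry(query, entries):
    """Return (index, score) of the database entry most similar to the query."""
    vocab = sorted({w for t in entries + [query] for w in t.lower().split()})
    # Embed the database once: one unit-length row per entry.
    matrix = [embed(e, vocab) for e in entries]
    q = embed(query, vocab)
    # Matrix-vector product; with unit-length rows each dot product
    # is the cosine similarity between an entry and the query.
    scores = [sum(a * b for a, b in zip(row, q)) for row in matrix]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical database entries, verbalized as natural-language sentences.
entries = [
    "the hotel alpha is in the north and is cheap",
    "the restaurant beta serves italian food in the centre",
    "the hotel gamma is in the south and is expensive",
]
index, score = nearest_entry("i want a cheap hotel in the north", entries)
print(index, round(score, 3))
```

With a real sentence encoder, only `embed` would change; the entry matrix can still be computed once and reused for every query, provided queries are phrased in the same format as the embedded sentences.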

metrics/grade/GRADE/.gitignore

0 → 100644

+5 −0

		output/
		*.pyc
		__pycache__
		tools/
		.vscode
		No newline at end of file

metrics/grade/GRADE/README.md

0 → 100644

+108 −0

		# GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems


		This repository contains the source code for the following paper:


		[GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems](https://arxiv.org/abs/2010.03994)
		Lishan Huang, Zheng Ye, Jinghui Qin, Xiaodan Liang; EMNLP 2020

		## Model Overview
		![GRADE](images/GRADE.png)

		## Prerequisites
		Create a virtual environment (recommended):
		```
		conda create -n GRADE python=3.6
		source activate GRADE
		```
		Install the required packages:
		```
		pip install -r requirements.txt
		```

		Install Texar locally:
		```
		cd texar-pytorch
		pip install .
		```

		Note: Make sure that CUDA 10.1 is installed in your environment.

		## Data Preparation
		GRADE is trained on the DailyDialog dataset proposed by [Li et al., 2017](https://arxiv.org/abs/1710.03957).

		For convenience, we provide the [processed data](https://drive.google.com/file/d/1sj3Z_GZfYzrhmleWazA-QawhUEhlNmJd/view?usp=sharing) of DailyDialog; download it and unzip it into the `data` directory. Also download the [tools](https://drive.google.com/file/d/1CaRhHnO0YsQHOnJsmMUJuL4w9HXJZQYw/view?usp=sharing) and unzip them into the root directory of this repo.

		If you want to prepare the training data from scratch, follow these steps:
		1. Install [Lucene](https://lucene.apache.org/);
		2. Run the preprocessing script:
		```
		cd ./script
		bash preprocess_training_dataset.sh
		```


		## Training
		To train GRADE, please run the following script:
		```
		cd ./script
		bash train.sh
		```

		Note that the [checkpoint](https://drive.google.com/file/d/1v9o-fSohFDegicakrSEnKNcKliOqhYfH/view?usp=sharing) of our final GRADE model is provided. You can download it and unzip it into the root directory.

		## Evaluation
		We evaluate GRADE and other baseline metrics on three chit-chat datasets (DailyDialog, ConvAI2 and EmpatheticDialogues). The corresponding evaluation data in the `evaluation` directory has the following file structure:
		```
		.
		└── evaluation
		    ├── eval_data
		    │   └── DIALOG_DATASET_NAME
		    │       └── DIALOG_MODEL_NAME
		    │           ├── human_ctx.txt
		    │           └── human_hyp.txt
		    └── human_score
		        ├── DIALOG_DATASET_NAME
		        │   └── DIALOG_MODEL_NAME
		        │       └── human_score.txt
		        └── human_judgement.json
		```
		Note: the entire human judgement data we proposed for metric evaluation is in `human_judgement.json`.


		To evaluate GRADE, please run the following script:
		```
		cd ./script
		bash eval.sh
		```

		## Using GRADE
		To use GRADE on your own dialog dataset:
		1. Put the whole dataset (raw data) into `./preprocess/dataset`;
		2. Update the function `load_dataset` in `./preprocess/extract_keywords.py` so that it loads your dataset;
		3. Prepare the context-response data that you want to evaluate and convert it into the following format:
		```
		.
		└── evaluation
		    └── eval_data
		        └── YOUR_DIALOG_DATASET_NAME
		            └── YOUR_DIALOG_MODEL_NAME
		                ├── human_ctx.txt
		                └── human_hyp.txt
		```
		4. Run the following script to evaluate the context-response data with GRADE:
		```
		cd ./script
		bash inference.sh
		```
		5. Lastly, the scores given by GRADE can be found in the following location:
		```
		.
		└── evaluation
		    └── infer_result
		        └── YOUR_DIALOG_DATASET_NAME
		            └── YOUR_DIALOG_MODEL_NAME
		                ├── non_reduced_results.json
		                └── reduced_results.json
		```
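
Step 3 above (laying out the context-response data under `evaluation/eval_data`) could be scripted roughly as follows. This is a sketch under stated assumptions, not part of GRADE: the helper `write_eval_data` is hypothetical, and the one-example-per-line layout and the ` ||| ` turn separator are assumptions that should be checked against the evaluation data shipped with the repository.

```python
import tempfile
from pathlib import Path


def write_eval_data(pairs, dataset_name, model_name,
                    root="evaluation/eval_data", turn_sep=" ||| "):
    """Write contexts to human_ctx.txt and responses to human_hyp.txt.

    Assumption: one example per line, context turns joined by `turn_sep`.
    """
    out_dir = Path(root) / dataset_name / model_name
    out_dir.mkdir(parents=True, exist_ok=True)
    ctx_lines, hyp_lines = [], []
    for context_turns, response in pairs:
        # Collapse whitespace so every example stays on exactly one line.
        ctx_lines.append(turn_sep.join(" ".join(t.split()) for t in context_turns))
        hyp_lines.append(" ".join(response.split()))
    (out_dir / "human_ctx.txt").write_text("\n".join(ctx_lines) + "\n", encoding="utf-8")
    (out_dir / "human_hyp.txt").write_text("\n".join(hyp_lines) + "\n", encoding="utf-8")
    return out_dir


# Hypothetical context-response pairs: (list of context turns, response).
pairs = [
    (["Hi, I need a hotel.", "Sure, which area?"], "Somewhere in the north, please."),
    (["Is there a cheap restaurant nearby?"], "Yes, there is one in the centre."),
]

# Written to a temporary directory here; in the repo, the default root would be used.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = write_eval_data(pairs, "YOUR_DIALOG_DATASET_NAME", "YOUR_DIALOG_MODEL_NAME",
                              root=Path(tmp) / "evaluation" / "eval_data")
    print((out_dir / "human_ctx.txt").read_text(encoding="utf-8"))
```

Keeping the two files line-aligned matters under this assumption: line *n* of `human_hyp.txt` is the response to the context on line *n* of `human_ctx.txt`.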