Commit ba26d526 authored by born

Updated README

parent 2ac83b15

Please note that we scraped newsroom ourselves and were not able to fully download the dataset. More specifically, 961 articles could not be downloaded. For reference, the file _data/newsroom\_train\_BBM.txt_ thus contains a list of all the 994,080 articles that constituted our version of the training data. You can use the script _get\_newsroom\_train\_ids.py_ to generate such a file as well (it simply needs the file _train.data_ as input).
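For orientation, the ID-file generation can be sketched as follows, assuming (as an illustration, not the actual contents of _get\_newsroom\_train\_ids.py_) that _train.data_ is gzipped JSON lines with a `url` field per article record:

```python
import gzip
import json

def collect_article_urls(jsonl_lines):
    """Return the url of every article record in an iterable of JSON lines."""
    urls = []
    for line in jsonl_lines:
        record = json.loads(line)
        urls.append(record["url"])
    return urls

def write_train_ids(train_path, out_path):
    """Write one article url per line, roughly what the ID file contains.
    Assumes train_path is a gzipped JSON-lines file."""
    with gzip.open(train_path, "rt", encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for url in collect_article_urls(f_in):
            f_out.write(url + "\n")
```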

## Obtaining timelines
While the newsroom dataset contains the articles needed, it does not supply the human-generated timelines used for evaluation. We source our timelines from three datasets: _crisis_, _timeline17_ (_tl17_), and _12Stories_. _timeline17_ and _crisis_ are provided by Tran et al. [here](http://www.l3s.de/~gtran/timeline/). _12Stories_ is not publicly released; for copyright reasons, we only supply the original URLs the timelines were retrieved from. However, these might have been edited in the meantime and may also contain boilerplate. Please contact us (<born@cl.uni-heidelberg.de> or <markert@cl.uni-heidelberg.de>) if you want to receive a copy of the cleaned _12Stories_ timelines.

Every event sub-directory contains a .txt with the filenames and sources of the timelines needed for evaluation. 

# Pre-processing of training data
## Step 1: Temporally tagging newsroom
Since tilse requires texts tagged with [heideltime](https://github.com/HeidelTime/heideltime) to construct and evaluate timelines, you need to pre-process _train.data_ as follows:
  * Create a temporary directory for the files to tag, e.g. ```mkdir -p ../data/tmp```
  * Run _heideltime\_newsroom.py_ to create _train\_tagged.data_
  * Example: ```python heideltime_newsroom.py /path/to/tilse/tools/heideltime ../data/train.data 2007 2016 ../data/tmp ../data/train_tagged.data```
    * Specifying the years is necessary so as not to tag _all_ of newsroom (since we have constraints on the events we consider; see the paper for details)
  * After this is done, you can delete the temporary directory
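The year window amounts to filtering articles by publication date before tagging. A minimal sketch, assuming each article record carries an ISO-style `date` field; the helpers are illustrative, not the actual code of _heideltime\_newsroom.py_:

```python
def in_year_range(date_str, start_year, end_year):
    """True if an ISO-style date string (YYYY-MM-DD...) falls in [start, end]."""
    year = int(date_str[:4])
    return start_year <= year <= end_year

def select_articles_to_tag(records, start_year=2007, end_year=2016):
    """Keep only records whose publication date lies inside the year window,
    mirroring the year arguments passed to the tagging script."""
    return [r for r in records
            if r.get("date") and in_year_range(r["date"], start_year, end_year)]
```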

## Step 2: Extracting event-specific corpora
As explained in the section _Repo structure_, we do not provide the event-specific corpora _per se_, but rather only a list of URLs constituting these corpora. This is for reference purposes only and cannot be used to predict timelines with tilse.

Thus, to prepare event-specific corpora for use with tilse, you have to run one (or both) of the two extraction scripts, _simple\_extraction.py_ and _bm25\_extraction.py_. For details on how to use them, run the scripts without any arguments. The most significant difference between them is that _bm25\_extraction.py_ requires you to specify each event individually as well as the number of articles to extract (see the paper for the reasoning behind this). _simple\_extraction.py_, on the other hand, creates corpora for all events at once.
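To illustrate what BM25-based extraction does, here is a self-contained Okapi BM25 sketch that ranks tokenised articles against an event query and keeps the top _k_. This is a generic illustration of the retrieval step, not the actual implementation of _bm25\_extraction.py_:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenised document in docs against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                 # term frequency within this document
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def top_k_articles(query_terms, docs, k):
    """Indices of the k highest-scoring documents, best first."""
    scores = bm25_scores(query_terms, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
```

Per-event queries and a fixed number of extracted articles correspond to the event name and article count that _bm25\_extraction.py_ takes as arguments.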


After extracting the event-specific corpora, they have to be processed for use with tilse. To do this, simply use the script _create\_corpus\_dumps.py_ with the IR-specific directory as argument, e.g. ```python create_corpus_dumps.py ../data/IR_simple``` to prepare the simple IR corpora for tilse.

**To reproduce the procedure from the paper, simply run ```bash extract_event_corpora.sh```. This creates all event-specific corpora necessary to validate the results reported in the paper.**

# Predicting timelines
Predicting timelines now only amounts to running the tilse script _predict-timelines_ with the appropriate configuration file. You will find all config files we used in the _configs_ directory of this repository.

The script _run\_tilse.sh_ runs tilse in all of our nine settings (three algorithms x three IR methods).

**Note that you will need to replace the corpus paths in the config files with the _absolute_ paths to your _IR\_simple_, _IR\_bm25_, and _IR\_bm25\_boot_ directories from above.**
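A small helper for that path replacement might look like the sketch below; the config key and path string are hypothetical examples, so adapt them to the actual contents of the config files:

```python
import os

def absolutize_corpus_path(config_text, relative_path):
    """Replace a relative corpus path appearing in a config file's text
    with its absolute, normalised equivalent (illustrative helper only)."""
    return config_text.replace(relative_path, os.path.abspath(relative_path))
```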

# Reference
If you make use of the contents of this repository, please cite the following paper: