Commit f5de3216 authored by hubert

add further steps

parent 1eb53768



## data pre-processing

Data pre-processing follows the fairseq speech-to-text module. Note that if you create your own datasets and want to use an NMT expert, you need to process the target transcripts and translations in the speech translation/recognition dataset the same way you processed the data for the NMT task.

Scripts to extract source transcripts and target translations from the CSV data files created by the speech-to-text pre-processing are included.
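What these scripts do can be illustrated with a minimal, self-contained sketch. The file name, column names, and column order below are assumptions for illustration (loosely modeled on fairseq's tab-separated speech-to-text manifests); the real data files may differ:

```shell
# Hypothetical miniature manifest; columns (id, audio, n_frames, tgt_text,
# src_text) are assumptions for illustration only.
printf 'id\taudio\tn_frames\ttgt_text\tsrc_text\n'   > mini_manifest.tsv
printf 'utt1\ta.wav\t100\tHallo Welt\thello world\n' >> mini_manifest.tsv
printf 'utt2\tb.wav\t150\tGuten Tag\tgood day\n'     >> mini_manifest.tsv

# Drop the header row, then pull the target translations (column 4) and
# source transcripts (column 5) into plain-text files, one line per utterance.
tail -n +2 mini_manifest.tsv | cut -f4 > train_text.txt
tail -n +2 mini_manifest.tsv | cut -f5 > train_text_en.txt

cat train_text.txt
```

The resulting plain-text files are what the tokenization and BPE steps below operate on.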


For CoVoST 2 and MuST-C:
1. Adapt the file paths in [get_src_to_st.py](fairseq/get_src_to_st.py) to fit your setup and run `python get_src_to_st.py`.
2. Adapt the file paths in [get_source_text.py](fairseq/examples/speech_to_text/get_source_text.py) to your setup and run `python get_source_text.py`.
3. The extracted data files are saved in `${dataset_name}/${split_name}`.
4. Process the extracted text data the same way you did for your NMT expert, e.g. by adapting [prepare-rest.sh](fairseq/examples/speech_to_text/prepare-rest.sh).
5. Run `python get_source_text.py` again.
6. Adapt the configuration files to point to your NMT expert's vocabulary and BPE.

## model training

Model training follows the fairseq framework.
For instance:
`fairseq-train ${COVOST_ROOT}/en --config-yaml config_st_en_de.yaml --train-subset train_processed --valid-subset dev_processed --num-workers 8 --max-tokens 50000 --max-update 30000 --task speech_to_text --criterion imit_kd --report-accuracy --arch s2t_transformer_s \
--optimizer adam --lr 0.002 --lr-scheduler inverse_sqrt --warmup-updates 10000 --seed 1 --clip-norm 10.0 --update-freq 8 --patience 10 --expert ${PATH_TO_EXPERT_MODEL} --expert-vocab-tgt ${PATH_TO_EXPERT_MODEL_DICTIONARY} --expert-vocab-src ${PATH_TO_EXPERT_MODEL_SRC_DICTIONARY} --path ${PATH_TO_EXPERT_MODEL_DICTIONARY} \
--save-dir ${ST_SAVE_DIR} --bpe-codes ${PATH_TO_BPE} --load-pretrained-encoder-from ${ASR_MODEL} --encoder-freezing-updates 1000`
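After training, evaluation can follow the usual fairseq workflow. As a sketch (the checkpoint name, beam size, and scoring option are assumptions, not prescribed by this repository):

```shell
fairseq-generate ${COVOST_ROOT}/en --config-yaml config_st_en_de.yaml \
  --gen-subset test_processed --task speech_to_text \
  --path ${ST_SAVE_DIR}/checkpoint_best.pt \
  --max-tokens 50000 --beam 5 --scoring sacrebleu
```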
#!/bin/bash
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh

echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git

echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git

SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl

fileen=mustc_processed/train_text_en.txt
tmp=mustc_processed_text
mkdir -p ${tmp}

cat $fileen | \
  perl $NORM_PUNC en | \
  perl $REM_NON_PRINT_CHAR | \
  perl $TOKENIZER -threads 16 -a -l en >> ${tmp}/train.tok.en


for split in dev test
do
  fileen=mustc_processed/${split}_text_en.txt

  cat $fileen | \
    perl $TOKENIZER -threads 16 -a -l en >> ${tmp}/${split}.tok.en
done


filede=mustc_processed/train_text.txt
tmp=mustc_processed_text
mkdir -p ${tmp}

cat $filede | \
  perl $NORM_PUNC de | \
  perl $REM_NON_PRINT_CHAR | \
  perl $TOKENIZER -threads 16 -a -l de >> ${tmp}/train.tok.de


for split in dev test
do
  filede=mustc_processed/${split}_text.txt

  cat $filede | \
    perl $TOKENIZER -threads 16 -a -l de  >> ${tmp}/${split}.tok.de
done

fileen=covost/train_text_en.txt
tmp=covost_processed_text
mkdir -p ${tmp}

cat $fileen | \
  perl $NORM_PUNC en | \
  perl $REM_NON_PRINT_CHAR | \
  perl $TOKENIZER -threads 16 -a -l en >> ${tmp}/train.tok.en


for split in dev test
do
  fileen=covost/${split}_text_en.txt

  cat $fileen | \
    perl $TOKENIZER -threads 16 -a -l en >> ${tmp}/${split}.tok.en
done




filede=covost/train_text.txt
tmp=covost_processed_text
mkdir -p ${tmp}

cat $filede | \
  perl $NORM_PUNC de | \
  perl $REM_NON_PRINT_CHAR | \
  perl $TOKENIZER -threads 16 -a -l de >> ${tmp}/train.tok.de


for split in dev test
do
  filede=covost/${split}_text.txt

  cat $filede | \
    perl $TOKENIZER -threads 16 -a -l de >> ${tmp}/${split}.tok.de
done
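
# Sketch, not part of the original script: the subword-nmt repository cloned
# above is not used in this excerpt. Applying the NMT expert's existing BPE
# codes to the tokenized text would look like the lines below;
# EXPERT_BPE_CODES is a hypothetical path to the expert's learned codes.
# python subword-nmt/subword_nmt/apply_bpe.py -c ${EXPERT_BPE_CODES} \
#   < ${tmp}/train.tok.de > ${tmp}/train.bpe.de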





"""
filede=libri_processed/train_text.txt
tmp=libri_processed_text

cat $filede | \
  perl $NORM_PUNC $l | \
  perl $REM_NON_PRINT_CHAR | \
  perl $TOKENIZER -threads 16 -a -l de  >> ${tmp}/train.tok.de

"""