# 📖 Bilingual Automatic Dictionary Induction for Low-Resource Languages
## 🔍 Project Overview
This repository contains code and resources for the project **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages**. The project focuses on aligning monolingual embeddings and generating bilingual lexicons for **Northern Sámi (SME) and Norwegian Bokmål (NOB)** using **VecMap** and **ClassyMap**.
### 🎯 Objectives
- Develop a BADI pipeline using comparable corpora and embedding-based alignment.
- Improve dictionary induction by integrating classifier-based translation selection.
- Evaluate lexicographically informed optimizations to enhance translation accuracy.
- Contribute to research on NLP for low-resource languages.
## 📂 Repository Structure
```
DIAHRONIC-LLMS-AUTOMATIC-DICTIONARY-INDUCTION/
│── ClassyMap/ # 🔠 Adaptation of ClassyMap repository
│── data/ # 📂 Datasets for training and evaluation
│── results/ # 📊 Output results from different pipeline runs
│── slurm_scripts/ # 🖥️ SLURM batch scripts for running experiments on a cluster
│── src/ # 🏗️ Source code implementing the BADI pipeline
│── vecmap/ # 🗺️ Adaptation of VecMap repository
│── visualisations/ # 📊 Figures for data visualization and analysis
│── .gitattributes # ⚙️ Git configuration for handling large files
│── .gitignore # 🚫 Specifies files and folders to be ignored by Git
│── contribution_statement.md # 📝 Document outlining contributions by team members
```
## ⚙️ Methods
This project employs **static word embeddings** trained with:
- **FastText** (subword-aware embeddings for morphologically rich languages)
Special thanks to the **Department of Computational Linguistics, Heidelberg University** for access to the CoLi Cluster and the BwUniCluster.
## 🖥️ SLURM Scripts
This directory contains SLURM batch scripts for executing different stages of the **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages** project on the CoLi Cluster and the BwUniCluster.
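A typical job script in this folder follows the usual `sbatch` layout. The sketch below is illustrative only: the partition, resource limits, environment path, and Python entry point are placeholders that depend on the cluster and the actual script names.

```shell
#!/bin/bash
#SBATCH --job-name=badi-embeddings
#SBATCH --partition=students        # placeholder: cluster-specific partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --output=logs/%x-%j.out     # %x = job name, %j = job id

# Activate the project environment (path is a placeholder)
source ~/venvs/badi/bin/activate

# Hypothetical entry point for embedding training
python src/embeddings/train_embeddings.py --lang sme --model fasttext
```

Submit with `sbatch embeddings.sh` and monitor with `squeue -u $USER`.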
## 📌 Folder Structure
```
slurm_scripts/
│── classymap.sh # SLURM script for running ClassyMap alignment
│── cluster.sh # SLURM script for running clustering experiments
│── embeddings.sh # Runs embedding training job
│── embeddings_grid_search.sh # Runs hyperparameter tuning for embeddings
│── gen_emb.sh # Generates embeddings for different configurations
```
## 🏗️ Source Code
This directory contains the core implementation for the **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages** project. The scripts and submodules within this folder facilitate data preprocessing, embedding generation, dictionary induction, and evaluation.
## 📌 Folder Structure
```
src/
│── automatic_dictionary_induction.py # Script for dictionary induction filtering
│── embeddings/ # Scripts and utilities for training and loading embeddings
│── evaluation/ # Scripts for evaluating dictionary induction results
│── pre-processing_and_data_acquisition/ # Scripts for extracting raw datasets and preprocessing them
│── visualisations/ # Scripts for generating figures and visual analytics
```
## 🔧 Module Descriptions
### 1️⃣ **automatic_dictionary_induction.py**
- This script takes the ranked translation candidates produced by **ClassyMap**, filters them by the classifier's confidence score, and then applies a second filter based on the Levenshtein and Jaro-Winkler string-similarity metrics.
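As a rough illustration of such a two-stage filter (the thresholds, the tuple layout, and the direction of the string-similarity test are assumptions for this sketch, not the project's actual settings — here near-identical strings are dropped as likely untranslated copies):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matched characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    f1, f2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not f2[j] and s2[j] == c:
                f1[i] = f2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len(s1)):
        if f1[i]:
            while not f2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def filter_candidates(ranked, min_score=0.5, max_sim=0.95):
    """Hypothetical filter: keep (src, tgt, score) triples whose classifier
    score passes min_score, then drop orthographic near-duplicates."""
    kept = []
    for src, tgt, score in ranked:
        if score < min_score:
            continue
        lev_sim = 1.0 - levenshtein(src, tgt) / max(len(src), len(tgt), 1)
        if max(lev_sim, jaro_winkler(src, tgt)) > max_sim:
            continue  # (near-)identical string: likely an untranslated copy
        kept.append((src, tgt, score))
    return kept
```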
### 2️⃣ **embeddings/**
- Contains scripts for training and saving word embeddings.
- Supports multiple embedding models, including FastText, Word2Vec, and GloVe.
### 3️⃣ **evaluation/**
- Implements evaluation metrics such as **Precision@10, Levenshtein Distance, Jaro-Winkler Distance, and Clustering Scores**.
- Compares induced dictionaries against ground-truth lexicons.
- Evaluates VecMap, ClassyMap, and GPT-4o outputs.
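A minimal sketch of Precision@k over an induced dictionary (the data layout and the decision to skip source words missing from the predictions are assumptions of this sketch):

```python
def precision_at_k(predictions, gold, k=10):
    """Fraction of gold source words with a correct translation in the top k.

    predictions: {src_word: [ranked candidate translations]}
    gold:        {src_word: {acceptable translations}}
    Source words absent from predictions are skipped here; a stricter
    convention would count them as misses instead.
    """
    hits = evaluated = 0
    for src, refs in gold.items():
        if src not in predictions:
            continue
        evaluated += 1
        if any(cand in refs for cand in predictions[src][:k]):
            hits += 1
    return hits / evaluated if evaluated else 0.0
```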
### 4️⃣ **pre-processing_and_data_acquisition/**
- Handles dataset-specific preprocessing: filtering repeated text segments, removing punctuation, and lower-casing.
- Splits the datasets into ablation-study corpora.
- Prepares GPT-4o input.
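The cleaning steps above can be sketched as follows (a stdlib-only illustration; the real scripts are dataset-specific). Note that `string.punctuation` is ASCII-only, so Sámi letters such as á and č pass through untouched:

```python
import re
import string

def preprocess(lines):
    """Lower-case, strip ASCII punctuation, normalize whitespace,
    and drop exact repeats of already-seen segments."""
    table = str.maketrans("", "", string.punctuation)
    seen, out = set(), []
    for line in lines:
        cleaned = re.sub(r"\s+", " ", line.lower().translate(table)).strip()
        if cleaned and cleaned not in seen:  # skip empties and repeats
            seen.add(cleaned)
            out.append(cleaned)
    return out
```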
### 5️⃣ **visualisations/**
- Contains scripts for plotting and analyzing bilingual dictionary induction performance, corpus metadata, and data ablation studies.