Commit 1162558b authored by chrysanthopoulou's avatar chrysanthopoulou

Add readmes for some of the directories

parent 4bee1187
# 📖 Bilingual Automatic Dictionary Induction for Low-Resource Languages

## 🔍 Project Overview
This repository contains code and resources for the project **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages**. The project focuses on aligning monolingual embeddings and generating bilingual lexicons for **Northern Sámi (SME) and Norwegian Bokmål (NOB)** using **VecMap** and **ClassyMap**.

### 🎯 Objectives
- Develop a BADI pipeline using comparable corpora and embedding-based alignment.
- Improve dictionary induction by integrating classifier-based translation selection.
- Evaluate lexicographically informed optimizations to enhance translation accuracy.
- Contribute to research on NLP for low-resource languages.

## 📂 Repository Structure
```
DIAHRONIC-LLMS-AUTOMATIC-DICTIONARY-INDUCTION/
│── ClassyMap/           # 🔠 Adaptation of ClassyMap repository
│── data/                # 📂 Datasets for training and evaluation
│── results/             # 📊 Output results from different pipeline runs
│── slurm_scripts/       # 🖥️ SLURM batch scripts for running experiments on a cluster
│── src/                 # 🏗️ Source code implementing the BADI pipeline
│── vecmap/              # 🗺️ Adaptation of VecMap repository
│── visualisations/      # 📊 Figures for data visualization and analysis
│── .gitattributes       # ⚙️ Git configuration for handling large files
│── .gitignore           # 🚫 Specifies files and folders to be ignored by Git
│── contribution_statement.md  # 📝 Document outlining contributions by team members
```

## ⚙️ Methods
This project employs **static word embeddings** trained with:
- **FastText** (subword-aware embeddings for morphologically rich languages)
- **Word2Vec** (CBOW and Skip-gram models)
- **GloVe** (global co-occurrence-based embeddings)

For embedding alignment:
- **VecMap** (projection-based cross-lingual alignment)
- **ClassyMap** (classifier-based self-learning for bilingual dictionary induction)
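
The projection step at the heart of VecMap-style alignment can be sketched as an orthogonal Procrustes problem: given seed-dictionary pairs, find the orthogonal map `W` minimising `||XW - Z||`. The matrices below are toy data; the real tools add normalisation, re-weighting, and self-learning on top:

```python
# Orthogonal Procrustes sketch: recover the rotation relating two
# embedding spaces from paired rows (toy data, not real embeddings).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 4))                    # target-language vectors (seed pairs)
R = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # hidden orthogonal rotation
X = Z @ R.T                                    # source vectors = rotated targets

U, _, Vt = np.linalg.svd(X.T @ Z)              # SVD of the cross-covariance
W = U @ Vt                                     # optimal orthogonal map
aligned = X @ W

# In this noiseless toy setup the rotation is recovered exactly.
print(np.allclose(aligned, Z, atol=1e-8))
```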

Evaluation is conducted using:
- **1-to-1 Match Accuracy**
- **Precision@1, 5, 10**
- **Mean Reciprocal Rank (MRR)**
- **Levenshtein and Jaro-Winkler Distance** for polysemous translation filtering
- **Coverage and Clustering Metrics** (Silhouette Score, Davies-Bouldin Score)
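
Two of these metrics can be sketched in a few lines (toy candidate lists and a hypothetical data layout; the actual implementations live in `src/evaluation/`):

```python
# Sketch of Precision@k and MRR. `candidates` maps each source word to a
# ranked list of predicted translations; `gold` holds reference translations.
def precision_at_k(candidates, gold, k):
    hits = sum(1 for w, ranked in candidates.items()
               if any(t in gold[w] for t in ranked[:k]))
    return hits / len(candidates)

def mean_reciprocal_rank(candidates, gold):
    total = 0.0
    for w, ranked in candidates.items():
        for i, t in enumerate(ranked, start=1):
            if t in gold[w]:
                total += 1.0 / i   # reciprocal rank of first correct hit
                break
    return total / len(candidates)

candidates = {"guolli": ["fisk", "vann"], "beana": ["katt", "hund"]}
gold = {"guolli": {"fisk"}, "beana": {"hund"}}
print(precision_at_k(candidates, gold, 1))     # 0.5
print(mean_reciprocal_rank(candidates, gold))  # 0.75
```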

## 🏆 Results
- The best-performing model uses **FastText embeddings (300d)** with **ClassyMap-based classification**.
- **Coverage: 46.23%**, **Precision@1: 27.59%**, **Precision@10: 46.15%**.

## 👥 Contributions
Authors: 
- Finn Hillengass
- Priya Yadav
- Lea Kyveli Chrysanthopoulou
- Valentin Höpfl


## 🙌 Acknowledgments
Special thanks to the **Department of Computational Linguistics, Heidelberg University** for access to the CoLi Cluster and the BwUniCluster.
slurm_scripts/README.md
# 🖥️ SLURM Scripts for Job Submission

This directory contains SLURM batch scripts for executing different stages of the **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages** project on the CoLi Cluster and the BwUniCluster.

## 📌 Folder Structure
```
slurm_scripts/
│── classymap.sh                      # SLURM script for running ClassyMap alignment
│── cluster.sh                        # SLURM script for running clustering experiments
│── embeddings.sh                      # Runs embedding training job
│── embeddings_grid_search.sh          # Runs hyperparameter tuning for embeddings
│── gen_emb.sh                         # Generates embeddings for different configurations
│── run_data_ablation.sh               # Performs dataset ablation studies
│── run_dim_study.sh                   # Runs dimensionality experiments for embeddings
│── run_embedding_type_study.sh        # Compares different embedding models (FastText, Word2Vec, GloVe)
│── run_grid_search.sh                 # Performs hyperparameter tuning for BADI pipeline
│── run_grid_search_add_back_in.sh     # Variant of grid search with added parameters
│── run_grid_search_man.sh             # Manually adjusted grid search script
```

## 🚀 Running SLURM Jobs
### Submitting a Job
To run a specific SLURM script, use:
```bash
sbatch <script_name.sh>
```
Example:
```bash
sbatch embeddings.sh
```

### Monitoring Jobs
To check the status of submitted jobs, use:
```bash
squeue -u $USER
```
To cancel a job:
```bash
scancel <job_id>
```
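
For reference, a skeleton of the kind of batch script this directory contains. All `#SBATCH` values and the final command are placeholders to adapt, not the values actually used on the CoLi Cluster or BwUniCluster:

```bash
#!/bin/bash
# Illustrative skeleton only: partition, resources, and the final
# command are placeholders for the target cluster's configuration.
#SBATCH --job-name=badi-job
#SBATCH --partition=compute
#SBATCH --time=12:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=8
#SBATCH --output=logs/%x-%j.out   # %x = job name, %j = job id

python path/to/your_script.py     # replace with the actual pipeline step
```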

## 📌 Script Descriptions
### 🔹 **Embedding Training Scripts**
- `embeddings.sh`: Runs embedding training.
- `gen_emb.sh`: Generates embeddings based on different configurations.
- `embeddings_grid_search.sh`: Performs a grid search over different embedding hyperparameters.

### 🔹 **Experimental and Evaluation Scripts**
- `run_data_ablation.sh`: Removes subsets of data to measure impact on performance.
- `run_dim_study.sh`: Studies the effect of different embedding dimensionalities.
- `run_embedding_type_study.sh`: Compares performance of FastText, Word2Vec, and GloVe.
- `run_grid_search.sh`: Optimizes embedding hyperparameters for the BADI pipeline.
- `run_grid_search_add_back_in.sh`: Alternative embedding grid search approach with additional parameters.
- `run_grid_search_man.sh`: Manual adjustments for embedding grid search runs.

### 🔹 **Alignment and Clustering**
- `classymap.sh`: Runs ClassyMap for alignment of word embeddings.
- `cluster.sh`: Performs clustering experiments on embeddings.

src/README.md

# 📂 Source Code (src) Overview

This directory contains the core implementation for the **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages** project. The scripts and submodules within this folder facilitate data preprocessing, embedding generation, dictionary induction, and evaluation.

## 📌 Folder Structure
```
src/
│── automatic_dictionary_induction.py    # Script for dictionary induction filtering
│── embeddings/                          # Scripts and utilities for training and loading embeddings
│── evaluation/                          # Scripts for evaluating dictionary induction results
│── pre-processing_and_data_acquisition/ # Scripts for extracting raw datasets and preprocessing them
│── visualisations/                      # Scripts for generating figures and visual analytics
```

## 🔧 Module Descriptions
### 1️⃣ **automatic_dictionary_induction.py**
- This script takes the ranked translation candidates produced by **ClassyMap** and filters them in two stages: first by the score assigned by the classifier, then by the Levenshtein and Jaro-Winkler similarity metrics.
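
A minimal sketch of such two-stage filtering. The threshold values, helper names, and toy candidate triples are illustrative, not taken from the script:

```python
# Stage 1: keep candidates the classifier scored highly.
# Stage 2: drop pairs that are too dissimilar by edit distance.
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def filter_candidates(cands, score_min=0.5, dist_max=4):
    """cands: list of (source, translation, classifier_score) triples."""
    kept = [c for c in cands if c[2] >= score_min]                    # stage 1
    return [c for c in kept if levenshtein(c[0], c[1]) <= dist_max]   # stage 2

cands = [("skuvla", "skole", 0.9),        # cognate pair, high score: kept
         ("skuvla", "universitet", 0.8),  # high score but dissimilar: dropped
         ("beana", "hund", 0.3)]          # low classifier score: dropped
print(filter_candidates(cands))  # [('skuvla', 'skole', 0.9)]
```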

### 2️⃣ **embeddings/**
- Contains scripts for training and saving word embeddings.
- Supports multiple embedding models, including FastText, Word2Vec, and GloVe.

### 3️⃣ **evaluation/**
- Implements evaluation metrics such as **Precision@10, Levenshtein Distance, Jaro-Winkler Distance, and Clustering Scores**.
- Compares induced dictionaries against ground-truth lexicons.
- Evaluates VecMap, ClassyMap, and GPT-4o outputs.

### 4️⃣ **pre-processing_and_data_acquisition/**
- Handles dataset-specific preprocessing by filtering repeated text segments, removing punctuation, and lower-casing.
- Splits the datasets into ablation-study corpora.
- Prepares GPT-4o input.
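
The cleaning steps listed above can be sketched as follows (illustrative only; the real scripts are dataset-specific):

```python
# Sketch of segment-level preprocessing: drop repeated segments,
# strip punctuation, lower-case. Toy input strings.
import string

def preprocess(segments):
    seen, out = set(), []
    table = str.maketrans("", "", string.punctuation)
    for seg in segments:
        cleaned = seg.translate(table).lower().strip()
        if cleaned and cleaned not in seen:  # filter repeated segments
            seen.add(cleaned)
            out.append(cleaned)
    return out

print(preprocess(["Buorre beaivi!", "buorre beaivi", "God dag."]))
# ['buorre beaivi', 'god dag']
```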

### 5️⃣ **visualisations/**
- Contains scripts for plotting and analyzing bilingual dictionary induction performance, corpus metadata, and data ablation studies.
- Generates **bar charts and heatmaps**.