# 📖 Bilingual Automatic Dictionary Induction for Low-Resource Languages
## 🔍 Project Overview
This repository contains code and resources for the project **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages**. The project focuses on aligning monolingual embeddings and generating bilingual lexicons for **Northern Sámi (SME) and Norwegian Bokmål (NOB)** using **VecMap** and **ClassyMap**.
### 🎯 Objectives
- Develop a BADI pipeline using comparable corpora and embedding-based alignment.
- Improve dictionary induction by integrating classifier-based translation selection.
- Evaluate lexicographically informed optimizations to enhance translation accuracy.
- Contribute to research on NLP for low-resource languages.
## 📂 Repository Structure
```
DIAHRONIC-LLMS-AUTOMATIC-DICTIONARY-INDUCTION/
│── ClassyMap/ # 🔠 Adaptation of ClassyMap repository
│── data/ # 📂 Datasets for training and evaluation
│── results/ # 📊 Output results from different pipeline runs
│── slurm_scripts/ # 🖥️ SLURM batch scripts for running experiments on a cluster
│── src/ # 🏗️ Source code implementing the BADI pipeline
│── vecmap/ # 🗺️ Adaptation of VecMap repository
│── visualisations/ # 📊 Figures for data visualization and analysis
│── .gitattributes # ⚙️ Git configuration for handling large files
│── .gitignore # 🚫 Specifies files and folders to be ignored by Git
│── contribution_statement.md # 📝 Document outlining contributions by team members
```
## ⚙️ Methods
This project employs **static word embeddings** trained with:
- **FastText** (subword-aware embeddings for morphologically rich languages)
Special thanks to the **Department of Computational Linguistics, Heidelberg University** for access to the CoLi Cluster and the BwUniCluster.
## 🖥️ SLURM Scripts
This directory contains SLURM batch scripts for executing different stages of the **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages** project on the CoLi Cluster and the BwUniCluster.
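A typical job script in this folder follows the usual `sbatch` layout. The sketch below is illustrative only: the partition, resource limits, environment path, and Python entry point are placeholders that depend on the cluster and the actual script names.

```shell
#!/bin/bash
#SBATCH --job-name=badi-embeddings
#SBATCH --partition=students        # placeholder: cluster-specific partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --output=logs/%x-%j.out     # %x = job name, %j = job id

# Activate the project environment (path is a placeholder)
source ~/venvs/badi/bin/activate

# Hypothetical entry point for embedding training
python src/embeddings/train_embeddings.py --lang sme --model fasttext
```

Submit with `sbatch embeddings.sh` and monitor with `squeue -u $USER`.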
## 📌 Folder Structure
```
slurm_scripts/
│── classymap.sh # SLURM script for running ClassyMap alignment
│── cluster.sh # SLURM script for running clustering experiments
│── embeddings.sh # Runs embedding training job
│── embeddings_grid_search.sh # Runs hyperparameter tuning for embeddings
│── gen_emb.sh # Generates embeddings for different configurations
```
## 🏗️ Source Code
This directory contains the core implementation for the **Bilingual Automatic Dictionary Induction (BADI) for Low-Resource Languages** project. The scripts and submodules within this folder facilitate data preprocessing, embedding generation, dictionary induction, and evaluation.
## 📌 Folder Structure
```
src/
│── automatic_dictionary_induction.py # Script for dictionary induction filtering
│── embeddings/ # Scripts and utilities for training and loading embeddings
│── evaluation/ # Scripts for evaluating dictionary induction results
│── pre-processing_and_data_acquisition/ # Scripts for extracting raw datasets and preprocessing them
│── visualisations/ # Scripts for generating figures and visual analytics
```
## 🔧 Module Descriptions
### 1️⃣ **automatic_dictionary_induction.py**
- This script takes the ranked translation candidates produced by **ClassyMap**, filters them by the classifier's confidence score, and then applies a second filter based on the Levenshtein and Jaro-Winkler string-similarity metrics.
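As a rough illustration of such a two-stage filter (the thresholds, the tuple layout, and the direction of the string-similarity test are assumptions for this sketch, not the project's actual settings — here near-identical strings are dropped as likely untranslated copies):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matched characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    f1, f2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not f2[j] and s2[j] == c:
                f1[i] = f2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len(s1)):
        if f1[i]:
            while not f2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def filter_candidates(ranked, min_score=0.5, max_sim=0.95):
    """Hypothetical filter: keep (src, tgt, score) triples whose classifier
    score passes min_score, then drop orthographic near-duplicates."""
    kept = []
    for src, tgt, score in ranked:
        if score < min_score:
            continue
        lev_sim = 1.0 - levenshtein(src, tgt) / max(len(src), len(tgt), 1)
        if max(lev_sim, jaro_winkler(src, tgt)) > max_sim:
            continue  # (near-)identical string: likely an untranslated copy
        kept.append((src, tgt, score))
    return kept
```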
### 2️⃣ **embeddings/**
- Contains scripts for training and saving word embeddings.
- Supports multiple embedding models, including FastText, Word2Vec, and GloVe.
### 3️⃣ **evaluation/**
- Implements evaluation metrics such as **Precision@10, Levenshtein Distance, Jaro-Winkler Distance, and Clustering Scores**.
- Compares induced dictionaries against ground-truth lexicons.
- Evaluates VecMap, ClassyMap, and GPT-4o outputs.
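A minimal sketch of Precision@k over an induced dictionary (the data layout and the decision to skip source words missing from the predictions are assumptions of this sketch):

```python
def precision_at_k(predictions, gold, k=10):
    """Fraction of gold source words with a correct translation in the top k.

    predictions: {src_word: [ranked candidate translations]}
    gold:        {src_word: {acceptable translations}}
    Source words absent from predictions are skipped here; a stricter
    convention would count them as misses instead.
    """
    hits = evaluated = 0
    for src, refs in gold.items():
        if src not in predictions:
            continue
        evaluated += 1
        if any(cand in refs for cand in predictions[src][:k]):
            hits += 1
    return hits / evaluated if evaluated else 0.0
```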
### 4️⃣ **pre-processing_and_data_acquisition/**
- Handles dataset-specific preprocessing: filtering repeated text segments, removing punctuation, and lower-casing.
- Splits the datasets into ablation-study corpora.
- Prepares GPT-4o input.
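The cleaning steps above can be sketched as follows (a stdlib-only illustration; the real scripts are dataset-specific). Note that `string.punctuation` is ASCII-only, so Sámi letters such as á and č pass through untouched:

```python
import re
import string

def preprocess(lines):
    """Lower-case, strip ASCII punctuation, normalize whitespace,
    and drop exact repeats of already-seen segments."""
    table = str.maketrans("", "", string.punctuation)
    seen, out = set(), []
    for line in lines:
        cleaned = re.sub(r"\s+", " ", line.lower().translate(table)).strip()
        if cleaned and cleaned not in seen:  # skip empties and repeats
            seen.add(cleaned)
            out.append(cleaned)
    return out
```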
### 5️⃣ **visualisations/**
- Contains scripts for plotting and analyzing bilingual dictionary induction performance, corpus metadata, and data ablation studies.