The repository contains two modules for fine-tuning machine translation models on multilingual datasets: preprocess and train. To run the modules, you need to create a config.yaml file.
The config file specifies the location of the pretrained model, the training and validation datasets, and all parameters required for the training and preprocessing steps. Below are the main sections:
- name: Hugging Face ID of the model to be fine-tuned
- train: Details of the training dataset (Hugging Face ID, split, and columns)
- validation: Details of the validation dataset
- Tokenizer arguments such as max_length, padding, and truncation
- Training arguments such as the number of epochs, learning rate, and batch size
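The exact key names are defined by the repository's config parser; as a rough illustration only, a config covering the sections above might look like the sketch below (the nested key names and all values here are assumptions, not the repository's actual schema).

```yaml
# Illustrative sketch only -- the top-level sections mirror the list above,
# but the nested key names and values are assumptions, not the actual schema.
name: facebook/nllb-200-distilled-600M    # Hugging Face ID of the model to fine-tune

train:
  dataset: my-org/my-parallel-corpus      # Hugging Face ID of the training dataset
  split: train
  source_column: source_text
  target_column: target_text

validation:
  dataset: my-org/my-parallel-corpus
  split: validation
  source_column: source_text
  target_column: target_text

tokenizer:
  max_length: 128
  padding: max_length
  truncation: true

training:
  num_epochs: 3
  learning_rate: 2.0e-5
  batch_size: 16

output_dir: tokenized/                    # where the preprocessing step writes its output
```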
The preprocess module tokenizes the training and validation datasets and saves them to disk. To tokenize the datasets, run:
python preprocess.py --config configs/nllb-200-600m-full-dataset-finetune.yaml
The tokenized dataset is saved to the output directory specified in the config file.
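If preprocess.py stores the data in the Hugging Face datasets on-disk format (an assumption about the repository's internals, not something stated here), the output can be inspected before training, for example:

```python
from datasets import load_from_disk

# Assumes the tokenized data was written with save_to_disk() into the
# output directory from the config file (here assumed to be tokenized/).
tokenized = load_from_disk("tokenized/")

print(tokenized)                         # available splits and row counts
print(tokenized["train"].column_names)   # e.g. input_ids, attention_mask, labels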
The train module fine-tunes the baseline model defined in the config file. To start training, pass the config file and the location of the dataset tokenized during the preprocessing step:
python train.py --config configs/nllb-200-600m-full-dataset-finetune.yaml --tokenized_dataset tokenized/
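For readers who want to see roughly what such a fine-tuning step involves, the following sketch shows the equivalent logic using the Hugging Face Seq2SeqTrainer; it is not the repository's train.py, and the model ID, paths, and hyperparameter values are assumptions carried over from the illustrative config above.

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Illustrative sketch, not the repository's train.py.
model_name = "facebook/nllb-200-distilled-600M"   # assumed from the config file name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the dataset produced by the preprocessing step (assumed on-disk format).
tokenized = load_from_disk("tokenized/")

args = Seq2SeqTrainingArguments(
    output_dir="finetuned-nllb-200-600m",         # assumed output location
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```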
To install the required dependencies, run:

pip install -r requirements.txt