The repository contains two modules for fine-tuning machine translation models on multilingual datasets: preprocess and train. To run the modules, you need to create a config.yaml file.
The config file specifies the location of the pretrained model, the training and validation datasets, and all parameters required for the training and preprocessing steps. Below are the main sections:
- name: Hugging Face ID of the model to be fine-tuned
- train: Details of the training dataset (Hugging Face ID, split, and columns)
- validation: Details of the validation dataset
- Tokenizer arguments such as max_length, padding, and truncation
- Training arguments such as the number of epochs, learning rate, and batch size
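The exact key names are defined by the repository's config parser; as a rough illustration only, a config covering the sections above might look like the sketch below (the nested key names and all values here are assumptions, not the repository's actual schema).

```yaml
# Illustrative sketch only -- the top-level sections mirror the list above,
# but the nested key names and values are assumptions, not the actual schema.
name: facebook/nllb-200-distilled-600M    # Hugging Face ID of the model to fine-tune

train:
  dataset: my-org/my-parallel-corpus      # Hugging Face ID of the training dataset
  split: train
  source_column: source_text
  target_column: target_text

validation:
  dataset: my-org/my-parallel-corpus
  split: validation
  source_column: source_text
  target_column: target_text

tokenizer:
  max_length: 128
  padding: max_length
  truncation: true

training:
  num_epochs: 3
  learning_rate: 2.0e-5
  batch_size: 16

output_dir: tokenized/                    # where the preprocessing step writes its output
```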
The preprocess module tokenizes the training and validation datasets and saves them to disk. To tokenize the datasets, run:
python preprocess.py --config configs/nllb-200-600m-full-dataset-finetune.yaml
The tokenized dataset is saved to the output directory specified in the config file.
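If preprocess.py stores the data in the Hugging Face datasets on-disk format (an assumption about the repository's internals, not something stated here), the output can be inspected before training, for example:

```python
from datasets import load_from_disk

# Assumes the tokenized data was written with save_to_disk() into the
# output directory from the config file (here assumed to be tokenized/).
tokenized = load_from_disk("tokenized/")

print(tokenized)                         # available splits and row counts
print(tokenized["train"].column_names)   # e.g. input_ids, attention_mask, labels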
The train module fine-tunes the baseline model defined in the config file. To start training, pass the config file and the location of the dataset tokenized during the preprocessing step:
python train.py --config configs/nllb-200-600m-full-dataset-finetune.yaml --tokenized_dataset tokenized/
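For readers who want to see roughly what such a fine-tuning step involves, the following sketch shows the equivalent logic using the Hugging Face Seq2SeqTrainer; it is not the repository's train.py, and the model ID, paths, and hyperparameter values are assumptions carried over from the illustrative config above.

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Illustrative sketch, not the repository's train.py.
model_name = "facebook/nllb-200-distilled-600M"   # assumed from the config file name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the dataset produced by the preprocessing step (assumed on-disk format).
tokenized = load_from_disk("tokenized/")

args = Seq2SeqTrainingArguments(
    output_dir="finetuned-nllb-200-600m",         # assumed output location
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```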
To install the required dependencies, run:

pip install -r requirements.txt