This project investigates the potential of ByT5, a token-free, byte-level transformer model, for generating high-quality sentence embeddings. It explores various fine-tuning strategies and evaluates the model's performance on semantic textual similarity (STS) tasks, with a focus on its robustness to noise and cross-lingual capabilities.
Most state-of-the-art sentence embedding models rely on subword tokenization, which can be a limitation when dealing with noisy text, morphologically rich languages, or multilingual contexts. This project explores ByT5, a model that operates directly on raw bytes, as a "token-free" alternative.
We fine-tune ByT5 using a contrastive learning framework on Natural Language Inference (NLI) datasets and evaluate its ability to produce meaningful sentence representations. Our findings show that while fine-tuned ByT5 generally trails leading subword-based models such as Sentence-T5 on standard benchmarks, it demonstrates noteworthy potential in specific scenarios, such as improved robustness to character-level noise and stronger cross-lingual transfer.
This work suggests that ByT5 is a valuable alternative for specialized applications, despite the associated computational overhead.
This work is heavily inspired by the research on Sentence-T5, which successfully adapted the T5 model for sentence embedding tasks. We adopt a similar methodology, but with a key difference: while Sentence-T5 uses a standard subword tokenizer, we explore the use of ByT5, a model that operates directly on bytes.
Our goal is to investigate whether the token-free, byte-level approach of ByT5 can offer advantages over Sentence-T5, particularly in scenarios involving:
- Noisy Text: Handling typos, slang, and other character-level perturbations.
- Multilingualism: Processing text from diverse languages without a fixed vocabulary.
- Out-of-Vocabulary Words: Generalizing to words not seen during training.
By replacing the subword tokenizer with a byte-level processor, we aim to create more robust and versatile sentence embeddings, while acknowledging the computational trade-offs.
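To illustrate the difference, ByT5 represents text as a sequence of raw UTF-8 bytes rather than subword tokens, so typos and non-ASCII characters never fall back to an unknown token. The snippet below is a small sketch using the Hugging Face `transformers` tokenizer for `google/byt5-small`; the example strings are ours.

```python
from transformers import AutoTokenizer

# ByT5's "tokenizer" maps each UTF-8 byte to an integer id (plus a small
# offset reserved for special tokens), so there is no subword vocabulary
# and no unknown-token fallback.
tok = AutoTokenizer.from_pretrained("google/byt5-small")

for text in ["robust", "robsut", "café"]:
    ids = tok(text, add_special_tokens=False).input_ids
    print(f"{text!r}: {len(ids)} byte ids -> {ids}")
```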
- Standard Benchmarks: On standard STS benchmarks, fine-tuned ByT5 models generally underperform compared to their subword-based counterparts (e.g., Sentence-T5).
- Robustness to Noise: The byte-level nature of ByT5 provides inherent resilience to character-level perturbations like typos and spelling variations, showing a more graceful degradation in performance compared to subword models (a small typo-injection sketch follows this list).
- Cross-Lingual Transfer: ByT5 shows promise for cross-lingual retrieval tasks, suggesting its token-free approach is beneficial for aligning semantics across different languages and scripts.
- Computational Trade-Offs: Operating at the byte level leads to significantly longer input sequences, resulting in increased memory consumption and slower inference times compared to token-based models.
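To make the robustness comparison concrete, one way to probe it is to perturb one side of each evaluation pair with random character swaps and measure how much the similarity scores drift. The helper below is a hypothetical sketch; the function name, swap-based noise model, and rate are illustrative choices, not this repository's actual evaluation code.

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent characters to simulate typing noise."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Re-run an STS evaluation on perturbed inputs and compare the score drop
# of a byte-level model against a subword baseline.
print(add_typos("A byte-level model should degrade gracefully under typos.", rate=0.1))
```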
Our approach is heavily inspired by the methodologies used for Sentence-T5 and SimCSE.
- Model Architecture: We use a pre-trained ByT5 model and adapt it for sentence embedding generation, primarily applying mean pooling over the encoder's last hidden state to derive a fixed-length sentence vector.
- Fine-Tuning: The model is fine-tuned using a contrastive learning objective. We employ a dual-encoder (siamese) architecture where:
- Positive Pairs: Sentences with an "entailment" relationship from NLI datasets (like SNLI) are pulled closer together in the embedding space.
- Hard Negatives: Sentences with a "contradiction" relationship are used as hard negatives and pushed further apart.
- Loss Function: An in-batch sampled softmax loss is used to optimize the model; a minimal sketch of the pooling and loss appears after this list.
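The sketch below illustrates this setup under stated assumptions: mean pooling over the ByT5 encoder's last hidden state and an in-batch softmax loss over cosine similarities of (premise, entailment) pairs. Hard negatives are omitted for brevity, and the temperature is an illustrative value rather than the project's actual hyperparameter.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small")


def embed(sentences):
    """Mean-pool the ByT5 encoder's last hidden state into sentence vectors."""
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state             # (B, L, H)
    mask = batch.attention_mask.unsqueeze(-1).float()        # (B, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (B, H)


def in_batch_softmax_loss(anchors, positives, temperature=0.05):
    """In-batch softmax over cosine similarities: each anchor's matching
    positive is the correct 'class'; the other positives in the batch act
    as negatives."""
    a = F.normalize(embed(anchors), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    logits = a @ p.T / temperature                           # (B, B)
    labels = torch.arange(len(anchors))                      # diagonal matches
    return F.cross_entropy(logits, labels)


loss = in_batch_softmax_loss(
    ["A man is playing a guitar.", "Two dogs run on the beach."],
    ["A person plays an instrument.", "Dogs are running outside."],
)
loss.backward()
```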
Install the required dependencies. A CUDA-enabled GPU is highly recommended.
```bash
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install core dependencies
pip install transformers datasets sentence-transformers scikit-learn numpy pandas
```

You can evaluate a pre-trained ByT5 model on the Semantic Textual Similarity Benchmark (STS-B) using the provided script.
```bash
python evaluate_stsb.py --model google/byt5-small
```

Arguments:
- `--model`: The Hugging Face model identifier for the ByT5 model (e.g., `google/byt5-small`, `google/byt5-base`).
- `--batch_size`: The batch size for encoding sentences (default: `32`).
- `--device`: The compute device (`cuda`, `mps`, or `cpu`). Defaults to the best available device.
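Under the hood, an STS-B evaluation amounts to embedding both sentences of each pair and correlating their cosine similarities with the gold scores. The snippet below is a rough sketch of that loop, not the actual contents of `evaluate_stsb.py`; the dataset and column names assume the GLUE `stsb` layout on the Hugging Face Hub.

```python
import torch
import torch.nn.functional as F
from datasets import load_dataset
from scipy.stats import spearmanr
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small").eval()


@torch.no_grad()
def embed(sentences):
    """Mean-pooled ByT5 encoder embeddings (see the training sketch above)."""
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    mask = batch.attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


# Column names follow the GLUE "stsb" layout; a real script would encode
# the sentences in batches rather than all at once.
stsb = load_dataset("glue", "stsb", split="validation")
emb1 = F.normalize(embed(list(stsb["sentence1"])), dim=-1)
emb2 = F.normalize(embed(list(stsb["sentence2"])), dim=-1)

cosine = (emb1 * emb2).sum(dim=-1)
rho, _ = spearmanr(cosine.numpy(), stsb["label"])
print(f"Spearman correlation on STS-B validation: {rho:.4f}")
```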
You can use any of the official google/byt5 models from the Hugging Face Hub:
- `google/byt5-small` (300M parameters)
- `google/byt5-base` (580M parameters)
- `google/byt5-large` (1.2B parameters)
- `google/byt5-xl` (3.7B parameters)
- `byt5.py`: Contains the core functions for generating sentence embeddings with a ByT5 model.
- `evaluate_stsb.py`: Main script for running the STS-B evaluation.
- `training/`: Scripts and resources related to model training and fine-tuning.
- `report.md`: A detailed report on the project's methodology, results, and conclusions.
- ByT5: Xue, L., et al. (2021). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models.
- Sentence-T5: Ni, J., et al. (2021). Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models.
- Sentence-BERT: Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
- SimCSE: Gao, T., et al. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings.