This project investigates the potential of ByT5, a token-free, byte-level transformer model, for generating high-quality sentence embeddings. It explores various fine-tuning strategies and evaluates the model's performance on semantic textual similarity (STS) tasks, with a focus on its robustness to noise and cross-lingual capabilities.
Most state-of-the-art sentence embedding models rely on subword tokenization, which can be a limitation when dealing with noisy text, morphologically rich languages, or multilingual contexts. This project explores ByT5, a model that operates directly on raw bytes, as a "token-free" alternative.
We fine-tune ByT5 using a contrastive learning framework on Natural Language Inference (NLI) datasets and evaluate its ability to produce meaningful sentence representations. Our findings show that while fine-tuned ByT5 generally trails leading subword-based models such as Sentence-T5 on standard benchmarks, it demonstrates noteworthy potential in specific scenarios, such as improved robustness to character-level noise and stronger cross-lingual transfer.
This work suggests that ByT5 is a valuable alternative for specialized applications, despite the associated computational overhead.
This work is heavily inspired by the research on Sentence-T5, which successfully adapted the T5 model for sentence embedding tasks. We adopt a similar methodology, but with a key difference: while Sentence-T5 uses a standard subword tokenizer, we explore the use of ByT5, a model that operates directly on bytes.
Our goal is to investigate whether the token-free, byte-level approach of ByT5 can offer advantages over Sentence-T5, particularly in scenarios involving:
- Noisy Text: Handling typos, slang, and other character-level perturbations.
- Multilingualism: Processing text from diverse languages without a fixed vocabulary.
- Out-of-Vocabulary Words: Generalizing to words not seen during training.
By replacing the subword tokenizer with a byte-level processor, we aim to create more robust and versatile sentence embeddings, while acknowledging the computational trade-offs.
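To illustrate the difference, ByT5 represents text as a sequence of raw UTF-8 bytes rather than subword tokens, so typos and non-ASCII characters never fall back to an unknown token. The snippet below is a small sketch using the Hugging Face `transformers` tokenizer for `google/byt5-small`; the example strings are ours.

```python
from transformers import AutoTokenizer

# ByT5's "tokenizer" maps each UTF-8 byte to an integer id (plus a small
# offset reserved for special tokens), so there is no subword vocabulary
# and no unknown-token fallback.
tok = AutoTokenizer.from_pretrained("google/byt5-small")

for text in ["robust", "robsut", "café"]:
    ids = tok(text, add_special_tokens=False).input_ids
    print(f"{text!r}: {len(ids)} byte ids -> {ids}")
```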
- Standard Benchmarks: On standard STS benchmarks, fine-tuned ByT5 models generally underperform compared to their subword-based counterparts (e.g., Sentence-T5).
- Robustness to Noise: The byte-level nature of ByT5 provides inherent resilience to character-level perturbations like typos and spelling variations, showing a more graceful degradation in performance compared to subword models (a small typo-injection sketch follows this list).
- Cross-Lingual Transfer: ByT5 shows promise for cross-lingual retrieval tasks, suggesting its token-free approach is beneficial for aligning semantics across different languages and scripts.
- Computational Trade-Offs: Operating at the byte level leads to significantly longer input sequences, resulting in increased memory consumption and slower inference times compared to token-based models.
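To make the robustness comparison concrete, one way to probe it is to perturb one side of each evaluation pair with random character swaps and measure how much the similarity scores drift. The helper below is a hypothetical sketch; the function name, swap-based noise model, and rate are illustrative choices, not this repository's actual evaluation code.

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent characters to simulate typing noise."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Re-run an STS evaluation on perturbed inputs and compare the score drop
# of a byte-level model against a subword baseline.
print(add_typos("A byte-level model should degrade gracefully under typos.", rate=0.1))
```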
Our approach is heavily inspired by the methodologies used for Sentence-T5 and SimCSE.
- Model Architecture: We use a pre-trained ByT5 model and adapt it for sentence embedding generation, primarily applying mean pooling over the encoder's last hidden state to derive a fixed-length sentence vector.
- Fine-Tuning: The model is fine-tuned using a contrastive learning objective. We employ a dual-encoder (siamese) architecture where:
- Positive Pairs: Sentences with an "entailment" relationship from NLI datasets (like SNLI) are pulled closer together in the embedding space.
- Hard Negatives: Sentences with a "contradiction" relationship are used as hard negatives and pushed further apart.
- Loss Function: An in-batch sampled softmax loss is used to optimize the model; a minimal sketch of the pooling and loss appears after this list.
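The sketch below illustrates this setup under stated assumptions: mean pooling over the ByT5 encoder's last hidden state and an in-batch softmax loss over cosine similarities of (premise, entailment) pairs. Hard negatives are omitted for brevity, and the temperature is an illustrative value rather than the project's actual hyperparameter.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small")


def embed(sentences):
    """Mean-pool the ByT5 encoder's last hidden state into sentence vectors."""
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state             # (B, L, H)
    mask = batch.attention_mask.unsqueeze(-1).float()        # (B, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (B, H)


def in_batch_softmax_loss(anchors, positives, temperature=0.05):
    """In-batch softmax over cosine similarities: each anchor's matching
    positive is the correct 'class'; the other positives in the batch act
    as negatives."""
    a = F.normalize(embed(anchors), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    logits = a @ p.T / temperature                           # (B, B)
    labels = torch.arange(len(anchors))                      # diagonal matches
    return F.cross_entropy(logits, labels)


loss = in_batch_softmax_loss(
    ["A man is playing a guitar.", "Two dogs run on the beach."],
    ["A person plays an instrument.", "Dogs are running outside."],
)
loss.backward()
```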
Install the required dependencies. A CUDA-enabled GPU is highly recommended.
```bash
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install core dependencies
pip install transformers datasets sentence-transformers scikit-learn numpy pandas
```

You can evaluate a pre-trained ByT5 model on the Semantic Textual Similarity Benchmark (STS-B) using the provided script.
```bash
python evaluate_stsb.py --model google/byt5-small
```

Arguments:
- `--model`: The Hugging Face model identifier for the ByT5 model (e.g., `google/byt5-small`, `google/byt5-base`).
- `--batch_size`: The batch size for encoding sentences (default: `32`).
- `--device`: The compute device (`cuda`, `mps`, or `cpu`). Defaults to the best available device.
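Under the hood, an STS-B evaluation amounts to embedding both sentences of each pair and correlating their cosine similarities with the gold scores. The snippet below is a rough sketch of that loop, not the actual contents of `evaluate_stsb.py`; the dataset and column names assume the GLUE `stsb` layout on the Hugging Face Hub.

```python
import torch
import torch.nn.functional as F
from datasets import load_dataset
from scipy.stats import spearmanr
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small").eval()


@torch.no_grad()
def embed(sentences):
    """Mean-pooled ByT5 encoder embeddings (see the training sketch above)."""
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    mask = batch.attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


# Column names follow the GLUE "stsb" layout; a real script would encode
# the sentences in batches rather than all at once.
stsb = load_dataset("glue", "stsb", split="validation")
emb1 = F.normalize(embed(list(stsb["sentence1"])), dim=-1)
emb2 = F.normalize(embed(list(stsb["sentence2"])), dim=-1)

cosine = (emb1 * emb2).sum(dim=-1)
rho, _ = spearmanr(cosine.numpy(), stsb["label"])
print(f"Spearman correlation on STS-B validation: {rho:.4f}")
```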
You can use any of the official google/byt5 models from the Hugging Face Hub:
- `google/byt5-small` (300M parameters)
- `google/byt5-base` (580M parameters)
- `google/byt5-large` (1.2B parameters)
- `google/byt5-xl` (3.7B parameters)
- `byt5.py`: Contains the core functions for generating sentence embeddings with a ByT5 model.
- `evaluate_stsb.py`: Main script for running the STS-B evaluation.
- `training/`: Scripts and resources related to model training and fine-tuning.
- `report.md`: A detailed report on the project's methodology, results, and conclusions.
- ByT5: Xue, L., et al. (2021). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models.
- Sentence-T5: Ni, J., et al. (2021). Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models.
- Sentence-BERT: Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
- SimCSE: Gao, T., et al. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings.