Sentence-ByT5: Exploring Token-Free Sentence Embeddings

This project investigates the potential of ByT5, a token-free, byte-level transformer model, for generating high-quality sentence embeddings. It explores various fine-tuning strategies and evaluates the model's performance on semantic textual similarity (STS) tasks, with a focus on its robustness to noise and cross-lingual capabilities.

Abstract

Most state-of-the-art sentence embedding models rely on subword tokenization, which can be a limitation when dealing with noisy text, morphologically rich languages, or multilingual contexts. This project explores ByT5, a model that operates directly on raw bytes, as a "token-free" alternative.

We fine-tune ByT5 using a contrastive learning framework on Natural Language Inference (NLI) datasets and evaluate its ability to produce meaningful sentence representations. Our findings show that while ByT5 does not consistently outperform leading subword-based models like Sentence-T5 on standard benchmarks, it demonstrates noteworthy potential in specific scenarios, such as improved robustness to character-level noise and stronger cross-lingual transfer.

This work suggests that ByT5 is a valuable alternative for specialized applications, despite the associated computational overhead.

Relation to Sentence-T5

This work is heavily inspired by the research on Sentence-T5, which successfully adapted the T5 model for sentence embedding tasks. We adopt a similar methodology, but with a key difference: while Sentence-T5 uses a standard subword tokenizer, we explore the use of ByT5, a model that operates directly on bytes.

Our goal is to investigate whether the token-free, byte-level approach of ByT5 can offer advantages over Sentence-T5, particularly in scenarios involving:

  • Noisy Text: Handling typos, slang, and other character-level perturbations.
  • Multilingualism: Processing text from diverse languages without a fixed vocabulary.
  • Out-of-Vocabulary Words: Generalizing to words not seen during training.

By replacing the subword tokenizer with a byte-level processor, we aim to create more robust and versatile sentence embeddings, while acknowledging the computational trade-offs.

Key Findings

  • Standard Benchmarks: On standard STS benchmarks, fine-tuned ByT5 models generally underperform compared to their subword-based counterparts (e.g., Sentence-T5).
  • Robustness to Noise: The byte-level nature of ByT5 provides inherent resilience to character-level perturbations like typos and spelling variations, showing a more graceful degradation in performance compared to subword models.
  • Cross-Lingual Transfer: ByT5 shows promise for cross-lingual retrieval tasks, suggesting its token-free approach is beneficial for aligning semantics across different languages and scripts.
  • Computational Trade-Offs: Operating at the byte level leads to significantly longer input sequences, resulting in increased memory consumption and slower inference times compared to token-based models (the short comparison below illustrates the sequence-length gap).
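
As a quick illustration of this trade-off, the snippet below compares the byte-level input length of ByT5 against the subword input length of a standard T5 tokenizer on the same sentence. The model identifiers are public Hugging Face checkpoints; the exact length ratio depends on the text.

# Minimal sketch: compare byte-level vs. subword input lengths.
# Assumes the transformers package from the Installation section below
# (the T5 subword tokenizer additionally needs the sentencepiece package).
from transformers import AutoTokenizer

byt5_tok = AutoTokenizer.from_pretrained("google/byt5-small")  # byte-level
t5_tok = AutoTokenizer.from_pretrained("t5-small")             # subword (SentencePiece)

sentence = "Token-free models operate directly on raw bytes."

byte_len = len(byt5_tok(sentence)["input_ids"])     # roughly one id per byte, plus EOS
subword_len = len(t5_tok(sentence)["input_ids"])    # far fewer ids for the same text

print(f"ByT5 input length: {byte_len}")
print(f"T5 input length:   {subword_len}")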

Methodology

Our approach follows the methodologies used for Sentence-T5 and SimCSE.

  1. Model Architecture: We use a pre-trained ByT5 model and adapt it for sentence embedding generation. We primarily use mean pooling over the encoder's last hidden state to derive a fixed-length sentence vector (see the pooling sketch after this list).
  2. Fine-Tuning: The model is fine-tuned using a contrastive learning objective. We employ a dual-encoder (siamese) architecture where:
    • Positive Pairs: Sentences with an "entailment" relationship from NLI datasets (like SNLI) are pulled closer together in the embedding space.
    • Hard Negatives: Sentences with a "contradiction" relationship are used as hard negatives and pushed further apart.
    • Loss Function: An in-batch sampled softmax loss is used to optimize the model (see the loss sketch after this list).
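
As a concrete illustration of step 1, here is a minimal mean-pooling sketch. It assumes only the transformers and torch packages from the Installation section below; the function name embed_sentences is illustrative rather than the repository's actual API (see byt5.py for the project's implementation).

# Minimal sketch of mean pooling over the ByT5 encoder's last hidden state.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small")

def embed_sentences(sentences):
    # Tokenize to byte-level ids, padding so the batch is rectangular.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state           # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()       # (B, L, 1)
    # Average only over non-padding positions.
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts                                     # (B, H)

embeddings = embed_sentences(["A dog runs in the park.", "A puppy plays outside."])
print(embeddings.shape)   # torch.Size([2, hidden_size])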
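
For step 2, the sketch below shows one way to implement an in-batch sampled softmax loss with contradiction hypotheses as hard negatives: each premise is scored against every hypothesis in the batch, and cross-entropy pulls the matching entailment pair together. The temperature value and the function signature are assumptions for illustration, not the repository's exact training code.

# Minimal sketch of an in-batch softmax contrastive loss with hard negatives.
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(anchor_emb, positive_emb, negative_emb, temperature=0.05):
    # anchor_emb:   (B, H) premise embeddings
    # positive_emb: (B, H) entailment-hypothesis embeddings
    # negative_emb: (B, H) contradiction-hypothesis embeddings (hard negatives)
    anchor = F.normalize(anchor_emb, dim=-1)
    candidates = F.normalize(torch.cat([positive_emb, negative_emb], dim=0), dim=-1)
    # Cosine similarity of each anchor against every candidate in the batch.
    logits = anchor @ candidates.T / temperature               # (B, 2B)
    # The matching positive for anchor i sits at column i.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)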

How to Use This Repository

Installation

Install the required dependencies. A CUDA-enabled GPU is highly recommended.

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install core dependencies
pip install transformers datasets sentence-transformers scikit-learn numpy pandas

Evaluating a Model on STS-B

You can evaluate a pre-trained ByT5 model on the Semantic Textual Similarity Benchmark (STS-B) using the provided script.

python evaluate_stsb.py --model google/byt5-small

Arguments:

  • --model: The Hugging Face model identifier for the ByT5 model (e.g., google/byt5-small, google/byt5-base).
  • --batch_size: The batch size for encoding sentences (default: 32).
  • --device: The compute device (cuda, mps, cpu). Defaults to the best available.
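
For example, to evaluate the base model on a GPU with a smaller batch size (the argument values here are only illustrative):

python evaluate_stsb.py --model google/byt5-base --batch_size 16 --device cuda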

Available ByT5 Models

You can use any of the official google/byt5 models from the Hugging Face Hub:

  • google/byt5-small (300M parameters)
  • google/byt5-base (580M parameters)
  • google/byt5-large (1.2B parameters)
  • google/byt5-xl (3.7B parameters)

Core Project Files

  • byt5.py: Contains the core functions for generating sentence embeddings using a ByT5 model.
  • evaluate_stsb.py: Main script for running the STS-B evaluation.
  • training/: Contains scripts and resources related to model training and fine-tuning.
  • report.md: A detailed report on the project's methodology, results, and conclusions.
