[SYSTEMDS-2850] Generalized parameter server Autoencoder#2434

Open
AdityaPandey2612 wants to merge 13 commits into apache:main from AdityaPandey2612:FEATURE#2850-parameter-server-autoencoder

Conversation

@AdityaPandey2612

SystemDS: Parameter Server Autoencoder with variable hidden layers (Already rebased onto the main branch of SystemDS)

Overview

This repository contains a comprehensive implementation and experimental evaluation of distributed autoencoder training using Apache SystemDS. The project implements a generalized symmetric autoencoder in DML (Declarative Machine Learning) and provides a complete infrastructure for automated testing, validation, and performance analysis of parameter server-based distributed training.

Key Contributions

1. Generalized Autoencoder Implementation (DML)

  • autoencoder_2layer.dml (867 lines): Core implementation supporting both DEFAULTSERVER (single-node) and PARAMSERVER (distributed) training modes
  • autoencoderGeneralized.dml (130 lines): Wrapper enabling arbitrary encoder depths with symmetric decoder mirroring
  • autoGradientCheck.dml (95 lines): Finite-difference gradient verification for correctness validation
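
The finite-difference check performed by autoGradientCheck.dml can be illustrated with a small NumPy sketch (an illustration of the technique, not the DML code itself): perturb each weight by ±ε, re-evaluate the loss, and compare the central difference against the analytic gradient.

```python
import numpy as np

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar loss f at w."""
    g = np.zeros_like(w)
    for i in np.ndindex(w.shape):
        orig = w[i]
        w[i] = orig + eps
        plus = f(w)
        w[i] = orig - eps
        minus = f(w)
        w[i] = orig                       # restore the weight
        g[i] = (plus - minus) / (2 * eps)
    return g

# Toy check: analytic vs numeric gradient of a least-squares loss.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = X @ np.array([1.0, -2.0, 0.5])
loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
w = np.zeros(3)
analytic = X.T @ (X @ w - y)
numeric = numeric_grad(loss, w)
rel_err = np.abs(analytic - numeric).max() / np.abs(analytic).max()
# rel_err typically lands in the 1e-9 to 1e-7 range reported below
```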

2. Automated Testing Suite

JUnit Integration Tests

The implementation includes comprehensive JUnit tests integrated with the SystemDS test framework:

BuiltinAutoencoderGeneralizedTest.java

Complete test suite covering multiple architectural configurations:

  • testAutoencoderThreeLayerOutputs: Validates 3-layer encoder (16→8→4) architecture
  • testAutoencoderTwoLayerOutputs: Tests 2-layer encoder (16→8) with automatic bottleneck detection
  • testAutoencoderSingleLayerOutputs: Single-layer encoder (16) edge case
  • testAutoencoderSparseInputOutputs: Sparse data handling (20% density) with deeper network (32→16→8)
  • testAutoencoderParamservOutputs: PARAMSERVER mode validation

Test Coverage:

  • Matrix dimension verification (weights, hidden representations)
  • Output consistency checks (W1, Wlast, hidden layer)
  • Sparse and dense input matrices
  • Multiple encoder depths (1-3 layers)
  • Both DEFAULTSERVER and PARAMSERVER training modes

BuiltinAutoencoderGeneralizedBasicTest.java

Basic sanity test for quick validation:

  • testAutoencoderThreeLayerOutputs: Smoke test for standard 3-layer configuration

Key Features:

  • Arbitrary encoder depth with automatic decoder construction
  • Glorot weight initialization for deep networks
  • Support for multiple activation functions (tanh, sigmoid, ReLU)
  • Integration with SystemDS parameter server framework
  • Z-score normalization and random data shuffling
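
Two of these features, Glorot initialization and z-score normalization, follow standard formulas that can be restated outside DML. The NumPy snippet below is an illustrative sketch, not the script's own code:

```python
import numpy as np

rng = np.random.default_rng(42)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform: U(-L, L) with L = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def zscore(X, eps=1e-8):
    """Column-wise z-score: zero mean, unit variance per feature."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

W1 = glorot_uniform(64, 16)                   # first layer of a 64->16->8->4 encoder
Xn = zscore(rng.standard_normal((1024, 64)))  # normalized training matrix
```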

3. Experimental, Correctness, and Testing Infrastructure

Automated Experiment Runner:

  • run_sysds_experiments.py: Python-based automation framework for executing experiment sweeps
    • YAML-based configuration management
    • Grid parameter expansion
    • Multi-repeat execution for statistical analysis
    • Automatic CSV result generation with detailed metrics
    • Progress tracking and error handling
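
The grid-expansion step can be sketched as a cross-product over parameter lists. This is a hedged illustration of the idea only: the actual run_sysds_experiments.py reads its grid from YAML, and here a plain dict stands in for the parsed config.

```python
import itertools

def expand_grid(grid):
    """Expand {param: [values]} into the cross-product of flat configs."""
    keys = sorted(grid)
    return [dict(zip(keys, vals))
            for vals in itertools.product(*(grid[k] for k in keys))]

# Stand-in for a parsed YAML grid section (parameter names as used in the CLI):
grid = {"WORKERS": [2, 4], "K": [2, 3], "UTYPE": ["SBP"]}
configs = expand_grid(grid)
print(len(configs))  # → 4 (2 worker counts x 2 values of K x 1 update type)
```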

Configuration Files (8 YAML configs):

  • e16_default.yaml: DEFAULTSERVER baseline experiments
  • e16_ps_w2.yaml: 2-worker parameter server configurations
  • e16_ps_w4.yaml: 4-worker parameter server with K-parameter sweep
  • epoch_curve.yaml: Convergence analysis over training epochs
  • epoch_curve_sbp.yaml: SBP staleness parameter exploration
  • gradient_check.yaml: Gradient verification test suite
  • stress_suite.yaml: Comprehensive stress testing (30+ configurations)
  • epoch_curve_fast.yaml: Quick validation experiments

Experimental Results Summary

Correctness Validation

  • Gradient Checking: Achieved relative errors of 10⁻⁹ to 10⁻⁷ across all layers
  • Sanity Tests: Passed no-op updates, tiny overfit, and W=1 equivalence checks
  • Numerical Stability: No gradient explosion or NaN propagation across 700+ runs

Convergence Performance

Best Configuration: SBP(K=2, W=4, ModelAvg=True)

  • Final objective: 8,801.7 (reconstruction error)
  • Improvement over baseline: 74.5 points (0.8% better than DEFAULTSERVER)
  • Runtime overhead: +0.28 seconds (a 17% increase over the 1.68 s baseline)

Key Findings:

| Configuration | Workers | Final Obj | Wall Time | Improvement |
|---------------|---------|-----------|-----------|-------------|
| DEFAULTSERVER | 1 | 8,876.2 | 1.68s | baseline |
| PS BSP | 2 | 8,887.8 | 1.83s | -11.6 |
| PS BSP | 4 | 8,848.6 | 1.85s | +27.6 |
| PS SBP(K=3) | 4 | 8,858.8 | 1.83s | +17.4 |
| PS SBP(K=2) | 4 | 8,801.7 | 1.90s | +74.5 |

Key Insights

  1. SBP Outperforms BSP: Relaxed synchronization (K=2 out of 4 workers) achieves superior convergence compared to strict BSP
  2. K Parameter Matters: SBP(K=2) beats SBP(K=3) by 51.3 points, demonstrating optimal staleness tolerance
  3. Model Averaging: Provides consistent 5-7 point improvement for W=4 configurations
  4. Modest Overhead: Distributed training adds only 0.15-0.28 seconds for moderate datasets (32,768 rows)
  5. Scalability: Execution time dominates compilation overhead for datasets >10K rows

Repository Structure

LDE_Experiments/
├── src/test/scripts/functions/builtin/
│   ├── autoencoder_2layer.dml           # Main implementation (867 lines)
│   ├── autoencoderGeneralized.dml       # Generalized wrapper (130 lines)
│   └── autoGradientCheck.dml            # Gradient checking (95 lines)
├── configs/
│   ├── e16_default.yaml                 # Baseline experiments
│   ├── e16_ps_w2.yaml                   # 2-worker configs
│   ├── e16_ps_w4.yaml                   # 4-worker configs
│   ├── epoch_curve.yaml                 # Convergence analysis
│   ├── epoch_curve_sbp.yaml             # SBP parameter sweep
│   ├── gradient_check.yaml              # Gradient verification
│   ├── stress_suite.yaml                # Comprehensive testing
│   └── epoch_curve_fast.yaml            # Quick validation
├── run_sysds_experiments.py             # Experiment automation
├── results/
│   ├── results1.csv                     # Gradient checking (90 runs)
│   ├── results2-4.csv                   # Early experiments
│   ├── results8-11.csv                  # Scaling/stress tests
│   ├── results12.csv                    # DEFAULTSERVER (5 runs)
│   ├── results13.csv                    # PARAMSERVER W=2 (20 runs)
│   └── results14.csv                    # PARAMSERVER W=4 (40 runs)
├── figures/                             # Generated visualizations
│   ├── gradient_check_errors.png
│   ├── gradient_check_by_layer.png
│   ├── epoch16_comparison.png
│   ├── bsp_vs_sbp_comparison.png
│   ├── model_averaging_impact.png
│   ├── runtime_breakdown.png
│   ├── variance_analysis.png
│   ├── performance_accuracy_tradeoff.png
│   ├── convergence_w2.png
│   ├── convergence_w4.png
│   └── scaling_analysis.png
├── report_comprehensive.pdf             # Full technical report (38 pages)
├── report_comprehensive.tex             # LaTeX source
└── README.md                            # This file

Quick Start

Prerequisites

  • Apache SystemDS 3.0.0+ (installation guide)
  • Java 11 or higher
  • Python 3.8+ with packages: pyyaml, pandas, matplotlib, seaborn, numpy

Installation

# Clone repository
git clone https://github.com/AdityaPandey2612/LDE_Experiments.git
cd LDE_Experiments

# Install Python dependencies
pip install pyyaml pandas matplotlib seaborn numpy

# Update SystemDS path in runner script
# Edit run_sysds_experiments.py, line ~25:
# SYSTEMDS_ROOT = "/path/to/your/SystemDS"  # UPDATE THIS

Generate Data

# Create data directory
mkdir -p data

# Generate 32,768 x 64 random training data
systemds -f - << 'EOF'
X = rand(rows=32768, cols=64, min=0, max=1, pdf="uniform");
write(X, "data/X.bin", format="binary");
print("Data generated successfully");
EOF

Run Experiments

Single configuration:

python run_sysds_experiments.py \
  --yaml configs/e16_ps_w4.yaml \
  --stage epoch_curve \
  --repeats 5 \
  --output results/results14.csv

Full experiment suite:

# Run all configurations
for config in configs/*.yaml; do
  basename=$(basename $config .yaml)
  python run_sysds_experiments.py \
    --yaml $config \
    --repeats 5 \
    --output results/${basename}.csv
done

Manual execution (for debugging):

systemds src/test/scripts/functions/builtin/autoencoderGeneralized.dml \
  -exec singlenode -stats -nvargs \
  X=data/X.bin H1=16 H2=8 H3=4 \
  EPOCH=16 BATCH=256 STEP=1e-4 DECAY=1.0 MOMENTUM=0.0 \
  FULL_OBJ=TRUE METHOD=PARAMSERVER MODE=LOCAL UTYPE=SBP \
  FREQ=EPOCH WORKERS=4 K=2 SCHEME=DISJOINT_RANDOM \
  NBATCHES=0 MODELAVG=TRUE \
  W1_out=W1.bin Wlast_out=Wlast.bin hidden_out=hidden.bin

Analyze Results

# Generate visualizations and statistics
python analyze_results.py

# Outputs:
# - All PNG figures in current directory
# - summary_statistics.txt

Visualizations

The analysis pipeline generates 11 publication-ready figures:

  1. Gradient Check Errors: Scatter plot of relative errors across all layers
  2. Gradient Check by Layer: Box plot showing error distribution per layer
  3. EPOCH=16 Comparison: Bar charts comparing final objective and wall time
  4. BSP vs SBP Comparison: Direct comparison for W=2 and W=4
  5. Model Averaging Impact: Effect of ModelAvg on convergence
  6. Convergence Curves: Before/after objective for W=2 and W=4
  7. Scaling Analysis: Runtime and convergence vs dataset size
  8. Runtime Breakdown: Compilation vs execution time
  9. Variance Analysis: Standard deviation across repeats
  10. Performance-Accuracy Trade-off: Scatter plot of runtime vs convergence quality

Experimental Details

Model Architecture

  • Input/Output: 64 dimensions
  • Encoder: 64 → 16 → 8 → 4 (bottleneck)
  • Decoder: 4 → 8 → 16 → 64 (symmetric)
  • Activation: tanh (with derivative caching for efficient backprop)
  • Loss: Mean squared reconstruction error
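
As a rough NumPy sketch of this architecture (illustrative only; training happens in DML, and details such as the output-layer activation and the exact initialization scale are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [64, 16, 8, 4]                 # encoder widths; decoder mirrors them

def init(a, b):                       # Glorot-style normal scaling (assumed)
    return rng.standard_normal((a, b)) * np.sqrt(2.0 / (a + b))

enc = [init(a, b) for a, b in zip(dims, dims[1:])]
dec = [init(a, b) for a, b in zip(dims[::-1], dims[::-1][1:])]

def forward(X):
    H = X
    for W in enc + dec:
        H = np.tanh(H @ W)            # tanh at every layer (output layer assumed)
    return H

X = rng.standard_normal((256, 64))
Xhat = forward(X)
mse = np.mean((X - Xhat) ** 2)        # mean squared reconstruction error
```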

Training Configuration

  • Dataset: 32,768 rows × 64 columns (Gaussian random)
  • Batch size: 256
  • Learning rate: 10⁻⁴
  • Momentum: 0.0
  • Decay: 1.0 (no decay)
  • Epochs: 16 (primary experiments)

Synchronization Strategies Evaluated

| Strategy | Description | Workers | K Parameter |
|----------|-------------|---------|-------------|
| DEFAULTSERVER | Single-node SGD | 1 | N/A |
| BSP | Bulk Synchronous Parallel | 2, 4 | N/A |
| SBP | Stale synchronous with backup workers | 2, 4 | 1, 2, 3 |

SBP Parameter K:

  • K = number of workers to wait for before proceeding
  • Remaining (W-K) workers act as backups for straggler tolerance
  • K = W → equivalent to BSP
  • K < W → provides asynchrony and faster updates
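
The wait-for-K semantics can be illustrated in a few lines (a simulation sketch, not SystemDS's actual scheduler):

```python
def sbp_round(latencies, K):
    """One SBP round: the server waits for the first K of W worker gradients;
    the remaining W - K workers act as backups whose late results are dropped."""
    order = sorted(range(len(latencies)), key=lambda w: latencies[w])
    return order[:K]                # workers whose gradients enter the update

# Hypothetical per-round latencies (seconds) for W = 4 workers:
lat = [0.42, 0.11, 0.35, 0.19]
used = sbp_round(lat, K=2)
print(used)  # → [1, 3]: the two fastest workers drive this update
```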

Performance Metrics

Convergence Quality

  • Gradient accuracy: 10⁻⁹ to 10⁻⁷ relative error
  • Final reconstruction error: 8,801.7 (best SBP configuration)
  • Improvement range: -11.6 to +74.5 points vs baseline
  • Coefficient of variation: 0.8-1.2% (stable across repeats)
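
The coefficient of variation is simply the sample standard deviation divided by the mean. With hypothetical final objectives from five repeats (made-up numbers, not results from the report):

```python
import numpy as np

# Hypothetical final objectives from five repeats of one configuration:
obj = np.array([8801.7, 8950.0, 8750.0, 8900.0, 8810.0])
cv = obj.std(ddof=1) / obj.mean() * 100   # coefficient of variation, in percent
# roughly 0.9% for these made-up numbers, the same order as the 0.8-1.2% reported
```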

Runtime Performance

  • DEFAULTSERVER: 1.68 ± 0.03 seconds
  • PARAMSERVER overhead: +0.15 to +0.28 seconds
  • Compilation time: ~0.52 seconds (constant)
  • Execution scaling: O(N^1.1) for dataset size N
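
An empirical scaling exponent like N^1.1 can be recovered by a log-log least-squares fit. The timings below are illustrative stand-ins, not measured values from the report:

```python
import numpy as np

# Illustrative (rows, seconds) pairs -- NOT measured values from the report.
N = np.array([4096.0, 8192.0, 16384.0, 32768.0])
t = np.array([0.21, 0.45, 0.97, 2.10])

# Fit t ~ c * N**alpha  <=>  log t = alpha * log N + log c
alpha, log_c = np.polyfit(np.log(N), np.log(t), 1)
# alpha comes out close to 1.1 for these illustrative timings
```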

Statistical Analysis

  • Total experimental runs: 700+
  • Configurations tested: 50+
  • Repeats per config: 3-5
  • CSV result files: 11 (results1-results14)
  • Total result size: ~500KB

Technical Highlights

DML Implementation Features

  • Generalized architecture: Supports arbitrary encoder depths via recursive construction
  • Parameter server integration: Native SystemDS paramserv() API usage
  • Gradient computation: Separate worker gradient function for distributed execution
  • Aggregation function: Server-side gradient aggregation and model update
  • Momentum support: Velocity accumulators maintained across iterations
  • Glorot initialization: Proper weight scaling for deep networks

Infrastructure Features

  • YAML configuration: Declarative experiment definitions with grid expansion
  • Automated execution: Parallel experiment scheduling with progress tracking
  • Error handling: Timeout protection, retry logic, detailed error logging
  • Metrics collection: Comprehensive timing breakdown (compilation, execution, wall time)
  • Result validation: Automatic parsing of SystemDS statistics output
  • Reproducibility: Fixed seeds, deterministic ordering, version tracking

Documentation

Complete Technical Report

The repository includes a comprehensive 38-page technical report (report_comprehensive.pdf) covering:

  • Mathematical formulation: Autoencoder architecture, loss function, backpropagation
  • Implementation details: Code walkthroughs with actual DML snippets
  • Experimental methodology: Configuration management, automation pipeline
  • Correctness verification: Gradient checking, sanity tests, numerical stability
  • Convergence analysis: Detailed comparison of synchronization strategies
  • Scaling analysis: Runtime and convergence vs dataset size and worker count
  • Discussion: Findings, limitations, when to use distributed training
  • Reproducibility: Step-by-step instructions, troubleshooting, verification checklist

Key Sections

  1. Introduction and Research Objectives
  2. Model Architecture and Training Algorithms
  3. DML Implementation with Code Listings
  4. Experimental Infrastructure and YAML Configs
  5. Correctness Verification (Gradient Checking)
  6. Convergence Analysis (BSP vs SBP)
  7. Scaling Analysis and Overhead Breakdown
  8. Discussion and Future Work
  9. Complete Reproducibility Guide

Research Context

This work was completed as part of the Large-Scale Data Engineering course at Technische Universität Berlin. The project demonstrates:

  • Scalable implementation of deep learning in declarative ML frameworks
  • Comprehensive experimental methodology for distributed systems evaluation
  • Trade-off analysis between convergence quality and synchronization overhead
  • Best practices for reproducible machine learning research

Future Work

Algorithmic Extensions

  • Implement fully asynchronous (ASP) parameter server mode
  • Add adaptive K parameter based on worker latency distribution
  • Integrate gradient compression (sparsification, quantization)
  • Support local SGD (multiple local updates before sync)
  • Implement adaptive learning rate methods (Adam, RMSprop)

Infrastructure Enhancements

  • Add checkpointing for true learning curves across epochs
  • Implement distributed execution on multi-node Spark cluster
  • Integrate Bayesian hyperparameter optimization
  • Add real-time TensorBoard-style monitoring
  • Support for convolutional and variational autoencoders

Experimental Extensions

  • Evaluate on real datasets (MNIST, CIFAR-10)
  • Larger-scale experiments (1M+ samples, 1000+ dimensions)
  • Benchmark against TensorFlow/PyTorch distributed training
  • Heterogeneous worker environments (straggler simulation)
  • Communication cost analysis in true distributed setting

Contributing

  • Additional synchronization strategies (SSP, Gossip, etc.)
  • Alternative architectures (VAE, DAE, CAE)
  • Real-world dataset experiments
  • Performance optimizations
  • Documentation improvements
  • Bug fixes and testing

Author

Aditya Pandey
Technische Universität Berlin

For a more comprehensive account of the project, experiments, and documentation, please see the PDF below:

report_comprehensive.pdf
