LLM benchmarking framework with SystemDS & Ollama & vLLM Backends - LDE Project #2431

Open
kubraaksux wants to merge 71 commits into apache:main from kubraaksux:llm-benchmark

Conversation


kubraaksux commented Feb 16, 2026

Benchmarking framework that compares LLM inference across four backends: OpenAI API, Ollama, vLLM, and SystemDS JMLC with the native llmPredict built-in. Evaluated on 5 workloads (math, reasoning, summarization, JSON extraction, embeddings) with n=50 per workload.

Purpose and motivation

This project was developed as part of the LDE (Large-Scale Data Engineering) course. The llmPredict native built-in was added to SystemDS in PR #2430. This PR (#2431) contains the benchmarking framework that evaluates llmPredict against established LLM serving solutions, plus the benchmark results.

Research questions:

  1. How does SystemDS's llmPredict built-in compare to dedicated LLM backends (OpenAI, Ollama, vLLM) in terms of accuracy and throughput?
  2. How does Java-side concurrent request dispatch scale with the llmPredict instruction?
  3. What is the cost-performance tradeoff across cloud APIs, local CPU inference, and GPU-accelerated backends?

Approach:

  • Built a Python benchmarking framework that runs standardized workloads against all four backends under identical conditions (same prompts, same evaluation metrics)
  • The llmPredict built-in (from PR #2430, "Add LLM inference support to JMLC API") goes through the full DML compilation pipeline (parser → hops → lops → CP instruction) and makes HTTP calls to any OpenAI-compatible inference server
  • Ran evaluation in two phases: (1) a sequential baseline across all backends, (2) SystemDS with Java-side concurrency (ExecutorService thread pool in the llmPredict instruction); a Python analog of this dispatch is sketched after this list
  • GPU backends (vLLM, SystemDS) executed on NVIDIA H100 PCIe (81GB). Ollama ran on local MacBook (CPU). OpenAI ran on local MacBook calling cloud API. All runs used 50 samples per workload, temperature=0.0 for reproducibility.
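
For illustration, here is a minimal Python analog of that concurrent dispatch against an OpenAI-compatible server. It is a sketch only: the actual dispatch lives in the Java CP instruction (ExecutorService), and the endpoint URL, default model, and helper names below are assumptions.

```python
# Sketch: Python analog of the Java-side concurrent dispatch in the llmPredict
# instruction. Endpoint, model, and helper names are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
import requests

def complete(prompt: str, endpoint: str, model: str) -> str:
    # Non-streaming chat completion against an OpenAI-compatible server (e.g. vLLM).
    resp = requests.post(
        f"{endpoint}/v1/chat/completions",
        json={"model": model,
              "messages": [{"role": "user", "content": prompt}],
              "temperature": 0.0},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def dispatch(prompts, endpoint="http://localhost:8000",
             model="Qwen/Qwen2.5-3B-Instruct", concurrency=4):
    # concurrency=1 reproduces the sequential baseline; c=4 mirrors the benchmarked setting.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda p: complete(p, endpoint, model), prompts))
```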

Project structure

scripts/staging/llm-bench/
├── runner.py                  # Main benchmark runner (CLI entry point)
├── backends/
│   ├── openai_backend.py      # OpenAI API (gpt-4.1-mini)
│   ├── ollama_backend.py      # Ollama local server (llama3.2)
│   ├── vllm_backend.py        # vLLM serving engine (streaming HTTP)
│   └── systemds_backend.py    # SystemDS JMLC via Py4J + llmPredict DML
├── workloads/
│   ├── math/                  # GSM8K dataset, numerical accuracy
│   ├── reasoning/             # BoolQ dataset, logical accuracy
│   ├── summarization/         # XSum dataset, ROUGE-1 scoring
│   ├── json_extraction/       # Built-in structured extraction
│   └── embeddings/            # STS-Benchmark, similarity scoring
├── evaluation/
│   └── perf.py                # Latency, throughput metrics
├── scripts/
│   ├── report.py              # HTML report generator
│   ├── aggregate.py           # Cross-run aggregation
│   └── run_all_benchmarks.sh  # Batch automation (all backends, all workloads)
├── results/                   # Benchmark outputs (metrics.json per run)
└── tests/                     # Unit tests for accuracy checks + runner

Note: The llmPredict built-in implementation (Java pipeline files) is in PR #2430. This PR includes the benchmark framework and results only. Some llmPredict code appears in this diff because both branches share the same local repository.

Backends

| Backend  | Type         | Model               | Hardware           | Inference path                                  |
|----------|--------------|---------------------|--------------------|-------------------------------------------------|
| OpenAI   | Cloud API    | gpt-4.1-mini        | MacBook (API call) | Python HTTP to OpenAI servers                   |
| Ollama   | Local server | llama3.2 (3B)       | MacBook CPU        | Python HTTP to local Ollama                     |
| vLLM     | GPU server   | Qwen2.5-3B-Instruct | NVIDIA H100        | Python streaming HTTP to vLLM engine            |
| vLLM     | GPU server   | Mistral-7B-Instruct | NVIDIA H100        | Python streaming HTTP to vLLM engine            |
| SystemDS | JMLC API     | Qwen2.5-3B-Instruct | NVIDIA H100        | Py4J → JMLC → DML llmPredict → Java HTTP → vLLM |

SystemDS and vLLM Qwen 3B use the same model on the same vLLM inference server, making their accuracy directly comparable. Any accuracy difference comes from the serving path, not the model.

Benchmark results

Evaluation methodology

Each workload defines its own accuracy_check(prediction, reference) function that returns true/false per sample. The accuracy percentage is correct_count / n. All accuracy counts were verified against raw samples.jsonl files and reproduced locally.

| Workload | Criterion | How it works |
|----------|-----------|--------------|
| math | Exact numerical match | Extracts the final number from the model's chain-of-thought response using regex patterns (explicit markers like `####`, `\boxed{}`, bold `**N**`, or the last number in the text). Compares against the GSM8K reference answer. Passes if `abs(predicted - reference) < 1e-6`. |
| reasoning | Extracted answer match | Extracts a yes/no or text answer from the response using CoT markers ("answer is X", "therefore X") or the last short line. Compares against the BoolQ reference using exact match, word-boundary substring match, or numeric comparison. |
| summarization | ROUGE-1 F1 >= 0.2 | Computes the ROUGE-1 F1 score between the generated summary and the XSum reference using the rouge-score library with stemming. A threshold of 0.2 means the summary shares at least 20% unigram overlap (F1) with the reference. Predictions shorter than 10 characters are rejected. |
| json_extraction | >= 90% of fields match | Parses JSON from the model response (direct parse, then markdown code fences, then regex). Checks that all required fields from the reference are present. Values are compared strictly: case-insensitive for strings, exact for numbers/booleans. Passes if at least 90% of field values match. |
| embeddings | Score within 1.0 of reference | The model rates sentence-pair similarity on the 0-5 STS scale. The predicted score is extracted from the response. Passes if `abs(predicted - reference) <= 1.0` (a 20% tolerance on the 0-5 scale). |
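
As a concrete illustration of the math criterion above, a simplified sketch of the extract-and-compare logic (the regex set and helper names are assumptions; the actual checker in workloads/math/ covers more answer formats):

```python
import re

# Simplified version of the extraction order described above: explicit "#### N"
# marker, \boxed{N}, bold **N ...**, otherwise the last number in the text.
PATTERNS = [
    r"####\s*(-?\d[\d,]*\.?\d*)",
    r"\\boxed\{(-?\d[\d,]*\.?\d*)\}",
    r"\*\*(-?\d[\d,]*\.?\d*)[^*]*\*\*",
]

def extract_final_number(text):
    for pat in PATTERNS:
        m = re.search(pat, text)
        if m:
            return float(m.group(1).replace(",", ""))
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return float(numbers[-1].replace(",", "")) if numbers else None

def accuracy_check(prediction, reference):
    # Per-sample check: exact numerical match within a small tolerance.
    pred = extract_final_number(prediction)
    ref = float(str(reference).replace(",", ""))
    return pred is not None and abs(pred - ref) < 1e-6
```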

Accuracy (% correct, n=50 per workload)

| Workload | Ollama llama3.2 3B | OpenAI gpt-4.1-mini | vLLM Qwen 3B | SystemDS Qwen 3B c=1 | SystemDS Qwen 3B c=4 | vLLM Mistral 7B |
|---|---|---|---|---|---|---|
| math | 58% | 94% | 68% | 68% | 68% | 38% |
| json_extraction | 74% | 84% | 52% | 52% | 52% | 50% |
| reasoning | 44% | 70% | 60% | 60% | 64% | 68% |
| summarization | 80% | 88% | 50% | 50% | 62% | 68% |
| embeddings | 40% | 88% | 90% | 90% | 90% | 82% |

Key comparisons

SystemDS vs vLLM (same model, same server — Qwen2.5-3B-Instruct on H100):
SystemDS c=1 matches vLLM Qwen 3B accuracy exactly on all 5 workloads (68%, 52%, 60%, 50%, 90%). This confirms that the llmPredict instruction produces identical results to calling vLLM directly. Both use temperature=0.0 (deterministic), same prompts, same inference server. c=4 shows minor variation on reasoning (64% vs 60%) and summarization (62% vs 50%) because concurrent requests cause vLLM to batch them differently, introducing floating-point non-determinism in GPU computation.

OpenAI gpt-4.1-mini vs local models:
OpenAI achieves the highest accuracy on four of the five workloads. The gap is largest on math (94% vs 68% for Qwen 3B); the exception is embeddings, where the local Qwen 3B model actually wins (90% vs 88%). OpenAI's advantage comes from model quality (a much larger model), not serving infrastructure.

Qwen 3B vs Mistral 7B (different models, same vLLM server):
Despite being smaller (3B vs 7B parameters), Qwen outperforms Mistral on math (68% vs 38%) and embeddings (90% vs 82%). Mistral is better on reasoning (68% vs 60%) and summarization (68% vs 50%). This shows that model architecture and training data matter more than parameter count alone. Mistral's low math score (38%) has two causes: in 20 of 31 incorrect samples the model computed the wrong answer entirely (wrong formulas, negative results, or refusing to solve), and in 10 cases the correct answer appeared in the response but the number extractor grabbed an intermediate value instead due to verbose chain-of-thought formatting.

Ollama llama3.2 3B (MacBook CPU):
Ollama leads on summarization (80%) likely because llama3.2's training emphasized concise outputs that align well with the ROUGE-1 threshold. It scores lowest on embeddings (40%) because the model frequently refuses the similarity-rating task or defaults to high scores regardless of actual similarity.

Per-prompt latency (mean ms/prompt, n=50)

| Workload | Ollama (MacBook CPU) | OpenAI (MacBook → Cloud) | vLLM Qwen 3B (H100) | SystemDS Qwen 3B c=1 (H100) |
|---|---|---|---|---|
| math | 5781 | 3630 | 4619 | 2273 |
| json_extraction | 1642 | 1457 | 1151 | 610 |
| reasoning | 5252 | 2641 | 2557 | 1261 |
| summarization | 1079 | 1036 | 791 | 373 |
| embeddings | 371 | 648 | 75 | 41 |

Note on measurement methodology: Latency numbers are not directly comparable across backends because each measures differently. The vLLM backend uses Python requests with streaming (SSE token-by-token parsing adds overhead). SystemDS measures Java-side HttpURLConnection round-trip time (non-streaming, gets full response at once). Ollama measures Python HTTP round-trip on CPU. OpenAI includes network round-trip to cloud servers. The accuracy comparison is the apples-to-apples metric since all backends process the same prompts.
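
To make the measurement difference concrete, a sketch of the two timing styles being compared (endpoint and payload are assumptions; the SystemDS path uses Java HttpURLConnection rather than Python, but measures the same kind of non-streaming round trip):

```python
import json, time, requests

PAYLOAD = {"model": "Qwen/Qwen2.5-3B-Instruct",
           "messages": [{"role": "user", "content": "What is 2+2?"}],
           "temperature": 0.0}
URL = "http://localhost:8000/v1/chat/completions"

# Non-streaming: latency is the time until the full JSON body arrives
# (closest to what the SystemDS Java round trip measures).
t0 = time.perf_counter()
requests.post(URL, json=PAYLOAD, timeout=300)
non_streaming_ms = (time.perf_counter() - t0) * 1000

# Streaming: Server-Sent Events parsed chunk by chunk, as in the vLLM backend;
# the client-side SSE parsing is counted into the per-prompt latency.
t0 = time.perf_counter()
with requests.post(URL, json={**PAYLOAD, "stream": True}, stream=True, timeout=300) as r:
    for line in r.iter_lines():
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            json.loads(line[len(b"data: "):])  # accumulate delta tokens here
streaming_ms = (time.perf_counter() - t0) * 1000
```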

SystemDS concurrency scaling (throughput)

| Workload | c=1 (req/s) | c=4 (req/s) | Speedup |
|---|---|---|---|
| math | 0.44 | 1.63 | 3.71x |
| json_extraction | 1.62 | 5.65 | 3.49x |
| reasoning | 0.79 | 3.11 | 3.95x |
| summarization | 2.62 | 7.27 | 2.78x |
| embeddings | 20.07 | 46.34 | 2.31x |

Throughput = n / total_wall_clock_seconds (measured Python-side, end-to-end including JMLC overhead). Theoretical maximum speedup is 4x. Math and reasoning (longer generation, ~1-2s per prompt) get closest to 4x because the per-request time dominates. Embeddings (very short responses, ~41ms per prompt) only achieves 2.31x because JMLC pipeline overhead becomes proportionally significant.
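
In code, the scaling numbers above reduce to a throughput ratio (a sketch; metrics field names are assumptions):

```python
# Throughput as reported above: n prompts divided by end-to-end wall-clock seconds,
# measured Python-side around the whole JMLC call (includes JMLC overhead).
def throughput(n_samples, wall_clock_s):
    return n_samples / wall_clock_s

# Speedup for the math workload from the table: 1.63 / 0.44 ≈ 3.7x,
# close to the theoretical 4x because per-request generation time dominates.
speedup_math = 1.63 / 0.44
# Embeddings: 46.34 / 20.07 ≈ 2.3x, limited by fixed JMLC pipeline overhead.
speedup_embeddings = 46.34 / 20.07
```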

Cost comparison

All backends incur compute cost (hardware amortization + electricity) for the machine running them. GPU backends run on the H100 server; Ollama and OpenAI run on a local MacBook. OpenAI additionally incurs API cost per token.

How cost is calculated: compute_cost = (wall_clock_seconds / 3600) × (hardware_cost / lifetime_hours + (power_watts / 1000) × electricity_rate). Assumptions: electricity at $0.30/kWh; H100 server: 350W, $30K over 15K hours ($2.00/h amortization + $0.105/h electricity = $2.105/h); MacBook: 50W, $3K over 15K hours ($0.20/h + $0.015/h = $0.215/h). OpenAI API cost is recorded by the runner from the per-response token usage.
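
A small sketch of this cost model (constants from the assumptions above; it reproduces, for example, the ≈$0.269 compute cost for vLLM Qwen 3B at 460 s of H100 time):

```python
def compute_cost_usd(wall_clock_s, hardware_cost_usd, lifetime_h, power_w,
                     electricity_usd_per_kwh=0.30):
    # Hourly machine cost: hardware amortization plus electricity.
    hourly = hardware_cost_usd / lifetime_h + (power_w / 1000.0) * electricity_usd_per_kwh
    return wall_clock_s / 3600.0 * hourly

# H100 server: $30K over 15K hours, 350 W -> $2.105/h
print(compute_cost_usd(460, 30_000, 15_000, 350))  # ~0.269 (vLLM Qwen 3B)
# MacBook: $3K over 15K hours, 50 W -> $0.215/h
print(compute_cost_usd(706, 3_000, 15_000, 50))    # ~0.042 (Ollama)
```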

| Backend | Hardware | Wall clock (250 queries) | Compute cost | API cost | Total cost | Cost per query |
|---|---|---|---|---|---|---|
| Ollama llama3.2 3B | MacBook CPU | 706 s | $0.0422 | -- | $0.0422 | $0.000169 |
| OpenAI gpt-4.1-mini | MacBook + API | 471 s | $0.0281 | $0.0573 | $0.0855 | $0.000342 |
| vLLM Qwen 3B | H100 GPU | 460 s | $0.2688 | -- | $0.2688 | $0.001076 |
| SystemDS Qwen 3B c=1 | H100 GPU | 230 s | $0.1345 | -- | $0.1345 | $0.000538 |
| SystemDS Qwen 3B c=4 | H100 GPU | 64 s | $0.0372 | -- | $0.0372 | $0.000149 |

OpenAI API cost breakdown (recorded per run): math $0.0227, reasoning $0.0172, json_extraction $0.0080, summarization $0.0076, embeddings $0.0019.

Conclusions

  1. SystemDS llmPredict produces identical results to vLLM: SystemDS c=1 matches vLLM Qwen 3B accuracy exactly on all 5 workloads (68%, 52%, 60%, 50%, 90%). Both use the same model on the same inference server with temperature=0.0, confirming that the llmPredict DML built-in adds no distortion to model outputs.

  2. Concurrency scales throughput 2.3-3.9x: The ExecutorService thread pool in the llmPredict instruction dispatches up to 4 requests concurrently. Longer-running workloads (math 3.71x, reasoning 3.95x) get closest to the theoretical 4x speedup. Short workloads (embeddings 2.31x) are limited by JMLC pipeline overhead.

  3. OpenAI leads on accuracy but costs more per query: gpt-4.1-mini achieves the highest accuracy on four of the five workloads (94% math, 84% json, 70% reasoning, 88% summarization) but at $0.000342/query. SystemDS c=4 achieves $0.000149/query, 56% cheaper, with competitive accuracy on focused tasks like embeddings, where it actually leads (90% vs 88%).

  4. Model quality matters more than parameter count: Qwen 3B outperforms the larger Mistral 7B on math (68% vs 38%) and embeddings (90% vs 82%), while Mistral 7B is stronger on reasoning (68% vs 60%) and summarization (68% vs 50%). The serving framework (vLLM vs SystemDS) has no impact on accuracy when the same model is used sequentially; the only differences observed (at c=4) stem from GPU batching non-determinism, not the serving path.

  5. Concurrency reduces compute cost on GPU: SystemDS c=4 at $0.000149/query is the cheapest GPU option — 86% less than vLLM's $0.001076/query — because higher throughput means less wall-clock time per query. Ollama on MacBook CPU is cheapest overall ($0.000169/query) due to low hardware and power costs, but 11x slower.

  6. Latency measurements are not comparable across backends: Each backend uses a different HTTP client (Python streaming, Java non-streaming, cloud API) and measures time differently. Per-prompt latency should only be compared within the same backend across workloads, not across backends.

Generic LLM benchmark suite for evaluating inference performance
across different backends (vLLM, Ollama, OpenAI, MLX).

Features:
- Multiple workload categories: math (GSM8K), reasoning (BoolQ, LogiQA),
  summarization (XSum, CNN/DM), JSON extraction
- Pluggable backend architecture for different inference engines
- Performance metrics: latency, throughput, memory usage
- Accuracy evaluation per workload type
- HTML report generation

This framework can be used to evaluate SystemDS LLM inference
components once they are developed.
- Connection.java: Changed loadModel(modelName) to loadModel(modelName, workerScriptPath)
- Connection.java: Removed findPythonScript() method
- LLMCallback.java: Added Javadoc for generate() method
- JMLCLLMInferenceTest.java: Updated to pass script path to loadModel()
- Connection.java: Auto-find available ports for Py4J communication
- Connection.java: Add loadModel() overload for manual port override
- Connection.java: Use destroyForcibly() with waitFor() for clean shutdown
- llm_worker.py: Accept python_port as command line argument
Move worker script from src/main/python/systemds/ to src/main/python/
to avoid shadowing Python stdlib operator module.
- Add generateWithTokenCount() returning JSON with input/output token counts
- Update generateBatchWithMetrics() to include input_tokens and output_tokens columns
- Add CUDA auto-detection with device_map=auto for multi-GPU support in llm_worker.py
- Check Python process liveness during startup instead of blind 60s timeout
- Fix duplicate accuracy computation in runner.py
- Add --model flag and error handling to run_all_benchmarks.sh
- Fix ttft_stats and timing_stats logic bugs
- Extract shared helpers into scripts/utils.py
- Add HuggingFace download fallback to all loaders
- Fix reasoning accuracy false positives with word-boundary regex
- Pin dependency versions in requirements.txt
- Clean up dead code and unify config keys across backends
- Fix README clone URL and repo structure
- Use real token counts from Ollama/vLLM APIs, omit when unavailable
- Correct TTFT and cost estimates
- Add --gpu-hour-cost and --gpu-count flags for server benchmarks
- 121 unit tests for all accuracy checkers, loaders, and metrics
- ROUGE-1/2/L scoring for summarization (replaces quality-gate heuristic)
- Concurrent request benchmarking with --concurrency flag
- GPU profiling via pynvml
- Real TTFT for MLX backend via stream_generate
- Backend factory pattern and config validation
- Proper logging across all components
- Updated configs to n_samples=50
Replace declare -A (bash 4+ only) with a case function for
default model lookup. macOS ships with bash 3.x.
- New embeddings workload using STS-Benchmark from HuggingFace
- Model rates semantic similarity between sentence pairs (0-5 scale)
- 21 new tests for score extraction, accuracy check, sample loading
- Total: 142 tests passing across 5 workloads
- Add electricity + hardware amortization cost estimation to runner
  (--power-draw-w, --electricity-rate, --hardware-cost flags)
- Fix aggregate.py cost key mismatch (api_cost_usd vs cost_total_usd)
- Add compute cost columns to CSV output and HTML report
- Update README with cost model documentation and embeddings workload
Include all 10 benchmark runs (5 OpenAI + 5 Ollama, 50 samples each)
with metrics, samples, configs, HTML report, and aggregated CSV.
- 5 workloads x 2 models on NVIDIA H100 PCIe via vLLM
- Mistral-7B-Instruct-v0.3: strong reasoning (68%), fast embeddings (129ms)
- Qwen2.5-3B-Instruct: best embeddings accuracy (90%), 75ms latency
- Compute costs reflect H100 electricity (350W) + hardware amortization
- Regenerated summary.csv and benchmark_report.html with all 20 runs
Integrate SystemDS as a benchmark backend using the JMLC API. All prompts
are processed through PreparedScript.generateBatchWithMetrics() which
returns results in a typed FrameBlock with per-prompt timing and token
metrics. Benchmark results for 4 workloads with distilgpt2 on H100.
Run the embeddings (semantic similarity) workload with SystemDS JMLC,
bringing SystemDS to 5 workloads matching all other backends.
Run all 5 workloads with Qwen/Qwen2.5-3B-Instruct through the SystemDS
JMLC backend, replacing the distilgpt2 toy model. This enables a direct
apples-to-apples comparison with vLLM Qwen 3B: same model, different
serving path (raw HuggingFace via JMLC vs optimized vLLM inference).
Replace distilgpt2 toy model with same models used by vLLM backends:
- SystemDS + Qwen 3B (5 workloads) vs vLLM + Qwen 3B
- SystemDS + Mistral 7B (5 workloads) vs vLLM + Mistral 7B
All runs include compute cost flags (350W, $0.30/kWh, $30k hardware).
Increase JMLC worker timeout from 60s to 300s for larger models.
Replace sequential per-prompt inference with true GPU batching:
- LLMCallback.java: add generateBatch() for batched inference
- PreparedScript.java: call generateBatch() instead of per-prompt loop
- llm_worker.py: implement batched tokenization and model.generate()

Results (50 samples per workload, NVIDIA H100):
- Qwen 3B: 3-12x speedup (math 22s->1.9s, embeddings 144ms->49ms)
- Mistral 7B: 7-14x speedup (json 5.4s->388ms, embeddings 380ms->28ms)
- Batched SystemDS now faster than sequential vLLM on most workloads
- Accuracy comparable (within statistical noise, n=50)
- LLMCallback.java: add generateBatch() interface method
- PreparedScript.java: replace per-prompt for-loop with single batch call
- llm_worker.py: implement batched tokenization and model.generate()

Achieves 3-14x speedup over sequential inference on H100.
PreparedScript.generateBatchWithMetrics() now accepts a boolean batched
parameter: true for GPU-batched inference (new), false for the original
sequential for-loop. Defaults to batched=true.

systemds_backend.py passes the batched flag from config so benchmark
runs can select either mode.
generateBatchWithMetrics() now accepts a boolean batched parameter:
true for GPU-batched (new), false for original sequential for-loop.
- Use proper imports instead of inline fully-qualified class names
- Add try-with-resources for HTTP streams to prevent resource leaks
- Add connect/read timeouts to HTTP calls
- Add lineage tracing support for llmPredict
- Add checkInvalidParameters validation in parser
- Remove leftover Py4J code from Connection/PreparedScript
- Delete LLMCallback.java
- Remove .claude/.env/meeting_notes from .gitignore
- Trim verbose docstrings
Supports parallel HTTP calls to the inference server via
ExecutorService. Default concurrency=1 keeps sequential behavior.
- Delete Py4J-based benchmark results (will re-run with llmPredict)
- Remove license header from test (Matthias will add)
- Clarify llm_server.py docstring
JMLC requires the LHS variable name in read() assignments to match
the input name registered in prepareScript(). Changed X/R to
prompts/results so RewriteRemovePersistentReadWrite correctly
converts persistent reads to transient reads.
Correct SystemDS concurrency scaling numbers to match actual metrics.json
data (throughput-based instead of incorrect per-prompt estimates). Update
latency table, concurrency scaling table, run_all_benchmarks.sh for
automatic c=1/c=4 runs, and regenerate HTML report.
- Remove broken base SystemDS result directories (0% accuracy, 0ms latency
  from failed earlier run)
- Remove fabricated cost per query table (benchmarks were run without
  --power-draw-w/--hardware-cost flags, all cost data was $0)
- Fix accuracy claim: c=1 matches vLLM exactly, c=4 shows minor variation
  on reasoning (64% vs 60%) and summarization (62% vs 50%) due to vLLM
  batching non-determinism
- Add SystemDS c=1 and c=4 columns to accuracy tables
- Fix report.py to show c=1 and c=4 as separate backends instead of
  merging them into one "systemds (Qwen2.5-3B)" column
- Fix floating point truncation bug in accuracy tooltip (int(50*0.58)=28,
  now uses accuracy_count from metrics.json directly)
- Replace stale "Py4J bridge cost" references with "JMLC overhead"
- Regenerate HTML report and summary CSV
…usions

Major changes:
- Restructure README: move SystemDS architecture section before results,
  add compilation pipeline files, add JMLC code example
- Add measurement methodology note: vLLM uses Python streaming HTTP while
  SystemDS uses Java non-streaming HttpURLConnection, making per-prompt
  latency not directly comparable across backends
- Rewrite conclusions to be evidence-based: llmPredict correctness proven
  by accuracy match, concurrency scaling quantified, model-vs-backend
  distinction made explicit, latency caveat explained
- Remove MLX from supported backends table (not benchmarked), mark as
  "not benchmarked" in repo structure
- Remove fabricated OpenAI cost claim ($0.02-0.03)
- Remove "All backends overview" table (redundant with other tables)
- Simplify concurrency scaling table to throughput only (remove
  misleading effective latency columns)
- Put accuracy table first (apples-to-apples metric) before latency
…and evaluation methodology

- Fix bold-pattern regex in math number extraction: allow arbitrary text
  between number and closing ** (fixes 3 false negatives in OpenAI math,
  44/50 -> 47/50)
- Re-score all 30 result sets from raw samples.jsonl (only OpenAI math changed)
- Add complete cost comparison table with all backends including OpenAI
  API cost + local compute cost
- Add cost calculation formula with hardware assumptions
- Add evaluation methodology section explaining per-workload accuracy criteria
- Add cross-backend comparisons (SystemDS vs vLLM, OpenAI vs local,
  Qwen 3B vs Mistral 7B, Ollama analysis)
- Fix PR description scope: this is the benchmark framework PR, not llmPredict
- Fix hardware claims: Ollama/OpenAI ran on MacBook, not H100
- Add model names to SystemDS column headers (SystemDS Qwen 3B c=1/c=4)
- Explain Mistral's low math results (verbose output confuses extractor)
- Regenerate HTML report
The previous explanation attributed all failures to the number extractor.
Analysis of raw samples shows 20 of 31 incorrect answers were genuinely
wrong (wrong formulas, negative results, refusing to solve), while only
10 had the correct answer present but extracted the wrong number.