Team: Ruban (Solo Dev) | Location: Singapore
I iterated through several approaches before landing on the final architecture:
- System prompt tuning. I changed FunctionGemma's system prompt from the default to "You are a model that can do function calling with the following functions" — this improved tool selection accuracy on multi-tool queries.
- Tool description engineering. FunctionGemma kept confusing `set_timer` and `set_alarm`. I added explicit disambiguation: "Set a countdown timer for a duration in minutes. NOT an alarm." This reduced timer/alarm swaps significantly.
- Confidence threshold tuning. I lowered `confidence_threshold` to 0.1 to prevent premature cloud handoffs — FunctionGemma's built-in confidence signal was too conservative, rejecting correct outputs. I handle routing myself instead of relying on the model's self-assessment.
- Model singleton with KV cache reset. Loading FunctionGemma on every call cost ~500ms. I switched to loading once and calling `cactus_reset` between queries to clear state without reinitializing.
- Tool RAG disabled. Set `tool_rag_top_k=0` to pass all tools to the model instead of letting Cactus pre-filter. The pre-filtering was dropping the correct tool on some queries.
- JSON repair layer. FunctionGemma outputs malformed JSON — leading zeros (`"minute":01`), fullwidth colons, `<escape>` tags. I added a repair pass before parsing.
- Query decomposition. Multi-intent queries failed when sent as one prompt. I split on conjunctions, but only when the next word is an action verb — preventing false splits on phrases like "mac and cheese" or "R&B" (see the sketch after this list).
- Pronoun resolution. After decomposition, "find Laura and send her a message" needs "her" replaced with "Laura" in the second sub-query. I extract proper nouns from earlier sub-queries and substitute pronouns.
- Argument post-processing. The biggest win. Rather than trusting FunctionGemma's arguments, I extract them from the original query. This fixed hallucinated song names, broken PM conversions, and wrong timer values.
- Validation ordering. Initially I validated before fixing arguments — which meant `minute=120` got rejected and fell to cloud. Flipping the order (fix first, validate after) converted several cloud fallbacks to on-device.
- Cloud output normalization. Gemini returns protobuf types, not native Python. Trailing periods on strings (`"thanks for lunch."`) caused F1 mismatches. I cast all cloud arguments to native types and strip trailing punctuation.
| Metric | Value |
|---|---|
| F1 (tool call accuracy) | 0.96 |
| On-device ratio | ~60% |
| Avg latency | ~1400ms |
| Hard cases (multi-intent) | 98% F1, 100% on-device |
- 5 tools only — FunctionGemma-270M gets confused with more. Five well-scoped tools give high on-device accuracy.
- Sequential sub-query execution — Multi-intent queries run N model calls. Parallel would be faster but risks race conditions on macOS actions.
- Rule-based slot filling — Doesn't help with completely novel argument schemas, but cloud fallback catches those.
- Cloud fallback adds latency — ~1000ms penalty, but ensures correctness over speed.
Spike is a macOS menu bar voice agent. Hold a hotkey, speak a command, and your Mac executes it — opens apps, types text, presses keyboard shortcuts, clicks UI elements, reads screen content. The entire pipeline runs on-device by default: Cactus Whisper for transcription, FunctionGemma-270M for tool routing, with Gemini 2.5 Flash as a cloud fallback only when on-device confidence is low.
```
Voice ──→ Cactus Whisper (on-device) ──→ Transcribed text
                                               │
                                               ▼
                                     Query Decomposition
                                     (split multi-intent)
                                               │
                                               ▼
                                   FunctionGemma (on-device)
                                     Intent classification
                                               │
                                               ▼
                                   Rule-Based Slot Filling
                                   Fix arguments from query
                                               │
                                               ▼
                                  Validation + Sanity Check
                                       │              │
                                      Pass           Fail
                                       │              │
                                       ▼              ▼
                                  On-Device      Gemini Flash
                                   (~250ms)    (cloud fallback)
                                       │              │
                                       └──────┬───────┘
                                              ▼
                                      Action Dispatcher
                                    (NSWorkspace, CGEvent,
                                        AXUIElement)
```
Key insight: FunctionGemma is excellent at picking which function to call but poor at extracting accurate arguments. I follow the same architecture as production voice assistants (Siri, Alexa): neural model for intent classification, rule-based extraction for entities. FunctionGemma always runs first — regex is never used for function selection.
- Query Decomposition — Multi-intent queries ("open Safari and type google.com") are split on action verbs. Each sub-query gets independent FunctionGemma inference.
- Pronoun Resolution — "Find Laura and send her a message" → second sub-query becomes "send Laura a message".
- FunctionGemma Inference — On-device intent classification. Picks the function from available tools. ~200-400ms.
- Argument Fixing — Post-processor extracts accurate arguments from the original query text. Handles AM/PM time conversion, song names, contact names, locations, reminder titles. If the tool type is unknown, the model's original arguments pass through.
- Validation — Rejects garbage outputs: out-of-range hours/minutes, CJK hallucinations, ISO datetime formats. Cross-checks function names against query keywords (e.g., "remind" → must be `create_reminder`, not `set_alarm`); a sketch of this cross-check follows this list.
- Cloud Fallback — When validation fails, Gemini 2.5 Flash handles it. Total latency includes both the on-device attempt and the cloud call for full transparency.
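A sketch of the keyword cross-check mentioned in the Validation step. Only the remind/timer/alarm mappings come from this write-up; the function name and any further map entries are illustrative:

```python
# Keywords that, when present in the query, pin the required tool.
# The remind/timer/alarm pairs come from the write-up; extend as needed.
KEYWORD_TOOL_MAP = {
    "remind": "create_reminder",
    "timer": "set_timer",
    "alarm": "set_alarm",
}

def passes_keyword_check(query: str, chosen_tool: str) -> bool:
    """Reject a FunctionGemma tool choice that contradicts an explicit
    keyword in the query, e.g. 'remind' must map to create_reminder."""
    q = query.lower()
    for keyword, required_tool in KEYWORD_TOOL_MAP.items():
        if keyword in q and chosen_tool != required_tool:
            return False
    return True

assert passes_keyword_check("remind me to stretch at 5pm", "create_reminder")
assert not passes_keyword_check("remind me to stretch at 5pm", "set_alarm")
```

Queries that fail this check (or any other validation rule) fall through to the Gemini cloud path.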
| Component | Technology | Purpose |
|---|---|---|
| On-device transcription | Cactus Whisper (`cactus_transcribe`) | Voice → text, fully local |
| On-device tool routing | FunctionGemma-270M via Cactus (`cactus_complete`) | Intent classification |
| Cloud fallback | Gemini 2.5 Flash (`google-genai`) | Handles cases FunctionGemma can't |
| App execution | macOS NSWorkspace, CGEvent, AXUIElement | Opens apps, types, clicks, reads screen |
| Bridge server | FastAPI (Python) | Connects Swift frontend to Cactus/Gemini (sketched below) |
| Frontend | SwiftUI menu bar app | Hotkey capture, audio recording, result display |
| Audio | AVAudioRecorder (16kHz mono PCM) | Captures voice input |
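For context on the bridge server row, here is a hypothetical sketch of the FastAPI hop between the SwiftUI frontend and the Python backend. The endpoint path, field names, and helper stubs are illustrative, not the actual Spike code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CommandRequest(BaseModel):
    audio_path: str   # WAV file written by the Swift recorder

class CommandResponse(BaseModel):
    function_name: str
    arguments: dict
    used_cloud: bool  # drives the on-device/cloud badge in the menu bar UI

def transcribe(audio_path: str) -> str:
    # Placeholder: the real pipeline calls cactus_transcribe here.
    return "open safari"

def route_and_fill(text: str) -> tuple[dict, bool]:
    # Placeholder: FunctionGemma inference + slot filling + validation,
    # with Gemini as the fallback, lives here.
    return {"name": "open_app", "arguments": {"app_name": "Safari"}}, False

@app.post("/command", response_model=CommandResponse)
def run_command(req: CommandRequest) -> CommandResponse:
    text = transcribe(req.audio_path)
    call, used_cloud = route_and_fill(text)
    return CommandResponse(function_name=call["name"],
                           arguments=call["arguments"],
                           used_cloud=used_cloud)
```

The Swift side posts the recorded audio path to this endpoint and dispatches the returned call via NSWorkspace/CGEvent/AXUIElement.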
Privacy-first. Voice, screen content, and commands stay on-device. Cloud is only used when FunctionGemma's confidence drops below threshold — and the user sees exactly when that happens via the on-device/cloud badge.
Real execution. This isn't a mock demo. Spike opens real apps, types real text, presses real keyboard shortcuts, and clicks real buttons via macOS Accessibility APIs.
Low latency. On-device path: ~250ms from voice to action. Cloud fallback: ~1300ms. Users feel the difference.
Generalizable routing. The hybrid logic handles any tool schema. Known tools get argument fixing for near-perfect accuracy. Unknown tools degrade gracefully to cloud — correct but slower.
- Cactus runs Google DeepMind's FunctionGemma at up to 3000 tokens/sec prefill speed on M4 Macs.
- Decode speed reaches 200 tokens/sec, all without a GPU, to remain energy-efficient.
- FunctionGemma is great at tool calling, but small models are not the smartest for some tasks.
- There is a need to dynamically combine edge and cloud (Gemini Flash) to get the best of both worlds.
- Cactus develops various strategies for choosing when to fall back to Gemini or FunctionGemma.
- FunctionGemma is just a tool-call model, but tool calling is the core of agentic systems.
- You MUST design new strategies that decide when to stick with on-device or fall back to cloud.
- You will be objectively ranked on tool-call correctness, speed, and edge/cloud ratio (prioritize local).
- You can focus on prompting, tool description patterns, confidence score algorithms, anything!
- Please ensure at least 1 team member has a Mac; Cactus runs on Macs, mobile devices, and wearables.
- Step 1: Fork this repo, clone to your Mac, open terminal.
- Step 2: `git clone https://github.com/cactus-compute/cactus`
- Step 3: `cd cactus && source ./setup && cd ..` (re-run in new terminal)
- Step 4: `cactus build --python`
- Step 5: `cactus download google/functiongemma-270m-it --reconvert`
- Step 6: Get a cactus key from the cactus website
- Step 7: Run `cactus auth` and enter your token when prompted.
- Step 8: `pip install google-genai`
- Step 9: Obtain a Gemini API key from Google AI Studio
- Step 10: `export GEMINI_API_KEY="your-key"`
- Step 11: Click on your location to get Gemini credits: SF, Boston, DC, London, Singapore, Online
- Step 12: Join the Reddit channel, ask any technical questions there.
- Step 13: Read and run `python benchmark.py` to understand how objective scoring works.
- Note: The final objective score will be computed on held-out evals; the top 10 are then judged subjectively.
- Your main task is to modify the internal logic of the `generate_hybrid` method in `main.py`.
- Do not modify the input or output signature (function arguments and return variables) of the `generate_hybrid` method. Keep the hybrid interface compatible with `benchmark.py`.
- Submit to the leaderboard with `python submit.py --team "YourTeamName" --location "YourCity"`, only 1x every 1hr.
- The dataset is a hidden Cactus eval, quite difficult for FunctionGemma by design.
- Use `python benchmark.py` to iterate, but your best score is preserved.
- For transparency, hackers can see live rankings on the leaderboard.
- The leaderboard will start accepting submissions once the event starts.
- The top hackers in each location will make it to judging.
- Rubric 1: The quality, depth, and cleverness of your hybrid routing algorithm.
- Rubric 2: End-to-end products that execute function calls to solve real-world problems.
- Rubric 3: Building low-latency voice-to-action products, leveraging `cactus_transcribe`.
```python
import json
from cactus import cactus_init, cactus_complete, cactus_destroy

model = cactus_init("weights/lfm2-vl-450m")
messages = [{"role": "user", "content": "What is 2+2?"}]
response = json.loads(cactus_complete(model, messages))
print(response["response"])
cactus_destroy(model)
```

| Parameter | Type | Description |
|---|---|---|
| `model_path` | `str` | Path to model weights directory |
| `corpus_dir` | `str` | (Optional) dir of txt/md files for auto-RAG |
model = cactus_init("weights/lfm2-vl-450m")
model = cactus_init("weights/lfm2-rag", corpus_dir="./documents")| Parameter | Type | Description |
|---|---|---|
model |
handle | Model handle from cactus_init |
messages |
list|str |
List of message dicts or JSON string |
tools |
list |
Optional tool definitions for function calling |
temperature |
float |
Sampling temperature |
top_p |
float |
Top-p sampling |
top_k |
int |
Top-k sampling |
max_tokens |
int |
Maximum tokens to generate |
stop_sequences |
list |
Stop sequences |
include_stop_sequences |
bool |
Include matched stop sequences in output (default: False) |
force_tools |
bool |
Constrain output to tool call format |
tool_rag_top_k |
int |
Select top-k relevant tools via Tool RAG (default: 2, 0 = use all tools) |
confidence_threshold |
float |
Minimum confidence for local generation (default: 0.7, triggers cloud_handoff when below) |
callback |
fn |
Streaming callback fn(token, token_id, user_data) |
```python
# Basic completion
messages = [{"role": "user", "content": "Hello!"}]
response = cactus_complete(model, messages, max_tokens=100)
print(json.loads(response)["response"])

# Completion with tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
    }
}]
response = cactus_complete(model, messages, tools=tools)

cactus_complete(model, messages, callback=on_token)
```

Response format (all fields always present):
```json
{
  "success": true,
  "error": null,
  "cloud_handoff": false,
  "response": "Hello! How can I help?",
  "function_calls": [],
  "confidence": 0.85,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 163.7,
  "prefill_tps": 619.5,
  "decode_tps": 168.4,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}
```

Cloud handoff response (when model detects low confidence):
```json
{
  "success": false,
  "error": null,
  "cloud_handoff": true,
  "response": null,
  "function_calls": [],
  "confidence": 0.18,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 45.2,
  "prefill_tps": 619.5,
  "decode_tps": 0.0,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 0,
  "total_tokens": 28
}
```
- When `cloud_handoff` is `True`, the model's confidence dropped below `confidence_threshold` (default: 0.7) and it recommends deferring to a cloud-based model for better results.
- You must NOT rely on this alone; hackers must design custom strategies to fall back to cloud that maximize on-device usage and correctness while minimizing end-to-end latency! A minimal sketch of one such strategy follows.
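Below is a minimal sketch of a local-first, validate-then-fallback strategy, assuming a low local `confidence_threshold`, a hypothetical `looks_valid` sanity check, and the `google-genai` client for the cloud leg. It is not the required `generate_hybrid` signature from `main.py`, just an illustration of the pattern:

```python
import json

from cactus import cactus_complete
from google.genai import types

def looks_valid(calls: list) -> bool:
    # Hypothetical stand-in for your own validation (argument ranges,
    # keyword cross-checks, schema checks, ...).
    return bool(calls) and all("name" in c for c in calls)

def local_or_cloud(model, gemini_client, messages, tools):
    """Run FunctionGemma first with a very low confidence_threshold so the
    built-in handoff rarely fires, then apply custom validation to decide
    whether the local tool call is trustworthy; otherwise ask Gemini."""
    raw = cactus_complete(
        model, messages, tools=tools,
        force_tools=True,
        tool_rag_top_k=0,          # pass every tool instead of Tool RAG
        confidence_threshold=0.1,  # keep the routing decision in our hands
    )
    result = json.loads(raw)
    calls = result.get("function_calls") or []
    if result["success"] and not result["cloud_handoff"] and looks_valid(calls):
        return calls, "on_device"

    # Cloud fallback: Gemini 2.5 Flash with the same tool declarations.
    response = gemini_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=messages[-1]["content"],
        config=types.GenerateContentConfig(
            tools=[types.Tool(function_declarations=tools)],
        ),
    )
    cloud_calls = [
        {"name": part.function_call.name, "arguments": dict(part.function_call.args)}
        for part in response.candidates[0].content.parts
        if part.function_call
    ]
    return cloud_calls, "cloud"

# Usage (requires GEMINI_API_KEY; the weights path is illustrative):
#   from cactus import cactus_init
#   from google import genai
#   fg = cactus_init("weights/functiongemma-270m-it")
#   calls, route = local_or_cloud(fg, genai.Client(), messages, tools)
```

The actual `generate_hybrid` in `main.py` must keep its existing signature; a pattern like this would live inside its body.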
| Parameter | Type | Description |
|---|---|---|
| `model` | handle | Whisper model handle |
| `audio_path` | `str` | Path to audio file (WAV) |
| `prompt` | `str` | Whisper prompt for language/task |

```python
whisper = cactus_init("weights/whisper-small")
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
response = cactus_transcribe(whisper, "audio.wav", prompt=prompt)
print(json.loads(response)["response"])
cactus_destroy(whisper)
```

| Parameter | Type | Description |
|---|---|---|
| `model` | handle | Model handle |
| `text` | `str` | Text to embed |
| `normalize` | `bool` | L2-normalize embeddings (default: False) |

```python
embedding = cactus_embed(model, "Hello world")
print(f"Dimension: {len(embedding)}")
```

Reset model state (clear KV cache). Call between unrelated conversations.

```python
cactus_reset(model)
```

Stop an ongoing generation (useful with streaming callbacks).

```python
cactus_stop(model)
```

Free model memory. Always call when done.

```python
cactus_destroy(model)
```

Get the last error message, or `None` if no error.

```python
error = cactus_get_last_error()
if error:
    print(f"Error: {error}")
```

Query RAG corpus for relevant text chunks. Requires a model initialized with `corpus_dir`.
| Parameter | Type | Description |
|---|---|---|
| `model` | handle | Model handle (must have `corpus_dir` set) |
| `query` | `str` | Query text |
| `top_k` | `int` | Number of chunks to retrieve (default: 5) |

```python
model = cactus_init("weights/lfm2-rag", corpus_dir="./documents")
chunks = cactus_rag_query(model, "What is machine learning?", top_k=3)
for chunk in chunks:
    print(f"Score: {chunk['score']:.2f} - {chunk['text'][:100]}...")
```

- Join the Reddit channel, ask any technical questions there.
- To gain some technical insights on AI, check out the Maths, CS & AI Compendium.
