| title | Cactus Python SDK |
|---|---|
| description | Python bindings for Cactus on-device AI inference engine. Supports chat completion, vision, transcription, embeddings, RAG, tool calling, and streaming. |
| keywords | |
Python bindings for Cactus Engine via FFI, auto-installed when you run `source ./setup`.

Model weights: pre-converted weights for all supported models are at huggingface.co/Cactus-Compute.
```bash
git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
cactus build --python

# Download models (CLI)
cactus download LiquidAI/LFM2-VL-450M
cactus download openai/whisper-small

# Optional: set your Cactus Cloud API key for automatic cloud fallback
cactus auth
```

```python
from src.downloads import ensure_model
from src.cactus import cactus_init, cactus_complete, cactus_destroy
import json

# Downloads weights from HuggingFace if not already present
weights = ensure_model("LiquidAI/LFM2-VL-450M")
model = cactus_init(str(weights), None, False)

messages = json.dumps([{"role": "user", "content": "What is 2+2?"}])
result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

cactus_destroy(model)
```

All functions are module-level and mirror the C FFI directly. Handles are plain `int` values (C pointers).
Download pre-converted weights programmatically (no CLI needed):
```python
from src.downloads import ensure_model, get_weights_dir, download_from_hf

# ensure_model downloads if missing, returns Path to weights dir
weights = ensure_model("openai/whisper-tiny")

# Or check / download manually
weights_dir = get_weights_dir("openai/whisper-tiny")  # -> Path("weights/whisper-tiny")
download_from_hf("openai/whisper-tiny", weights_dir, precision="INT4")  # -> bool
```

```python
model = cactus_init(model_path: str, corpus_dir: str | None, cache_index: bool) -> int
cactus_destroy(model: int)
cactus_reset(model: int)  # clear KV cache
cactus_stop(model: int)   # abort ongoing generation
cactus_get_last_error() -> str | None
```

`cactus_complete` returns a JSON string with:
- `success`, `error`, `cloud_handoff`, `response`
- `thinking` (optional; present only when the model emits chain-of-thought content, placed before `function_calls`)
- `function_calls`
- `segments` (always `[]` for completion; populated only in transcription responses)
- `confidence`
- timing stats: `time_to_first_token_ms`, `total_time_ms`, `prefill_tps`, `decode_tps`, `ram_usage_mb`
- token counts: `prefill_tokens`, `decode_tokens`, `total_tokens`
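As a sketch, a small helper that unpacks these fields from the returned JSON. The sample payload below is illustrative, not real engine output:

```python
import json

def unpack_completion(result_json: str) -> dict:
    """Parse a cactus_complete response and surface the useful fields."""
    result = json.loads(result_json)
    if not result["success"]:
        raise RuntimeError(result["error"])
    return {
        "text": result["response"],
        "thinking": result.get("thinking"),  # only present for chain-of-thought models
        "function_calls": result["function_calls"],
        "from_cloud": result["cloud_handoff"],
        "decode_tps": result["decode_tps"],
    }

# Illustrative payload in the documented shape
sample = json.dumps({
    "success": True, "error": None, "cloud_handoff": False,
    "response": "4", "function_calls": [], "segments": [],
    "confidence": 0.92, "time_to_first_token_ms": 45.2,
    "total_time_ms": 163.7, "prefill_tps": 619.5, "decode_tps": 168.4,
    "ram_usage_mb": 512.3, "prefill_tokens": 28, "decode_tokens": 12,
    "total_tokens": 40,
})
print(unpack_completion(sample)["text"])  # -> 4
```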
```python
result_json = cactus_complete(
    model: int,
    messages_json: str,        # JSON array of {role, content}
    options_json: str | None,  # optional inference options
    tools_json: str | None,    # optional tool definitions
    callback: Callable[[str, int], None] | None  # streaming token callback
) -> str
```

```python
# With options and streaming
options = json.dumps({"max_tokens": 256, "temperature": 0.7})

def on_token(token, token_id):
    print(token, end="", flush=True)

result = json.loads(cactus_complete(model, messages_json, options, None, on_token))
if result["cloud_handoff"]:
    # response already contains cloud result
    pass
```

Response format:
```json
{
  "success": true,
  "error": null,
  "cloud_handoff": false,
  "response": "4",
  "function_calls": [],
  "segments": [],
  "confidence": 0.92,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 163.7,
  "prefill_tps": 619.5,
  "decode_tps": 168.4,
  "ram_usage_mb": 512.3,
  "prefill_tokens": 28,
  "decode_tokens": 12,
  "total_tokens": 40
}
```

`cactus_prefill` pre-processes input text and populates the KV cache without generating output tokens, reducing latency for subsequent calls to `cactus_complete`.
```python
cactus_prefill(
    model: int,
    messages_json: str,        # JSON array of {role, content}
    options_json: str | None,  # optional inference options
    tools_json: str | None     # optional tool definitions
) -> None
```

```python
tools = json.dumps([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City, State, Country"}
            },
            "required": ["location"]
        }
    }
}])

messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"}
])
cactus_prefill(model, messages, None, tools)

completion_messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"},
    {"role": "user", "content": "What about SF?"}
])
result = json.loads(cactus_complete(model, completion_messages, None, tools, None))
```

Response format:
```json
{
  "success": true,
  "error": null,
  "prefill_tokens": 25,
  "prefill_tps": 166.1,
  "total_time_ms": 150.5,
  "ram_usage_mb": 245.67
}
```

`cactus_transcribe` returns a JSON string. Use `json.loads()` to access the `response` field (transcribed text), the `segments` array of timestamped segments `{"start": <sec>, "end": <sec>, "text": "<str>"}`, and other metadata. Segment granularity depends on the model:
- Whisper: phrase-level, from timestamp tokens
- Parakeet TDT: word-level, from frame timing
- Parakeet CTC and Moonshine: one segment per transcription window (consecutive VAD speech regions up to 30s)
```python
result_json = cactus_transcribe(
    model: int,
    audio_path: str | None,
    prompt: str | None,
    options_json: str | None,
    callback: Callable[[str, int], None] | None,
    pcm_data: bytes | None
) -> str
```

Custom vocabulary biases the decoder toward domain-specific words (supported for Whisper and Moonshine models). Pass `custom_vocabulary` and `vocabulary_boost` in `options_json`:
```python
options = json.dumps({
    "custom_vocabulary": ["Omeprazole", "HIPAA", "Cactus"],
    "vocabulary_boost": 3.0
})
result = json.loads(cactus_transcribe(model, "medical_notes.wav", None, options, None, None))
```

Streaming transcription also returns JSON strings:
```python
stream = cactus_stream_transcribe_start(model: int, options_json: str | None) -> int
partial_json = cactus_stream_transcribe_process(stream: int, pcm_data: bytes) -> str
final_json = cactus_stream_transcribe_stop(stream: int) -> str
```

In `cactus_stream_transcribe_process` responses:
- `confirmed`: stable text from segments that have been finalised across two consecutive decode passes (potentially replaced by a cloud result)
- `confirmed_local`: the same text before any cloud substitution
- `pending`: the current window's unconfirmed transcription text
- `segments`: timestamped segments for the current audio window
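A sketch of assembling display text from a partial streaming response, using the field shapes documented above (the payload below is illustrative):

```python
import json

def merge_partial(partial_json: str) -> str:
    """Combine stable and in-flight text from a cactus_stream_transcribe_process response."""
    partial = json.loads(partial_json)
    # confirmed text is stable; pending may still be revised on the next decode pass
    return (partial["confirmed"] + partial["pending"]).strip()

# Illustrative partial response in the documented shape
partial = json.dumps({
    "confirmed": "hello world ",
    "confirmed_local": "hello world ",
    "pending": "this is",
    "segments": [{"start": 0.0, "end": 1.2, "text": "hello world"}],
})
print(merge_partial(partial))  # -> hello world this is
```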
```python
result = json.loads(cactus_transcribe(model, "/path/to/audio.wav", None, None, None, None))
print(result["response"])
for seg in result["segments"]:
    print(f"[{seg['start']:.3f}s - {seg['end']:.3f}s] {seg['text']}")
```

Streaming also accepts `custom_vocabulary` in the options passed to `cactus_stream_transcribe_start`. The bias is applied for the lifetime of the stream session.
```python
embedding = cactus_embed(model: int, text: str, normalize: bool) -> list[float]
embedding = cactus_image_embed(model: int, image_path: str) -> list[float]
embedding = cactus_audio_embed(model: int, audio_path: str) -> list[float]
```

```python
tokens = cactus_tokenize(model: int, text: str) -> list[int]
```
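Embedding vectors can be compared client-side. For example, a minimal cosine-similarity helper in plain Python, independent of the SDK:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compare two embedding vectors, e.g. returned by cactus_embed(model, text, True)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# With normalize=True the vectors are unit-length, so the dot product
# alone would suffice; cosine_similarity works either way.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0
```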
```python
result_json = cactus_score_window(model: int, tokens: list[int], start: int, end: int, context: int) -> str
```

```python
result_json = cactus_detect_language(
    model: int,
    audio_path: str | None,
    options_json: str | None,
    pcm_data: bytes | None
) -> str
```

Returns a JSON string with fields: `success`, `error`, `language` (BCP-47 code), `language_token`, `token_id`, `confidence`, `entropy`, `total_time_ms`, `ram_usage_mb`.
```python
result_json = cactus_vad(
    model: int,
    audio_path: str | None,
    options_json: str | None,
    pcm_data: bytes | None
) -> str
```

Returns a JSON string: `{"success":true,"error":null,"segments":[{"start":<sample_index>,"end":<sample_index>},...],"total_time_ms":...,"ram_usage_mb":...}`. VAD segments contain only `start` and `end` as integer sample indices; there is no `text` field.
```python
result_json = cactus_diarize(
    model: int,
    audio_path: str | None,
    options_json: str | None,
    pcm_data: bytes | None
) -> str
```

Options (all optional):
- `step_ms` (int, default 1000) — sliding window stride in milliseconds
- `threshold` (float) — zero out per-speaker scores below this value (`segmentation.threshold` in the Python pipeline)
- `num_speakers` (int) — keep only the N most active speakers
- `min_speakers` (int) — minimum number of speakers to retain
- `max_speakers` (int) — maximum number of speakers to retain

Returns `{"success":true,"error":null,"num_speakers":3,"scores":[...],"total_time_ms":...,"ram_usage_mb":...}`. The `scores` field is a flat array of T×3 float32 values (index `f*3+s`), one per output frame per speaker, each in [0,1].
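A sketch of post-processing the flat `scores` array into a dominant speaker per frame (the score values below are illustrative):

```python
def dominant_speakers(scores: list[float], num_speakers: int) -> list[int]:
    """Map the flat T x num_speakers score array to the highest-scoring speaker per frame."""
    frames = len(scores) // num_speakers
    out = []
    for f in range(frames):
        # row f holds this frame's scores, indexed f * num_speakers + s
        row = scores[f * num_speakers:(f + 1) * num_speakers]
        out.append(max(range(num_speakers), key=lambda s: row[s]))
    return out

# Two frames, three speakers (illustrative values in [0, 1])
print(dominant_speakers([0.1, 0.8, 0.0, 0.7, 0.2, 0.1], 3))  # -> [1, 0]
```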
```python
result_json = cactus_embed_speaker(
    model: int,
    audio_path: str | None,
    options_json: str | None,
    pcm_data: bytes | None
) -> str
```

Returns a JSON string: `{"success":true,"error":null,"embedding":[<float>, ...],"total_time_ms":...,"ram_usage_mb":...}`. The embedding is a 256-dimensional speaker vector from the WeSpeaker ResNet34-LM model.
```python
result_json = cactus_rag_query(model: int, query: str, top_k: int) -> str
```

Returns a JSON string with a `chunks` array. Each chunk has `score` (float), `source` (str, from document metadata), and `content` (str):

```json
{
  "chunks": [
    {"score": 0.0142, "source": "doc.txt", "content": "relevant passage..."}
  ]
}
```

```python
index = cactus_index_init(index_dir: str, embedding_dim: int) -> int
cactus_index_add(index: int, ids: list[int], documents: list[str],
                 embeddings: list[list[float]], metadatas: list[str] | None)
cactus_index_delete(index: int, ids: list[int])
result_json = cactus_index_get(index: int, ids: list[int]) -> str
result_json = cactus_index_query(index: int, embedding: list[float], options_json: str | None) -> str
cactus_index_compact(index: int)
cactus_index_destroy(index: int)
```

`cactus_index_query` returns `{"results":[{"id":<int>,"score":<float>}, ...]}`. `cactus_index_get` returns `{"results":[{"document":"...","metadata":<str|null>,"embedding":[...]}, ...]}`.
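A sketch of ranking `cactus_index_query` results client-side. The response below is illustrative, and the sort is defensive in case results are not already ordered by score:

```python
import json

def top_ids(query_result_json: str, k: int) -> list[int]:
    """Extract the top-k document ids from a cactus_index_query response."""
    results = json.loads(query_result_json)["results"]
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["id"] for r in ranked[:k]]

# Illustrative response in the documented shape
response = json.dumps({"results": [
    {"id": 7, "score": 0.41}, {"id": 3, "score": 0.87}, {"id": 9, "score": 0.12},
]})
print(top_ids(response, 2))  # -> [3, 7]
```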
```python
cactus_log_set_level(level: int)  # 0=DEBUG 1=INFO 2=WARN (default) 3=ERROR 4=NONE
cactus_log_set_callback(callback: Callable[[int, str, str], None] | None)
```

```python
cactus_set_telemetry_environment(cache_location: str)
cactus_set_app_id(app_id: str)
cactus_telemetry_flush()
cactus_telemetry_shutdown()
```

Functions that return a value raise `RuntimeError` on failure. `cactus_prefill`, `cactus_index_add`, `cactus_index_delete`, and `cactus_index_compact` also raise `RuntimeError` on failure despite not returning a value. Truly void functions that never raise: `cactus_destroy`, `cactus_reset`, `cactus_stop`, `cactus_index_destroy`, and the logging and telemetry functions.
Pass images in the messages content for vision-language models:

```python
messages = json.dumps([{
    "role": "user",
    "content": "Describe this image",
    "images": ["path/to/image.png"]
}])
result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])
```

The Graph API provides a tensor computation graph for building and executing dataflow pipelines on the Cactus kernel layer:
```python
from src.graph import Graph
import numpy as np

g = Graph()
a = g.input((2, 2))
b = g.input((2, 2))
y = ((a - b) * (a + b)).abs().pow(2.0).view((4,))

g.set_input(a, np.array([[2, 4], [6, 8]], dtype=np.float16))
g.set_input(b, np.array([[1, 2], [3, 4]], dtype=np.float16))
g.execute()
print(y.numpy())  # [9. 144. 729. 2304.]
```

Supported ops: `+`, `-`, `*`, `/`, `abs`, `pow`, `view`, `flatten`, `concat`, `cat`, `relu`, `sigmoid`, `tanh`, `gelu`, `softmax`.
Run the full test suite:

```bash
python python/test.py     # compact output
python python/test.py -v  # verbose
```

Tests are in `python/tests/`:
- `test_graph.py` — Graph elementwise, composed, tensor, activation, and softmax ops
- `test_model.py` — VLM completion/embeddings, Whisper transcription/embeddings (auto-downloads weights if missing)
- Cactus Engine API — Full C API reference that the Python bindings wrap
- Cactus Index API — Vector database API for RAG applications
- Fine-tuning Guide — Train and deploy custom LoRA fine-tunes
- Runtime Compatibility — Weight versioning across releases
- Swift SDK — Swift bindings for iOS/macOS
- Kotlin/Android SDK — Kotlin bindings for Android
- Flutter SDK — Dart bindings for cross-platform mobile