This note documents a small end‑to‑end experiment: run a local Qwen3‑VL multimodal model via llama.cpp’s llama-server, use it as a VLM classifier on EuroSAT (satellite land‑use classes), and then use DSPy’s MIPROv2 to optimize the prompt + few‑shot demos.
This project was the first executed primarily over Telegram, over the course of a day, with Krusty the Krabs. Apoorva set the goals, gave Krusty code ‘starters’, and directed the iteration choices, while Krusty (the OpenClaw assistant) put together the scripts, ran the experiments, and wrote the first draft of this note; Apoorva then read the code, re-ran some of it, and edited out the chatgpt smells (including the customary ‘the goal is not X; it’s Y’ phrasing and many unnecessary details). krusty-science will become a semi-frequent genre of post here, since I’d rather inflict these ideas on a friendly group of long-suffering friends than pollute the commons by putting them on the arxiv.
Another adjacent goal here was to give the main Krusty the Krabs a mini LLM with reasonably fast inference to call for menial tasks. We’ll call it SideshowSpongebob, continuing with the cursed crossover world.
0) Setup: llama.cpp for local inference
llama.cpp is a great way to run quantized local models with good performance, and it ships with a built‑in HTTP server (llama-server) that exposes an OpenAI‑compatible API. That API makes it easy to connect the model to DSPy via its `dspy.LM` wrapper.
On a Mac mini:
```
brew install llama-cpp
```

(Elsewhere you might build from source, e.g. for CUDA on Linux.)
1) Local LLM / VLM setup
Models
For this project, we used Qwen3‑VL 8B (multimodal) GGUF + mmproj:
- Model weights: `Qwen3VL-8B-Instruct-Q4_K_M.gguf`
- Multimodal projector: `mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf`
Server
We run the model behind llama.cpp’s HTTP server.
Default endpoints:
- health: `http://127.0.0.1:8092/health`
- OpenAI‑style chat: `http://127.0.0.1:8092/v1/chat/completions`
Start (in a dedicated terminal):

```
PORT=8092 scripts/start_qwen_vlm_server.sh
```

Notes:
- The server is started with `--media-path $PWD` so we can pass local images as `file://...` URLs.
- We set `--reasoning-format none` and `--reasoning-budget 0` to avoid “reasoning_content‑only” responses.
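To make the request shape concrete, here is a stdlib-only sketch of the OpenAI-style multimodal chat payload the server accepts, with the image passed as a `file://` URL (which works because of `--media-path`). The model name and image path are illustrative placeholders, not values from the actual scripts.

```python
import json

def build_vlm_request(image_path: str, prompt: str) -> dict:
    # OpenAI-style chat payload with one user turn containing an image
    # part (as a file:// URL) and a text part.
    return {
        "model": "qwen3-vl-8b",  # placeholder; llama-server serves whatever model it loaded
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"file://{image_path}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 64,
        "temperature": 0.0,
    }

payload = build_vlm_request("images/Forest_1004.jpg", "Classify this image.")
body = json.dumps(payload)  # POST this to http://127.0.0.1:8092/v1/chat/completions
```

DSPy builds an equivalent request for you under the hood; this is just what goes over the wire.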
2) Task definition: EuroSAT classification
We used EuroSAT RGB (Sentinel‑2 land use / land cover):
- Download source: https://madm.dfki.de/files/sentinel/EuroSAT.zip
EuroSAT has 10 classes. In the scripts we normalize them to lowercase tokens:
`annualcrop`, `forest`, `herbaceousvegetation`, `highway`, `industrial`, `pasture`, `permanentcrop`, `residential`, `river`, `sealake`
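Since EuroSAT filenames embed the class in CamelCase (e.g. `SeaLake_2323.jpg`), normalization is just a lowercase of the part before the underscore. A minimal sketch (the function name is ours, not from the scripts):

```python
LABELS = {
    "annualcrop", "forest", "herbaceousvegetation", "highway", "industrial",
    "pasture", "permanentcrop", "residential", "river", "sealake",
}

def label_from_filename(name: str) -> str:
    # EuroSAT files look like "SeaLake_2323.jpg": class name, underscore, index.
    # Lowercasing the class part yields the normalized token.
    token = name.split("_", 1)[0].lower()
    if token not in LABELS:
        raise ValueError(f"unknown EuroSAT class in {name!r}")
    return token
```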
Example images
Demos (these were the exact images MIPROv2 selected as few‑shot examples):


A few additional examples:






Prediction format
We constrain the model to output:
- exactly one of the 10 labels
- and nothing else (no explanation)
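A metric enforcing this contract can be a strict exact match after trimming and lowercasing, so any explanation text counts as a miss. A sketch in DSPy's `(example, prediction, trace)` metric shape (the function name is ours):

```python
from types import SimpleNamespace

def eurosat_metric(example, prediction, trace=None) -> bool:
    # Exact match on the normalized label; extra text makes it a miss,
    # enforcing the "one label and nothing else" contract.
    pred = (getattr(prediction, "label", "") or "").strip().lower()
    return pred == example.label

# Tiny stand-ins for DSPy example/prediction objects:
ex = SimpleNamespace(label="forest")
hit = eurosat_metric(ex, SimpleNamespace(label="Forest \n"))
miss = eurosat_metric(ex, SimpleNamespace(label="forest, because it is green"))
```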
3) DSPy + MIPROv2 basics
DSPy gives you:
- a Signature (typed I/O interface)
- a Program/Module composed of predictors (`dspy.Predict`, `dspy.ChainOfThought`, etc.)
- a metric to score outputs
- an optimizer/teleprompter to search over prompt/program configurations
MIPROv2 (high level)
MIPROv2 does (roughly):
- Bootstrap candidate few‑shot demo sets (using the trainset)
- Propose a few candidate instructions/prompts
- Search combinations of (instruction, demo set) using Bayesian optimization (Optuna) to maximize the metric on the valset
MIPROv2 returns an optimized DSPy program, which you can .save() to JSON and reload later.
4) EuroSAT exercise (baseline → optimized)
4.1 Program we started with
Starting instruction (baseline):

```
Classify a EuroSAT RGB satellite image.
Choose exactly one label from label_set. Reply with ONLY the label.
```

Starting demos: none (zero‑shot).
4.2 Program we ended up with
MIPROv2 produced an optimized program state saved to miprov2_eurosat_optimized.json.
Key differences:
- Updated instruction (more directive):

```
You are an advanced satellite imagery analyst deployed by a global environmental agency.
Your mission is to classify a critical satellite image from the EuroSAT dataset into one of the 10 land cover categories:
annualcrop, forest, herbaceousvegetation, highway, industrial, pasture, permanentcrop, residential, river, sealake.
This classification determines the next phase of environmental monitoring and disaster response.
You must choose exactly one label from the provided label_set — no ambiguity, no extra text.
Your response must be the exact label.
Accuracy is paramount — misclassification could affect millions of people. Proceed with full confidence and precision.
```

The optimized prompt relies on flattery and bigging up the importance of the task. Great; LLMs are just like us.
- Selected few‑shot demos (2 labeled examples):
  - `SeaLake_2323.jpg` → `sealake`
  - `HerbaceousVegetation_1883.jpg` → `herbaceousvegetation`
4.3 Results
On a small dev split (train_n=80, val_n=80, auto=light):
- baseline accuracy: 0.200
- optimized accuracy: 0.350
(+15pp absolute.)
Practical notes / gotchas
- Keep the server running. Connection‑refused errors usually mean `llama-server` isn’t up on `:8092`.
- Token limits. DSPy’s adapters require structured output markers. If `max_tokens` is too small, you’ll get truncation and parsing failures.
- `file://` vs base64 images. For the optimizer run we used `file://...` URLs (smaller prompts) and relied on `--media-path`.
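For reference, the base64 alternative inlines the image as a data URL in the same `image_url` slot, which avoids `--media-path` at the cost of much larger prompts. A stdlib-only sketch (the function name is ours):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    # Inline the image as a base64 data URL; no --media-path needed,
    # but the encoded image is sent with every request.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"
```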