gist

This note documents a small end‑to‑end experiment: run a local Qwen3‑VL multimodal model via llama.cpp’s llama-server, use it as a VLM classifier on EuroSAT (satellite land‑use classes), and then use DSPy’s MIPROv2 to optimize the prompt + few‑shot demos.

This project was the first executed primarily over the course of a day via Telegram with Krusty the Krabs. Apoorva set the goals, gave Krusty code ‘starters’, and directed the iteration choices, while Krusty (the OpenClaw assistant) put together the scripts, ran the experiments, and wrote the first draft of this note; Apoorva then read the code, re-ran some of it, and edited out the chatgpt smells (including the customary ‘the goal is not X; it’s Y’ phrasing and many unnecessary details). krusty-science will become a semi-frequent genre of post here, since I’d rather inflict these ideas on a friendly group of long-suffering friends than pollute the commons by putting them on the arxiv.

An adjacent goal was to give the main Krusty the Krabs a mini LLM with reasonably fast inference to call for menial tasks. We’ll call it SideshowSpongebob, continuing the cursed crossover world.


0) Setup: llama.cpp for local inference

llama.cpp is a great way to run quantized local models with good performance, and it ships with a built‑in HTTP server (llama-server) that exposes an OpenAI‑compatible API. That API plugs straight into DSPy’s dspy.LM wrapper.

On a Mac mini:

brew install llama.cpp

(Elsewhere you might build from source, e.g. for CUDA on Linux.)


1) Local LLM / VLM setup

Models

For this project, we used Qwen3‑VL 8B (multimodal) GGUF + mmproj:

  • Model weights: Qwen3VL-8B-Instruct-Q4_K_M.gguf
  • Multimodal projector: mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf

Server

We run the model behind llama.cpp’s HTTP server.

Default endpoint:

  • health: http://127.0.0.1:8092/health
  • OpenAI‑style chat: http://127.0.0.1:8092/v1/chat/completions

Start (in a dedicated terminal):

PORT=8092 scripts/start_qwen_vlm_server.sh

Notes:

  • The server is started with --media-path $PWD so we can pass local images as file://... URLs.
  • We set --reasoning-format none and --reasoning-budget 0 to avoid “reasoning_content‑only” responses.
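To sanity-check the server outside DSPy, you can build a raw OpenAI-style chat request with a local image. A sketch of the payload construction only (POST it to /v1/chat/completions with curl or urllib once the server is up; the image path here is hypothetical):

```python
import json

def vlm_payload(image_path: str, question: str) -> dict:
    """OpenAI-style chat payload. llama-server resolves file:// URLs
    against the directory given via --media-path."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"file://{image_path}"}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 64,
    }

payload = vlm_payload("data/eurosat/Forest_1.jpg", "What land cover is this?")
body = json.dumps(payload)  # this is the request body for /v1/chat/completions
```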

2) Task definition: EuroSAT classification

We used EuroSAT RGB (Sentinel‑2 land use / land cover). It has 10 classes, which the scripts normalize to lowercase tokens:

  • annualcrop
  • forest
  • herbaceousvegetation
  • highway
  • industrial
  • pasture
  • permanentcrop
  • residential
  • river
  • sealake
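The normalization is just case-folding and removing spaces from the EuroSAT directory names; a minimal sketch (function and variable names are my choices):

```python
def normalize_label(name: str) -> str:
    """Map EuroSAT class/directory names (e.g. "AnnualCrop", "SeaLake")
    to the lowercase tokens used in the scripts."""
    return name.strip().lower().replace(" ", "")

LABELS = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]
LABEL_SET = {normalize_label(l) for l in LABELS}
```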

Example images

Demos (these were the exact images MIPROv2 selected as few‑shot examples):

A few additional examples:

Prediction format

We constrain the model to output:

  • exactly one of the 10 labels
  • and nothing else (no explanation)
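Small models occasionally wrap the label in extra text anyway, so a forgiving parser helps when scoring. One way to sketch it (my assumption: accept a reply if exactly one label appears anywhere in it, otherwise treat it as invalid):

```python
from typing import Optional

EUROSAT_LABELS = {
    "annualcrop", "forest", "herbaceousvegetation", "highway", "industrial",
    "pasture", "permanentcrop", "residential", "river", "sealake",
}

def parse_label(reply: str) -> Optional[str]:
    """Return a valid label, or None if the reply can't be mapped to one."""
    token = reply.strip().lower()
    if token in EUROSAT_LABELS:
        return token
    # fall back: accept the reply only if exactly one label is mentioned
    hits = [lab for lab in EUROSAT_LABELS if lab in token]
    return hits[0] if len(hits) == 1 else None
```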

3) DSPy + MIPROv2 basics

DSPy gives you:

  • a Signature (typed I/O interface)
  • a Program/Module composed of predictors (dspy.Predict, dspy.ChainOfThought, etc.)
  • a metric to score outputs
  • an optimizer/teleprompter to search over prompt/program configurations
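For this task the metric is just exact match on the normalized label. A sketch, using SimpleNamespace stand-ins for DSPy's Example/Prediction objects (a real DSPy metric receives those plus an optional trace argument):

```python
from types import SimpleNamespace

def eurosat_metric(example, pred, trace=None):
    """DSPy-style metric: 1.0 iff the predicted label matches exactly
    after normalization."""
    guess = str(getattr(pred, "label", "")).strip().lower()
    return float(guess == example.label)

# quick check with stand-ins for dspy.Example / dspy.Prediction
ex = SimpleNamespace(label="forest")
good = SimpleNamespace(label="Forest ")
bad = SimpleNamespace(label="pasture")
```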

MIPROv2 (high level)

MIPROv2 does (roughly):

  1. Bootstrap candidate few‑shot demo sets (using the trainset)
  2. Propose a few candidate instructions/prompts
  3. Search combinations of (instruction, demo set) using Bayesian optimization (Optuna) to maximize the metric on the valset

MIPROv2 returns an optimized DSPy program, which you can .save() to JSON and reload later.


4) EuroSAT exercise (baseline → optimized)

4.1 Program we started with

Starting instruction (baseline):

Classify a EuroSAT RGB satellite image.
 
Choose exactly one label from label_set. Reply with ONLY the label.

Starting demos: none (zero‑shot).

4.2 Program we ended up with

MIPROv2 produced an optimized program state saved to miprov2_eurosat_optimized.json.

Key differences:

  1. Updated instruction (more directive):
You are an advanced satellite imagery analyst deployed by a global environmental agency. 
Your mission is to classify a critical satellite image from the EuroSAT dataset into one of the 10 land cover categories: 
annualcrop, forest, herbaceousvegetation, highway, industrial, pasture, permanentcrop, residential, river, sealake. 
This classification determines the next phase of environmental monitoring and disaster response. 
You must choose exactly one label from the provided label_set — no ambiguity, no extra text. 
Your response must be the exact label. 
Accuracy is paramount — misclassification could affect millions of people. Proceed with full confidence and precision.

The optimized prompt relies on flattery and bigging up the importance of the task. Great; LLMs are just like us.

  2. Selected few‑shot demos (2 labeled examples):
  • SeaLake_2323.jpg → sealake
  • HerbaceousVegetation_1883.jpg → herbaceousvegetation

4.3 Results

On a small dev split (train_n=80, val_n=80, auto=light):

  • baseline accuracy: 0.200
  • optimized accuracy: 0.350

(+15pp absolute.)


Practical notes / gotchas

  1. Keep the server running. Connection‑refused errors usually mean llama-server isn’t up on :8092.

  2. Token limits. DSPy’s adapters require structured output markers; if max_tokens is too small, you’ll get truncation and parsing failures.

  3. file:// vs base64 images. For the optimizer run we used file://... URLs (smaller prompts) and relied on --media-path.