Text-to-MongoDB: Fine-Tuning a 7B Model to Generate Queries from Natural Language
I built a system that takes a plain English question and a collection schema, and produces a valid MongoDB find or aggregate query. It runs locally on a single GPU with ~1 second latency.
Try it on Google Colab | Model on HuggingFace | GitHub
The Problem
LLMs can generate MongoDB queries, but they hallucinate fields, misuse operators, and require expensive API calls. I wanted something that runs locally, respects the schema, and is actually reliable.
How It Works
```mermaid
graph TD
A["🗣️ User: Find orders over $100 shipped last month"] --> B
B["📋 Schema: orders collection\nFields: order_id, total, status, created_at\nAllowed ops: $match, $gte..."] --> C
C["🧠 Qwen 7B + LoRA adapter\n4-bit quantized, ~5GB VRAM"] --> D
D["📤 MongoDB Query"]
```
Output:
```json
{
  "type": "find",
  "filter": {
    "total": {"$gt": 100},
    "status": "shipped",
    "created_at": {"$gte": {"$date": "2025-01-01"}}
  }
}
```
The model receives the collection schema (field names, types, roles, enum values) and a list of allowed operators, then composes a query that respects both. The key insight is that it learns to read the schema, not memorize query patterns.
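To make the input format concrete, here is a minimal sketch of how a prompt could be assembled from a schema and an operator allowlist. The field layout and prompt wording are illustrative assumptions, not the exact training prompt.

```python
def build_prompt(question, schema, allowed_ops):
    """Compose the model input: schema first, then constraints, then the question.

    The schema dict shape here (name/type/role per field) is an assumption
    for illustration; the real prompt template may differ.
    """
    field_lines = "\n".join(
        f"- {f['name']} ({f['type']}, {f['role']})" for f in schema["fields"]
    )
    return (
        f"Collection: {schema['collection']}\n"
        f"Fields:\n{field_lines}\n"
        f"Allowed operators: {', '.join(allowed_ops)}\n"
        f"Question: {question}\n"
        "Query (JSON):"
    )

schema = {
    "collection": "orders",
    "fields": [
        {"name": "total", "type": "number", "role": "measure"},
        {"name": "status", "type": "string", "role": "category"},
    ],
}
prompt = build_prompt(
    "Find shipped orders over $100", schema, ["$match", "$gte", "$gt"]
)
```

Putting the schema and allowlist before the question lets the model condition its output on both constraints before it sees what is being asked.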
Results
I fine-tuned Qwen2.5-Coder-7B-Instruct using 4-bit QLoRA on ~1,300 synthetic examples. No human-labeled data — the entire dataset is generated programmatically.
| Model | Split | Syntax | Operators | Fields | Overall |
|---|---|---|---|---|---|
| Qwen 7B baseline (zero-shot) | eval (145) | 51.0% | 51.0% | 40.0% | 40.0% |
| Qwen 7B baseline (zero-shot) | held-out (368) | 53.8% | 53.5% | 39.9% | 39.9% |
| Qwen 7B LoRA r=8 | eval (145) | 100.0% | 100.0% | 100.0% | 100.0% |
| Qwen 7B LoRA r=8 | held-out (368) | 98.9% | 98.9% | 98.9% | 98.9% |
The held-out set uses 3 collection schemas the model never saw during training, testing real generalization to unseen domains.
Synthetic Data Generation
The training data is fully synthetic — no human labeling required. Here's the pipeline:
```mermaid
graph LR
A["19 Schemas\n8 domains"] --> B["10 Intent Generators\nfilters, aggregations,\ntime ranges, top-N..."]
B --> C["~235 Base Examples"]
C --> D["4 Augmentation Strategies"]
D --> E["~1,800 Training Examples"]
```
- 19 hand-crafted schemas across 8 domains (e-commerce, healthcare, IoT, HR, finance, logistics, social, education)
- 10 intent generators produce (natural language, MongoDB query) pairs: filters, aggregations, time ranges, top-N, counts, etc.
- 4 augmentation strategies multiply ~235 base examples to ~1,800:
  - Field name shuffling (`salary` → `compensation`) to prevent memorization
  - Negative examples with hallucinated fields, teaching the model to reject bad queries
  - Date variation with random concrete dates
  - Operator subset removal, forcing the model to read constraints
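The field-shuffling augmentation can be sketched as follows: rename a field consistently across the schema, the question, and the gold query, so the model cannot succeed by memorizing field names. The example structure and helper name are hypothetical.

```python
import copy
import json

def rename_field(example, old, new):
    """Rename `old` to `new` everywhere in one training example.

    Serializing the query and doing a string replace also updates
    "$field" references inside pipelines. A sketch only: a production
    version would guard against substring collisions.
    """
    ex = copy.deepcopy(example)
    ex["schema"]["fields"] = [new if f == old else f for f in ex["schema"]["fields"]]
    ex["question"] = ex["question"].replace(old, new)
    ex["query"] = json.loads(json.dumps(ex["query"]).replace(old, new))
    return ex

base = {
    "schema": {"collection": "employees", "fields": ["name", "salary"]},
    "question": "Average salary by name",
    "query": {
        "type": "aggregate",
        "pipeline": [{"$group": {"_id": "$name", "avg": {"$avg": "$salary"}}}],
    },
}
aug = rename_field(base, "salary", "compensation")
```

Because the rename is applied to all three parts at once, every augmented example stays internally consistent.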
4-Layer Evaluation
Every generated query passes through four validation layers:
- Syntax — valid JSON, correct structure, well-formed `find` or `aggregate`
- Operators — only allowed operators used, no unsafe ops (`$where`, `$merge`, `$out`)
- Fields — every field reference exists in the schema (catches hallucination)
- Generalization — compares pass rates between training and held-out schemas to detect overfitting
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
| Quantization | 4-bit NF4 + double quantization |
| LoRA rank | 8 (alpha=16) |
| Training time | ~20 minutes (RTX 5090) |
| Training VRAM | ~5 GB (4-bit model + LoRA gradients) |
| Inference VRAM | ~5 GB |
| Dataset | ~1,300 synthetic examples |
| Adapter size | ~20 MB |
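The quantization and LoRA settings from the table map onto a standard `transformers` + `peft` configuration. Rank, alpha, and the NF4/double-quantization settings come from the table above; the target modules, dropout, and compute dtype are assumptions filled in with common defaults.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 with double quantization, per the training table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption
)

# LoRA rank 8, alpha 16 as reported; target modules and dropout are assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

With the base weights frozen in 4-bit and only the rank-8 adapters trained, the whole run fits in the ~5 GB VRAM budget listed above.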
Generalization Example
The model never saw an HR schema during training. Give it an `employees` collection with `name`, `department`, `salary`, and `hire_date`, then ask "Show total salary by department sorted by highest first":
```json
{
  "type": "aggregate",
  "pipeline": [
    {"$group": {"_id": "$department", "total": {"$sum": "$salary"}}},
    {"$sort": {"total": -1}}
  ]
}
```
It reads the field types, recognizes that `department` is a category to group by and `salary` is a measure to sum, and builds the correct pipeline, in ~1 second.
Limitations
- Single-collection queries only (no `$lookup` joins)
- No write operations (`update`, `delete`, `insert`)
- Field descriptions must be short (2-5 words) to match training prompt length
- Training still requires a GPU