Text-to-MongoDB: Fine-Tuning a 7B Model to Generate Queries from Natural Language

I built a system that takes a plain English question and a collection schema, and produces a valid MongoDB find or aggregate query. It runs locally on a single GPU with ~1 second latency.

Try it on Google Colab | Model on HuggingFace | GitHub

The Problem

LLMs can generate MongoDB queries, but they hallucinate fields, misuse operators, and require expensive API calls. I wanted something that runs locally, respects the schema, and is actually reliable.

How It Works

graph TD
    A["🗣️ User: Find orders over $100 shipped last month"] --> B
    B["📋 Schema: orders collection\nFields: order_id, total, status, created_at\nAllowed ops: $match, $gte..."] --> C
    C["🧠 Qwen 7B + LoRA adapter\n4-bit quantized, ~5GB VRAM"] --> D
    D["📤 MongoDB Query"]

Output:

{
  "type": "find",
  "filter": {
    "total": {"$gt": 100},
    "status": "shipped",
    "created_at": {"$gte": {"$date": "2025-01-01"}}
  }
}

The model receives the collection schema (field names, types, roles, enum values) and a list of allowed operators, then composes a query that respects both. The key insight is that it learns to read the schema, not memorize query patterns.
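
The schema-plus-operators prompt might look like the sketch below; the exact template the model was trained on is not shown here, so the layout and function name are assumptions:

```python
# Hypothetical prompt builder: serializes the collection schema and the
# allowed-operator list into the context the model conditions on.
def build_prompt(question: str, schema: dict, allowed_ops: list[str]) -> str:
    field_lines = "\n".join(
        f"- {name}: {spec['type']}"
        + (f" (role: {spec['role']})" if "role" in spec else "")
        for name, spec in schema["fields"].items()
    )
    return (
        f"Collection: {schema['collection']}\n"
        f"Fields:\n{field_lines}\n"
        f"Allowed operators: {', '.join(allowed_ops)}\n"
        f"Question: {question}\n"
        'Return a JSON object with "type" and "filter" or "pipeline".'
    )
```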

Results

I fine-tuned Qwen2.5-Coder-7B-Instruct using 4-bit QLoRA on ~1,300 synthetic examples, the training split of the ~1,800 generated examples (the rest form the eval and held-out sets). No human-labeled data: the entire dataset is generated programmatically.

| Model | Split | Syntax | Operators | Fields | Overall |
|---|---|---|---|---|---|
| Qwen 7B baseline (zero-shot) | eval (145) | 51.0% | 51.0% | 40.0% | 40.0% |
| Qwen 7B baseline (zero-shot) | held-out (368) | 53.8% | 53.5% | 39.9% | 39.9% |
| Qwen 7B LoRA r=8 | eval (145) | 100.0% | 100.0% | 100.0% | 100.0% |
| Qwen 7B LoRA r=8 | held-out (368) | 98.9% | 98.9% | 98.9% | 98.9% |

The held-out set uses 3 collection schemas the model never saw during training, testing real generalization to unseen domains.
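
That training-vs-held-out comparison reduces to a pass-rate gap check; a minimal sketch (function names and the 5-point threshold are mine, not from the repo):

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of queries that pass all validation layers."""
    return sum(results) / len(results)

def overfit_warning(eval_results: list[bool],
                    held_out_results: list[bool],
                    max_gap: float = 0.05) -> bool:
    """True if the eval/held-out gap suggests memorization (threshold assumed)."""
    return pass_rate(eval_results) - pass_rate(held_out_results) > max_gap
```

With the numbers above, the LoRA model's gap is about 1 point (100.0% vs 98.9%), well under any reasonable overfitting threshold.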

Synthetic Data Generation

The training data is fully synthetic — no human labeling required. Here's the pipeline:

graph LR
    A["19 Schemas\n8 domains"] --> B["10 Intent Generators\nfilters, aggregations,\ntime ranges, top-N..."]
    B --> C["~235 Base Examples"]
    C --> D["4 Augmentation Strategies"]
    D --> E["~1,800 Training Examples"]

  • 19 hand-crafted schemas across 8 domains (e-commerce, healthcare, IoT, HR, finance, logistics, social, education)
  • 10 intent generators produce (natural language, MongoDB query) pairs: filters, aggregations, time ranges, top-N, counts, etc.
  • 4 augmentation strategies multiply the ~235 base examples to ~1,800:
      • Field name shuffling (salary → compensation) to prevent memorization
      • Negative examples with hallucinated fields, teaching the model to reject bad queries
      • Date variation with random concrete dates
      • Operator subset removal, forcing the model to read constraints
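
The first strategy, field-name shuffling, can be sketched as a rename applied consistently across the schema, the question, and the gold query. The real pipeline presumably renames structurally; this string-level version just shows the idea (field names here are illustrative):

```python
import json

def rename_field(example: dict, old: str, new: str) -> dict:
    # Blunt string-level rename over the serialized example, so the schema
    # keys, the "$field" references in the query, and mentions in the
    # natural-language question all stay consistent with each other.
    return json.loads(json.dumps(example).replace(old, new))
```

Renaming salary to compensation everywhere means the model cannot succeed by memorizing which field names appeared in training; it has to read the schema in the prompt.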

4-Layer Evaluation

Every generated query passes through four validation layers:

  • Syntax — valid JSON, correct structure, well-formed find or aggregate
  • Operators — only allowed operators used, no unsafe ops ($where, $merge, $out)
  • Fields — every field reference exists in the schema (catches hallucination)
  • Generalization — compares pass rates between training and held-out schemas to detect overfitting
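
The field check (layer 3) reduces to walking the query and comparing every referenced field against the schema. A minimal sketch for find filters; the repo's actual validator presumably also covers aggregate pipelines and operators like $expr:

```python
def referenced_fields(filter_doc: dict) -> set[str]:
    """Collect field names referenced by a find filter (minimal sketch:
    handles plain field keys and list-valued logical ops like $and/$or)."""
    fields = set()
    for key, value in filter_doc.items():
        if key.startswith("$"):            # e.g. $or: [subfilter, subfilter]
            for sub in value:
                fields |= referenced_fields(sub)
        else:
            fields.add(key)
    return fields

def hallucinated_fields(filter_doc: dict, schema_fields: set[str]) -> list[str]:
    """Return referenced fields missing from the schema (empty = pass)."""
    return sorted(referenced_fields(filter_doc) - schema_fields)
```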

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
| Quantization | 4-bit NF4 + double quantization |
| LoRA rank | 8 (alpha=16) |
| Training time | ~20 minutes (RTX 5090) |
| Training VRAM | ~5 GB (4-bit model + LoRA gradients) |
| Inference VRAM | ~5 GB |
| Dataset | ~1,300 synthetic examples |
| Adapter size | ~20 MB |
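
These settings map onto standard transformers/peft configuration. A hedged sketch of the setup: the quantization and rank/alpha values come from the table, while the target modules, dropout, and compute dtype are assumptions not stated there:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 with double quantization, as listed in the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
)

# LoRA rank 8, alpha 16 from the table; target modules assumed.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```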

Generalization Example

The model has never seen an HR schema. Give it an employees collection with name, department, salary, and hire_date, and ask "Show total salary by department sorted by highest first":

{
  "type": "aggregate",
  "pipeline": [
    {"$group": {"_id": "$department", "total": {"$sum": "$salary"}}},
    {"$sort": {"total": -1}}
  ]
}

It reads the field types, recognizes that department is a category for grouping and salary is a measure for summing, and builds the correct pipeline, all in ~1 second.
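
Running the generated query against a live database is a thin dispatch step. A minimal sketch assuming the JSON output shape shown in this post and standard pymongo collection methods:

```python
import json

def to_pymongo_call(raw: str):
    """Map the model's JSON output to a (pymongo method, argument) pair.

    Expects {"type": "find", "filter": {...}} or
    {"type": "aggregate", "pipeline": [...]}, the shape used in this post.
    """
    query = json.loads(raw)
    if query.get("type") == "find":
        return "find", query["filter"]
    if query.get("type") == "aggregate":
        return "aggregate", query["pipeline"]
    raise ValueError(f"unsupported query type: {query.get('type')}")
```

Usage against a collection would then be `method, arg = to_pymongo_call(raw)` followed by `cursor = getattr(db.employees, method)(arg)`.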

Limitations

  • Single-collection queries only (no $lookup joins)
  • No write operations (update, delete, insert)
  • Field descriptions must be short (2-5 words) to match training prompt length
  • Training still requires a GPU

Links

← Back to Home