Text-to-MongoDB: Fine-Tuning a 7B Model to Generate Queries from Natural Language

I built a system that takes a plain English question and a collection schema, and produces a valid MongoDB find or aggregate query. It runs locally on a single GPU with ~1 second latency.

Try it on Google Colab | Model on HuggingFace | GitHub

The Problem

LLMs can generate MongoDB queries, but they hallucinate fields, misuse operators, and require expensive API calls. I wanted something that runs locally, respects the schema, and is actually reliable.

How It Works

graph TD
    A["🗣️ User: Find orders over $100 shipped last month"] --> B
    B["📋 Schema: orders collection\nFields: order_id, total, status, created_at\nAllowed ops: $match, $gte..."] --> C
    C["🧠 Qwen 7B + LoRA adapter\n4-bit quantized, ~5GB VRAM"] --> D
    D["📤 MongoDB Query"]

Output:

{
  "type": "find",
  "filter": {
    "total": {"$gt": 100},
    "status": "shipped",
    "created_at": {"$gte": {"$date": "2025-01-01"}}
  }
}

The model receives the collection schema (field names, types, roles, enum values) and a list of allowed operators, then composes a query that respects both. The key insight is that it learns to read the schema, not memorize query patterns.
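
The schema-plus-operators prompt might look like the sketch below; the exact template the model was trained on is not shown here, so the layout and function name are assumptions:

```python
# Hypothetical prompt builder: serializes the collection schema and the
# allowed-operator list into the context the model conditions on.
def build_prompt(question: str, schema: dict, allowed_ops: list[str]) -> str:
    field_lines = "\n".join(
        f"- {name}: {spec['type']}"
        + (f" (role: {spec['role']})" if "role" in spec else "")
        for name, spec in schema["fields"].items()
    )
    return (
        f"Collection: {schema['collection']}\n"
        f"Fields:\n{field_lines}\n"
        f"Allowed operators: {', '.join(allowed_ops)}\n"
        f"Question: {question}\n"
        'Return a JSON object with "type" and "filter" or "pipeline".'
    )
```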

Results

I fine-tuned Qwen2.5-Coder-7B-Instruct using 4-bit QLoRA on ~1,300 synthetic examples, the training split of the ~1,800 generated examples (the rest form the eval and held-out sets). No human-labeled data: the entire dataset is generated programmatically.

| Model | Split | Syntax | Operators | Fields | Overall |
|---|---|---|---|---|---|
| Qwen 7B baseline (zero-shot) | eval (145) | 51.0% | 51.0% | 40.0% | 40.0% |
| Qwen 7B baseline (zero-shot) | held-out (368) | 53.8% | 53.5% | 39.9% | 39.9% |
| Qwen 7B LoRA r=8 | eval (145) | 100.0% | 100.0% | 100.0% | 100.0% |
| Qwen 7B LoRA r=8 | held-out (368) | 98.9% | 98.9% | 98.9% | 98.9% |

The held-out set uses 3 collection schemas the model never saw during training, testing real generalization to unseen domains.
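
That training-vs-held-out comparison reduces to a pass-rate gap check; a minimal sketch (function names and the 5-point threshold are mine, not from the repo):

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of queries that pass all validation layers."""
    return sum(results) / len(results)

def overfit_warning(eval_results: list[bool],
                    held_out_results: list[bool],
                    max_gap: float = 0.05) -> bool:
    """True if the eval/held-out gap suggests memorization (threshold assumed)."""
    return pass_rate(eval_results) - pass_rate(held_out_results) > max_gap
```

With the numbers above, the LoRA model's gap is about 1 point (100.0% vs 98.9%), well under any reasonable overfitting threshold.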

Synthetic Data Generation

The training data is fully synthetic — no human labeling required. Here's the pipeline:

graph LR
    A["19 Schemas\n8 domains"] --> B["10 Intent Generators\nfilters, aggregations,\ntime ranges, top-N..."]
    B --> C["~235 Base Examples"]
    C --> D["4 Augmentation Strategies"]
    D --> E["~1,800 Training Examples"]

  • 19 hand-crafted schemas across 8 domains (e-commerce, healthcare, IoT, HR, finance, logistics, social, education)
  • 10 intent generators produce (natural language, MongoDB query) pairs: filters, aggregations, time ranges, top-N, counts, etc.
  • 4 augmentation strategies multiply the ~235 base examples to ~1,800:
      • Field name shuffling (salary → compensation) to prevent memorization
      • Negative examples with hallucinated fields, teaching the model to reject bad queries
      • Date variation with random concrete dates
      • Operator subset removal, forcing the model to read constraints
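
The first strategy, field-name shuffling, can be sketched as a rename applied consistently across the schema, the question, and the gold query. The real pipeline presumably renames structurally; this string-level version just shows the idea (field names here are illustrative):

```python
import json

def rename_field(example: dict, old: str, new: str) -> dict:
    # Blunt string-level rename over the serialized example, so the schema
    # keys, the "$field" references in the query, and mentions in the
    # natural-language question all stay consistent with each other.
    return json.loads(json.dumps(example).replace(old, new))
```

Renaming salary to compensation everywhere means the model cannot succeed by memorizing which field names appeared in training; it has to read the schema in the prompt.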

4-Layer Evaluation

Every generated query passes through four validation layers:

  • Syntax — valid JSON, correct structure, well-formed find or aggregate
  • Operators — only allowed operators used, no unsafe ops ($where, $merge, $out)
  • Fields — every field reference exists in the schema (catches hallucination)
  • Generalization — compares pass rates between training and held-out schemas to detect overfitting
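
The field check (layer 3) reduces to walking the query and comparing every referenced field against the schema. A minimal sketch for find filters; the repo's actual validator presumably also covers aggregate pipelines and operators like $expr:

```python
def referenced_fields(filter_doc: dict) -> set[str]:
    """Collect field names referenced by a find filter (minimal sketch:
    handles plain field keys and list-valued logical ops like $and/$or)."""
    fields = set()
    for key, value in filter_doc.items():
        if key.startswith("$"):            # e.g. $or: [subfilter, subfilter]
            for sub in value:
                fields |= referenced_fields(sub)
        else:
            fields.add(key)
    return fields

def hallucinated_fields(filter_doc: dict, schema_fields: set[str]) -> list[str]:
    """Return referenced fields missing from the schema (empty = pass)."""
    return sorted(referenced_fields(filter_doc) - schema_fields)
```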

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
| Quantization | 4-bit NF4 + double quantization |
| LoRA rank | 8 (alpha=16) |
| Training time | ~20 minutes (RTX 5090) |
| Training VRAM | ~5 GB (4-bit model + LoRA gradients) |
| Inference VRAM | ~5 GB |
| Dataset | ~1,300 synthetic examples |
| Adapter size | ~20 MB |
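
These settings map onto standard transformers/peft configuration. A hedged sketch of the setup: the quantization and rank/alpha values come from the table, while the target modules, dropout, and compute dtype are assumptions not stated there:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 with double quantization, as listed in the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
)

# LoRA rank 8, alpha 16 from the table; target modules assumed.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```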

Generalization Example

The model has never seen an HR schema. Give it an employees collection with name, department, salary, and hire_date, and ask "Show total salary by department sorted by highest first":

{
  "type": "aggregate",
  "pipeline": [
    {"$group": {"_id": "$department", "total": {"$sum": "$salary"}}},
    {"$sort": {"total": -1}}
  ]
}

It reads the field types, recognizes that department is a category for grouping and salary is a measure for summing, and builds the correct pipeline, all in ~1 second.
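
Running the generated query against a live database is a thin dispatch step. A minimal sketch assuming the JSON output shape shown in this post and standard pymongo collection methods:

```python
import json

def to_pymongo_call(raw: str):
    """Map the model's JSON output to a (pymongo method, argument) pair.

    Expects {"type": "find", "filter": {...}} or
    {"type": "aggregate", "pipeline": [...]}, the shape used in this post.
    """
    query = json.loads(raw)
    if query.get("type") == "find":
        return "find", query["filter"]
    if query.get("type") == "aggregate":
        return "aggregate", query["pipeline"]
    raise ValueError(f"unsupported query type: {query.get('type')}")
```

Usage against a collection would then be `method, arg = to_pymongo_call(raw)` followed by `cursor = getattr(db.employees, method)(arg)`.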

Limitations

  • Single-collection queries only (no $lookup joins)
  • No write operations (update, delete, insert)
  • Field descriptions must be short (2-5 words) to match training prompt length
  • Training still requires a GPU

Links

← Back to Home