Evaluation API Reference

The evaluate() function allows you to benchmark Text-to-SQL model outputs against the LLMSQL gold queries and SQLite database. It prints metrics, logs mismatches, and saves detailed reports automatically.

Features

  • Evaluates model predictions from JSONL files or Python dicts.

  • Automatically downloads the benchmark questions and SQLite DB if they are missing.

  • Prints mismatch summaries and supports configurable reporting.

  • Saves a detailed JSON report with metrics, mismatches, timestamp, and input mode.

Usage Examples

Evaluate from a JSONL file:

from llmsql.evaluation.evaluate import evaluate

report = evaluate("path_to_outputs.jsonl")
print(report)

Evaluate from a list of Python dicts:

predictions = [
    {"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"},
    {"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"},
]

report = evaluate(predictions)
print(report)

Providing your own DB and questions (skips the automatic workdir download):

report = evaluate(
    "path_to_outputs.jsonl",
    questions_path="bench/questions.jsonl",
    db_path="bench/sqlite_tables.db",
    workdir_path=None
)

Function Arguments

  • outputs – Path to a JSONL file or a list of prediction dicts (required).

  • workdir_path – Directory for automatic benchmark downloads. Ignored if both questions_path and db_path are provided. Default: “llmsql_workdir”.

  • questions_path – Optional path to the benchmark questions JSONL file.

  • db_path – Optional path to the SQLite DB containing the evaluation tables.

  • save_report – Path for the detailed JSON report. Defaults to “evaluation_results_{uuid}.json”.

  • show_mismatches – Print mismatches while evaluating. Default: True.

  • max_mismatches – Maximum number of mismatches to display. Default: 5.
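
For illustration, a call that combines several of these options; the file names here are placeholders:

from llmsql.evaluation.evaluate import evaluate

report = evaluate(
    "path_to_outputs.jsonl",
    save_report="my_eval_report.json",   # write the JSON report to a fixed location
    show_mismatches=True,
    max_mismatches=10,                    # print up to 10 mismatches instead of the default 5
)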

Input Format

When supplied as a file, the predictions must be in JSONL format, with one JSON object per line:

{"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"}
{"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"}
{"question_id": "3", "predicted_sql": "SELECT * FROM Table WHERE active=1"}

Output Metrics

The function returns a dictionary with the following keys:

  • total – Total queries evaluated

  • matches – Queries where predicted SQL results match gold results

  • pred_none – Queries where the model returned NULL or no result

  • gold_none – Queries where the reference result was NULL or no result

  • sql_errors – Invalid SQL or execution errors

  • accuracy – Overall exact match accuracy

  • mismatches – List of mismatched queries with details

  • timestamp – Evaluation timestamp

  • input_mode – How results were provided (“jsonl_path” or “dict_list”)
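
A short sketch of inspecting the returned dictionary; the internal structure of each mismatch entry is not specified here, so entries are printed as-is:

from llmsql.evaluation.evaluate import evaluate

report = evaluate("path_to_outputs.jsonl")

print("Accuracy:", report["accuracy"])
print("Matches:", report["matches"], "of", report["total"])
print("SQL errors:", report["sql_errors"])

# Each entry in report["mismatches"] describes one query whose results differed.
for mismatch in report["mismatches"][:3]:
    print(mismatch)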

Report Saving

By default, a report is saved automatically as evaluation_results_{uuid}.json in the current directory. It contains metrics, mismatches, timestamp, and input mode. You can override this path using save_report.
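
A sketch of reloading a saved report with the standard json module; an explicit save_report path is passed so the file name is known ahead of time:

import json

from llmsql.evaluation.evaluate import evaluate

evaluate("path_to_outputs.jsonl", save_report="my_eval_report.json")

# The saved report mirrors the returned dict: metrics, mismatches, timestamp, and input mode.
with open("my_eval_report.json", encoding="utf-8") as f:
    saved = json.load(f)

print(saved["timestamp"], saved["input_mode"])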

LLMSQL Evaluation Module

Provides the evaluate() function to benchmark Text-to-SQL model outputs on the LLMSQL benchmark.

See the documentation for full usage details.

llmsql.evaluation.evaluate.evaluate(outputs: str | list[dict[int, str | int]], *, workdir_path: str | None = 'llmsql_workdir', questions_path: str | None = None, db_path: str | None = None, save_report: str | None = None, show_mismatches: bool = True, max_mismatches: int = 5) → dict

Evaluate predicted SQL queries against the LLMSQL benchmark.

Parameters:
  • outputs – Either a JSONL file path or a list of dicts.

  • workdir_path – Directory for auto-downloads (ignored if all paths provided).

  • questions_path – Manual path to benchmark questions JSONL.

  • db_path – Manual path to SQLite benchmark DB.

  • save_report – Optional manual save path. If None → auto-generated.

  • show_mismatches – Print mismatches while evaluating.

  • max_mismatches – Max mismatches to print.

Returns:

Metrics and mismatches.

Return type:

dict

💬 Made with ❤️ by the LLMSQL Team