LLM

Is Improving the Text-to-SQL Agent Alone Enough?

harheem

29 Apr 2026 • 18 min read

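Agents that take a natural-language question, generate SQL, run it, and summarize the result are a familiar pattern by now.

For a long time, whenever I thought about building a data agent, text-to-SQL was almost always the first thing that came to mind: take the question, find the relevant schema, pick tables and columns, write the SQL, and explain the execution result in natural language.

The flow looks roughly like this.

User question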
→ schema retrieval
→ table / column selection
→ SQL generation
→ SQL execution
→ result summarization

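This approach is intuitive and powerful. As long as the data exists, it can attempt almost any question. But once you connect it to real business data, the pain points show up quickly.

SQL that executes is not necessarily SQL that gives the right answer.

This is especially true for business metrics such as “revenue”, “active customers”, “conversion rate”, “net revenue”, and “retention”. The problem is not SQL syntax; it is what the metric means inside the organization.

The real difficulty of text-to-SQL is not SQL syntax

Suppose a user asks this.

Show me net revenue by region for last quarter.

A text-to-SQL agent has to generate SQL roughly like this.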

SELECT
  c.region,
  SUM(o.amount - o.refund_amount - o.tax_amount) AS net_revenue
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE o.status = 'paid'
  AND o.order_date >= ...
GROUP BY c.region;

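It looks plausible on the surface. But a lot of decisions are hidden in it.

Is amount gross revenue or net revenue?
Should refunds be subtracted?
Should taxes be excluded?
Should the period be based on order_date or payment_date?
Is the customer's region the address at the time of the order or the current address?
Could the join cause rows to be counted more than once?

This is not a SQL generation problem. It is a problem of interpreting business meaning.

If the LLM has to infer these decisions from the raw schema every time, you end up with SQL that runs but returns the wrong number. That is the most dangerous case. SQL that throws an error can be fixed, but a plausible-looking number may be used without anyone questioning it.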
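To make the "runs but returns the wrong number" case concrete, here is a minimal sketch of join fan-out using Python's built-in sqlite3 module. The customer_addresses table and all the data are hypothetical and exist only for this illustration; the point is that the joined query executes without any error and still inflates the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customer_addresses (customer_id INTEGER, region TEXT);

INSERT INTO orders VALUES (1, 100, 50.0), (2, 100, 30.0), (3, 200, 20.0);
-- Hypothetical data: customer 100 has two address rows (old and current).
INSERT INTO customer_addresses VALUES (100, 'APAC'), (100, 'APAC'), (200, 'EMEA');
""")

# Ground truth: total order amount without any join.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Plausible-looking query: each order of customer 100 matches two address
# rows, so its amount is summed twice.
joined_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN customer_addresses a ON o.customer_id = a.customer_id
""").fetchone()[0]

print(true_total)    # 100.0
print(joined_total)  # 180.0
```

Nothing in the second query looks wrong at a glance, which is exactly why this kind of failure tends to go unnoticed.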

The BIRD benchmark illustrates this well. BIRD is a large-scale benchmark with 95 databases, 12,751 text-to-SQL pairs, and 37 professional domains, and the paper emphasizes realistic challenges such as dirty database contents, external knowledge, and SQL efficiency. At the time of publication, ChatGPT reached 40.08% execution accuracy, far below the human result of 92.96%.

Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs

Text-to-SQL parsing, which aims at converting natural language instructions into executable SQLs, has gained increasing attention in recent years. In particular, Codex and ChatGPT have shown impressive results in this task. However, most of the prevalent benchmarks, i.e., Spider, and WikiSQL, focus on database schema with few rows of database contents leaving the gap between academic study and real-world applications. To mitigate this gap, we present Bird, a big benchmark for large-scale database grounded in text-to-SQL tasks, containing 12,751 pairs of text-to-SQL data and 95 databases with a total size of 33.4 GB, spanning 37 professional domains. Our emphasis on database values highlights the new challenges of dirty database contents, external knowledge between NL questions and database contents, and SQL efficiency, particularly in the context of massive databases. To solve these problems, text-to-SQL models must feature database value comprehension in addition to semantic parsing. The experimental results demonstrate the significance of database values in generating accurate text-to-SQLs for big databases. Furthermore, even the most effective text-to-SQL models, i.e. ChatGPT, only achieves 40.08% in execution accuracy, which is still far from the human result of 92.96%, proving that challenges still stand. Besides, we also provide an efficiency analysis to offer insights into generating text-to-efficient-SQLs that are beneficial to industries. We believe that BIRD will contribute to advancing real-world applications of text-to-SQL research. The leaderboard and source code are available: https://bird-bench.github.io/.

arXiv.org • Jinyang Li

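Spider 2.0 addresses the difficulty of enterprise text-to-SQL more directly. It comprises 632 real-world workflow problems, set in environments such as BigQuery and Snowflake, with databases that often have more than 1,000 columns, and solving them frequently requires searching metadata, dialect documentation, and project-level codebases. The authors report that a code agent based on o1-preview solved only 21.3% of the tasks.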

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that based on o1-preview, our code agent framework successfully solves only 21.3% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation -- especially in prior text-to-SQL benchmarks -- they require significant improvement in order to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents crucial steps towards developing intelligent, autonomous, code agents for real-world enterprise settings. Our code, baseline models, and data are available at https://spider2-sql.github.io

arXiv.org • Fangyu Lei

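The takeaway is simple.

LLMs keep getting better at writing SQL, but in enterprise analytics the ability to write SQL is not enough on its own.

dbt Semantic Layer approaches the problem from a different direction

dbt Semantic Layer is not a technique for making text-to-SQL better.
If anything, it is closer to the opposite: the LLM does not write SQL directly.

Instead, business meaning such as metrics, dimensions, and entities is defined up front in the dbt project, and the LLM chooses which metrics and dimensions to use. The actual SQL generation is handled by MetricFlow.

According to the dbt documentation, MetricFlow is a SQL query generation tool designed to streamline metric creation across different data dimensions, and it builds a semantic graph from YAML files. That graph is made up of semantic models and metrics.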

About MetricFlow | dbt Developer Hub

Learn more about MetricFlow and its key concepts

dbt Labs

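In simplified form, the structure looks like this.

User question
→ look up available metrics
→ look up dimensions / entities
→ choose metric + dimension + filter
→ run the Semantic Layer query
→ summarize the result

Take the same user question as before.

Show me net revenue by region for last quarter.

With text-to-SQL, the LLM has to produce the entire SQL statement.
With the Semantic Layer, the LLM produces a request like this.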

{
  "metrics": ["net_revenue"],
  "group_by": ["customer__region", "metric_time__quarter"],
  "where": "metric_time = previous_quarter"
}

MetricFlow then generates the SQL from the defined metrics, dimensions, and entity relationships.
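As a rough illustration of what "SQL generated from definitions" means, the toy compiler below renders a query from hand-written dictionaries. It is not MetricFlow's implementation or API: the METRICS and DIMENSIONS dictionaries, the compile_query function, and the table aliases are all assumptions made for this sketch, and the time filter is left out. What it shows is the property that matters here, namely that the same request always yields the same SQL and that undefined names fail loudly.

```python
# Hypothetical definitions written by hand for this sketch; a real dbt
# project declares metrics and semantic models in YAML.
METRICS = {
    "net_revenue": {
        "from": "orders o",
        "expr": "SUM(o.amount - o.refund_amount - o.tax_amount)",
        "filter": "o.status = 'paid'",
    }
}
DIMENSIONS = {
    "customer__region": {
        "column": "c.region",
        "join": "JOIN customers c ON o.customer_id = c.customer_id",
    }
}


def compile_query(metric: str, group_by: list[str]) -> str:
    """Render SQL from fixed definitions; unknown names raise KeyError."""
    m = METRICS[metric]
    dims = [DIMENSIONS[name] for name in group_by]
    columns = [d["column"] for d in dims]
    # De-duplicate join clauses while keeping their order.
    joins = list(dict.fromkeys(d["join"] for d in dims))
    lines = ["SELECT " + ", ".join(columns + [f"{m['expr']} AS {metric}"])]
    lines.append("FROM " + m["from"])
    lines.extend(joins)
    lines.append("WHERE " + m["filter"])
    if columns:
        lines.append("GROUP BY " + ", ".join(columns))
    return "\n".join(lines)


sql = compile_query("net_revenue", ["customer__region"])
print(sql)
```

The LLM never touches the refund, tax, or status logic in this setup; it only supplies the metric name and the dimension list.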

The dbt semantic model documentation likewise describes dimensions as the ways of looking at a metric, that is, as group-by parameters. Entities act as join keys within a semantic model, and MetricFlow uses them to perform joins.

Semantic models | dbt Developer Hub

Semantic models are YAML abstractions on top of a dbt model, connected via joining keys as edges

dbt Labs

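To summarize:

text-to-SQL:
natural language → generate the entire SQL

Semantic Layer:
natural language → choose metric / dimension / filter

The less the LLM has to do, the less room it has to be wrong.

Not “SQL generation” but “calling a verified metric API”

The easiest way to understand the Semantic Layer is to think of it as an API.

text-to-SQL tells the LLM this.

Look at this database and write the correct SQL yourself.

The Semantic Layer tells it this.

There is already a defined metric API, so pick the metric and dimensions that match the user's question and call it.

This is similar to function calling.

Having a model fill in arguments within a fixed tool schema is more reliable than letting it generate free-form text. The Semantic Layer works the same way. Instead of handing the LLM a high-freedom output like SQL, it has the LLM call a predefined metric interface.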
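The function-calling analogy can be sketched as a small validation step: the LLM's output is treated as arguments to a fixed schema and is rejected if it names anything outside the catalog. The CATALOG contents, the MetricRequest type, and the UnsupportedRequest exception are hypothetical names for this sketch, not part of any dbt API.

```python
import json
from dataclasses import dataclass

# Hypothetical catalog: each metric maps to the dimensions it supports.
CATALOG = {
    "net_revenue": {"customer__region", "metric_time__quarter"},
}


class UnsupportedRequest(ValueError):
    """Raised when the request falls outside the modeled metrics/dimensions."""


@dataclass(frozen=True)
class MetricRequest:
    metrics: tuple[str, ...]
    group_by: tuple[str, ...]


def parse_request(raw: str) -> MetricRequest:
    """Parse the LLM's JSON output and check it against the catalog."""
    payload = json.loads(raw)
    metrics = tuple(payload.get("metrics", []))
    group_by = tuple(payload.get("group_by", []))
    if not metrics:
        raise UnsupportedRequest("at least one metric is required")
    for metric in metrics:
        if metric not in CATALOG:
            raise UnsupportedRequest(f"unknown metric: {metric}")
        unknown = set(group_by) - CATALOG[metric]
        if unknown:
            raise UnsupportedRequest(
                f"{metric} cannot be grouped by {sorted(unknown)}"
            )
    return MetricRequest(metrics, group_by)


ok = parse_request(
    '{"metrics": ["net_revenue"], "group_by": ["customer__region"]}'
)
print(ok)
```

A request that groups net_revenue by an unmodeled dimension such as refund_reason raises UnsupportedRequest instead of producing a number, which is the explicit failure mode this approach is after.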

For example, suppose net_revenue is already defined like this.

metrics:
  - name: net_revenue
    description: "Paid orders only, excluding refunds and taxes"
    type: simple
    type_params:
      measure: net_revenue_amount

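In this case the LLM does not have to work out what “net revenue” means every time. It only has to decide whether the user's question maps to the net_revenue metric.

This is where the accuracy gap comes from.
It is not a matter of picking a model that writes better SQL; it is a matter of narrowing what the model has to infer.

What did the actual benchmark show?

In 2026, dbt published a benchmark update comparing the Semantic Layer with text-to-SQL. The experiment ran 11 questions based on the ACME Insurance benchmark repeatedly across several models and compared text-to-SQL, a minimal Semantic Layer, a modeled Semantic Layer, and text-to-SQL on top of modeled data. dbt also stated the caveat that the full schema was supplied as context for text-to-SQL.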

Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update | dbt Developer Blog

With 2026’s best models, the dbt Semantic Layer hits near-100% accuracy for covered queries. Here’s what changed and what didn’t in our updated benchmark.

dbt Labs • Jason Ganz

The main results dbt published are as follows.

Model	Approach	Accuracy
Claude Sonnet 4.6	Text-to-SQL	90.0%
Claude Sonnet 4.6	Semantic Layer	98.2%
GPT-5.3 Codex	Text-to-SQL	84.1%
GPT-5.3 Codex	Semantic Layer	100.0%

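dbt explains that for questions within the Semantic Layer's scope, accuracy was at or near 100%, and that deterministic query generation reduces the chance of the LLM producing subtly wrong results. It also points out that text-to-SQL is flexible but can produce wrong joins, misread what a column means, or generate queries that run but return incorrect results.

Of course, these results should not be generalized as they stand. This is a vendor benchmark run by dbt itself.

Still, the conclusion is quite realistic.

For KPI questions where accuracy matters, use the Semantic Layer; for ad hoc analysis, use text-to-SQL. That split is the safer one.

dbt recommends the same direction. For accuracy-critical use cases such as board data, auditors, OKRs, KPIs, and weekly reports, connect the Semantic Layer; for one-off questions and prototyping, check the Semantic Layer first and fall back to text-to-SQL if needed.

The Knowledge Graph benchmark sends a similar message

This perspective is not unique to dbt.

In the Knowledge Graph benchmark by researchers at data.world, GPT-4 zero-shot text-to-SQL against an enterprise SQL database reached about 16% accuracy, and the authors report that this rose to about 54% when the same database was queried through a Knowledge Graph representation.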

https://data.world/mstatic/assets/pdf/kg_llm_accuracy_benchmark_11132023_public.pdf

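This experiment did not test dbt Semantic Layer itself.
But the key message is the same.

Providing an intermediate layer in which business semantics are structured, rather than the raw schema, can improve the accuracy of LLM-based querying.

In other words, the question is not only “which LLM should we use.”
What context and interface you give the LLM matters just as much.

The Semantic Layer is not always the answer

The Semantic Layer is powerful, but it is not a cure-all. It can only answer within the metrics and dimensions that have been modeled.

For example, if net_revenue is defined but a breakdown by refund_reason is not in the semantic model, the Semantic Layer alone may not be able to answer. Likewise, if the required entity relationship is not defined, the query may fail.

But the way it fails matters.

A text-to-SQL failure often looks like this.

SQL executed successfully
Result returned successfully
But the number is wrong

A Semantic Layer failure usually looks like this.

This metric / dimension combination is not supported

In practice, the latter is safer. Especially for KPIs, audit materials, executive reporting, and OKR dashboards, where a wrong number feeds directly into organizational decisions, a system that says “I don't know” is better than a system that states a wrong number convincingly.

So should we abandon text-to-SQL?

No.

If anything, I think a hybrid structure is the realistic choice.

User question
  ↓
Can the Semantic Layer answer it?
  ├─ Yes → governed metric query
  └─ No → text-to-SQL fallback

More concretely, it can be designed like this.

1. Is it a KPI / metric / reporting question?
   → Use the Semantic Layer first

2. Is there a relevant metric but a missing dimension?
   → Record it as a candidate for improving the semantic model

3. Is it fully ad hoc analysis?
   → Fall back to text-to-SQL against mart models

4. Does the fallback question keep recurring?
   → Promote it to a new dbt model or metric

The point of this structure is not to eliminate text-to-SQL. It is to separate the area where text-to-SQL fits best from the area where the Semantic Layer fits best.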
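The routing steps above can be sketched in a few lines. This is a minimal sketch under assumptions: the set of modeled metrics, the idea of a normalized "intent" label for each question, and the promotion threshold of 3 are all made up for illustration, and the step that matches a question to a metric is assumed to happen elsewhere.

```python
from collections import Counter
from typing import Optional

# Hypothetical set of metrics already modeled in the Semantic Layer.
MODELED_METRICS = {"net_revenue", "active_customers"}

# Assumed threshold: how many fallbacks before a question is worth modeling.
PROMOTION_THRESHOLD = 3

fallback_counts = Counter()


def route(intent: str, metric: Optional[str]) -> str:
    """Pick a path for one question.

    intent: a normalized label for the question (assumed to come from
            an upstream classification step).
    metric: the metric the question was matched to, or None if no match.
    """
    if metric in MODELED_METRICS:
        return "semantic_layer"
    # No governed metric covers this question: fall back and record it.
    fallback_counts[intent] += 1
    return "text_to_sql"


def promotion_candidates() -> list[str]:
    """Intents that fell back often enough to justify a new metric or model."""
    return sorted(
        intent
        for intent, count in fallback_counts.items()
        if count >= PROMOTION_THRESHOLD
    )


print(route("net_revenue_by_region", "net_revenue"))
for _ in range(3):
    route("refund_reason_breakdown", None)
print(promotion_candidates())
```

Keeping the fallback log is what turns the hybrid setup into a feedback loop: recurring text-to-SQL questions become the backlog for the semantic model.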

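As I see it, the division of roles looks like this.

Question type	Better-suited approach
“Show me revenue by region for last month”	Semantic Layer
“Show me the trend in active customers this quarter”	Semantic Layer
“Compare conversion rate across A/B test groups”	Semantic Layer, provided the metric is defined
“Explore what data is in this table”	text-to-SQL
“Find unusual patterns in recently created raw events”	text-to-SQL
“Ad hoc questions that are asked repeatedly”	text-to-SQL at first, then promoted to a metric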

The goal of a data agent needs to be redefined

When building a text-to-SQL agent, you naturally end up focusing on problems like these.

How to do schema retrieval well
How to improve the SQL generation prompt
How to repair execution errors
How to match the SQL dialect
How to find the join path

These problems still matter. But for a production analytics agent, setting the goal only as “generating SQL well” falls short.

The goal is closer to this.

Ensuring users consistently get numbers they can trust.

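Seen this way, the Semantic Layer is not just a dbt feature. It is an interface that constrains the output space of an LLM-based data agent and connects metric governance to the agent.

In other words, the proposal is to change the data agent architecture like this.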

Before:
Natural Language → SQL → Result

After:
Natural Language
→ Governed Metric Query if possible
→ SQL fallback if necessary
→ Recurring fallback questions become modeled metrics

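If you adopt it, start small

You do not need to build a company-wide Semantic Layer from day one. Starting that way is actually more likely to fail.

To start small, something like the following is about right.

5 to 10 core metrics
30 to 50 real user questions
A text-to-SQL baseline
A Semantic Layer-first agent
A comparison on the same benchmark

When evaluating, do not look only at execution success, because SQL that executes is not necessarily correct.

The more important measures are these.

Evaluation item	Meaning
result accuracy	Does it match the gold result?
silent failure rate	Share of runs that executed but returned a wrong result
repeatability	Are results consistent when the same question is asked multiple times?
coverage	Share of questions the Semantic Layer can answer
refusal quality	Does it decline appropriately when it cannot answer?
traceability	Can you trace which metric definition was used?

In particular, be sure to look at the silent failure rate.

As text-to-SQL agents improve, they produce more plausible SQL. But when that SQL is wrong, it is also more plausibly wrong.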
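A few of these measures can be computed from a simple run log. The sketch below is one possible way to define them, not a standard: the Run record, the choice to divide by all runs, and the sample data are assumptions for illustration. Results are single numbers here to keep the comparison simple.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    question: str
    executed: bool            # the query ran without raising an error
    result: Optional[float]   # returned value, or None on error / refusal
    gold: float               # expected value from the gold result


def is_correct(run: Run) -> bool:
    return (
        run.executed
        and run.result is not None
        and math.isclose(run.result, run.gold)
    )


def result_accuracy(runs: list[Run]) -> float:
    """Share of runs whose result matches the gold result."""
    return sum(is_correct(r) for r in runs) / len(runs)


def silent_failure_rate(runs: list[Run]) -> float:
    """Share of runs that executed but returned a wrong result."""
    return sum(r.executed and not is_correct(r) for r in runs) / len(runs)


def repeatability(runs: list[Run]) -> float:
    """Share of questions whose repeated runs all returned the same result."""
    results_by_question = defaultdict(set)
    for r in runs:
        results_by_question[r.question].add(r.result)
    stable = sum(len(v) == 1 for v in results_by_question.values())
    return stable / len(results_by_question)


# Made-up sample log: q2 executed twice, once with an inflated number.
runs = [
    Run("q1", True, 100.0, 100.0),
    Run("q1", True, 100.0, 100.0),
    Run("q2", True, 180.0, 100.0),
    Run("q2", True, 100.0, 100.0),
    Run("q3", False, None, 50.0),
]

print(result_accuracy(runs))
print(silent_failure_rate(runs))
print(repeatability(runs))
```

Note that q3 counts against accuracy but not against the silent failure rate: an error or refusal is visible, whereas the wrong q2 run is not.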

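Conclusion

text-to-SQL will remain important. For ad hoc exploration and prototype analysis, it is still the most flexible approach.

But the strategy of solving every natural-language data query with text-to-SQL deserves a second look. Especially for KPI, metric, and reporting questions shared across the organization, having the LLM call a verified metric interface is safer than having it generate new SQL every time.

The core of dbt Semantic Layer is not making the LLM smarter; it is reducing the room the LLM has to be wrong.

So when I design a data agent going forward, I expect to consider the following structure before a text-to-SQL-only one.

Semantic Layer first
→ governed answer
→ text-to-SQL fallback
→ recurring questions promoted to metrics

If the goal of natural-language data querying is “automating SQL generation”, text-to-SQL alone may be enough. But if the goal is “providing trustworthy numbers”, the Semantic Layer looks like an important option.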