2025/12/06

Engineering The Semantic Layer: Principles for Data at Scale

Author:: Sathiya Sarathi Gunasekaran

, 2025/12/06

Engineering The Semantic Layer: Principles for Data at Scale

This post is for Day 6 of Mercari Advent Calendar 2025, brought to you by sathiya from the Mercari JB Data team.

We are in an era of Data Intelligence, where we have efficient analytical datastores and advanced AI tooling, yet we are challenged in answering questions like "What was our revenue yesterday?" or "Why do different dashboards show different numbers?"

This article explores why these inconsistencies happen, how they slow down analytic operations and machine learning, and why a semantic layer is required as an essential infrastructure for data at scale.

A Fragmented Workflow

For years, most data organizations have relied on a familiar pattern:

Raw data lands in the warehouse, often including unrealistic values (e.g., placeholder prices like ¥9,999,999, draft item descriptions, or test records).
Analysts build derived tables and views to make that data usable for reporting.
Dashboards are created, limiting business users to predefined reports.
Data scientists work outside the BI layer, writing custom SQL and performing their own data cleaning before modeling.

This structure creates redundant work, inconsistent definitions, and a tangled set of transformations across teams. While these issues are usually discussed from an analytics perspective, data scientists feel the same pain – clean, reliable data matters just as much in machine learning as in BI, because poor inputs degrade model performance.

The lack of an unified semantic understanding is an issue – a "co-habitant inter-dependence” where every team relies on data but interprets it differently. When meaning is not defined consistently across the tools, the organization becomes overly dependent on their analysts and engineers to reconcile definitions, answer basic questions and reconcile conflicting dashboards.

The Challenges of a SQL-First Exploration

SQL is powerful, but relying on it as the primary interface for organization-wide data exploration introduces significant barriers:

Challenge	Description
Inconsistency	Metrics (e.g., revenue, active users), filters, and time grains are defined differently by every user.
Duplicacy and Fragility	Repetitive code (filters, joins, Common Table Expressions (CTEs) A.K.A the ‘WITH’ clause) is copy-pasted across projects, becoming difficult to govern and manage.
Data Model Knowledge Required	Users must have a thorough understanding of underlying schemas to write any meaningful, correct query.
Lack of Business Meaning	SQL focuses on how to grab data, not what that data means in a business context.

Exploration should not be gated by SQL ability. Data meaning must be standardized and accessible to all.

Semantics Layer, Simplified

A Semantic Layer is a new architectural layer that bridges the gap between raw data and end-users, serving as the single source of truth for business definitions.

Humans naturally attach meaning to symbols. When we see the word ELEPHANT, our brain instantly recalls a rich mental model – a large, majestic animal with tusks and a trunk. Data systems don’t have this intuition. They only interpret raw column names, codes, and table identifiers unless we explicitly define what they mean. This is where a Semantic Layer comes in to provide interpretation in between.

It transforms technical schemas into clear, human-understandable business concepts – for example, a table represented as tbl_usr_oo1 internally can translate to the table USERS, the field ord_amt corresponds to the Order Amount.

This translation step ensures that machines interpret data correctly, and more importantly, humans interpret it consistently across dashboards, teams, and tools. But semantics go beyond simple renaming.

The modern Data Scape ft. the Semantic layer

A robust semantic layer embeds business rules, calculations, and governance so that the organization operates from a single shared understanding of its data. At its core, it is:

An abstraction layer over raw data
It hides the complexity of schemas, joins, and column names, exposing clean, human-readable concepts.
A single repository of truth for business logic
Every metric, filter, exclusion, and rule is defined once and reused everywhere, eliminating inconsistency.
An interpreter that exposes universal meaning to every downstream tool BI dashboards, notebooks, ML systems, and applications all consume the same definitions – no duplication, no drift.
A universal API for metrics, relationships, and concepts Tools don’t need to know how to calculate Revenue or Lifetime Value—they simply request the metric, and the semantic layer guarantees correctness.

Raw Data vs. a Semantic Layer

The following table highlights how a semantic layer fundamentally changes the way organizations work with data.

Feature	Raw Data	Semantic Layer
Nature	Technical, schema-driven, often difficult to interpret.	Logical, curated, aligned to business concepts.
Language	Tables and columns (tbl_usr_001, ord_amt).	Friendly terms (“User”, “Order Amount”).
Business Logic	Typically missing or recreated repeatedly (e.g., manually excluding cancelled orders).	Logic is embedded once (e.g., “Revenue” always excludes cancelled orders).
User Experience	Requires SQL and schema knowledge.	Drag-and-drop or natural language; no SQL required.
Risk	High chance of inconsistent metrics.	Single Source of Truth – consistent definitions everywhere.

The “What” vs. the “How”: Business Policies Encoded in the Semantic Layer

Business stakeholders define what should be counted, excluded, or labeled. The semantic layer defines how those rules are technically executed.

Aspect	Business Policy (The What)	Semantic Layer Implementation (The How)
Data Validity	“A sale counts only if the payment has been settled.”	Apply predicates such as `WHERE payment_status = 'settled'`.
Exclusions	“Exclude test orders and employee purchases.”	Hard-coded exclusion logic like `WHERE is_test_flag = 0 AND email NOT LIKE '%@company.com'`.
Calculations	“Profit = Revenue – Cost of Goods Sold – Shipping.”	Provide a reusable metric such as profit: `measure: revenue { type: sum sql: ${order_price} ;; } measure: cogs { type: sum sql: ${cost_of_goods_sold} ;; } measure: shipping { type: sum sql: ${shipping_cost} ;; } measure: profit { type: number sql: ${revenue} - ${cogs} - ${shipping} ;; description: "Profit = Revenue - COGS - Shipping (pre-tax)" }`
Null Handling	“If the region is unknown, label it ‘Unassigned’.”	Use `COALESCE(region_name, 'Unassigned')`.
Time Standards	“Our fiscal year starts on April 1.”	Provide a reusable fiscal calendar so all consumers use the same fiscal logic (assuming a fiscal year starts in April) `dimensions: date_year: type: number sql: YEAR(${TABLE}.order_date) ;; date_month: type: number sql: MONTH(${TABLE}.order_date) ;; fiscal_year: sql: CASE WHEN ${date_month} >= 4 THEN ${date_year} ELSE ${date_year} - 1 END ;; fiscal_month: sql: CASE WHEN ${date_month} >= 4 THEN ${date_month} - 3 ELSE ${date_month} + 9 END ;;`

Engineering the Principles for Scale

Building a successful semantic layer requires a deep understanding of the business’ core principles. It must act as a shared language and a foundation of trust across all teams – engineering, analytics, data science, and business operations.

To achieve this, a semantic layer must be designed to function as:

A contract and dictionary for business terms, ensuring that concepts like “Active User,” “Revenue,” or “Valid Listing” mean the same thing everywhere.
A compiler that translates business requests into optimized SQL, so users focus on intent rather than implementation.
A governance and security layer, enforcing access controls, data quality rules, and standardized definitions.
A universal metric API, exposing consistent, reusable metrics to dashboards, notebooks, ML pipelines, and applications.

Lessons from Designing the Semantic Layer at Mercari

Considering all of the above principles in mind, we decided to come up with a Semantic Layer that would abide by all of the best practices and the following were our findings.

1. One Definition of Metrics

Instead of redefining key metrics repeatedly, we coded the definition in a semantic model definition/data model configuration as:

...
measure: revenue {
  type: sum
  sql: ${order_price} ;;
}
...

All downstream models inherit the exact same definition, eliminating drift and guaranteeing consistency. When a definition changes, the core model is updated once and the change propagates automatically to every dashboard, report, or AI agent. This removes duplication and prevents conflicting logic.

When other models need to adjust or extend certain dimensions, they can do so within their inherited model definition without affecting any others. This provides flexibility while preserving a consistent foundation.

2. Logical Models over Explicit SQL Joins

Instead of the analysts or data scientists to hand-write joins repeatedly, the semantic layer represents relationships logically and lets the engine generate the optimal SQL automatically.

Traditional SQL Version

...
FROM 
    orders 
JOIN 
    customers 
ON orders.customer_id = customers.id

Equivalent Semantic Model Definition

explore: orders { 
    join: customers { 
        type: left_outer 
        sql_on: ${orders.customer_id} = ${customers.id} ;;
    }
 }

The semantic engine constructed the correct and most efficient SQL based solely on the dimensions and measures requested, fully abstracting away the complex SQL logics and table relationships. This eliminated the writing of repetitive boilerplate SQL, reduced the errors in analytical logic and ensured every downstream consumer to use consistent, and validated relationships without thinking about the underlying complexity

3. Insulation From Schema Changes

We decoupled data modeling from data querying. The Semantic layer acted as a protective buffer between warehouse changes and downstream users. When a column is renamed, a table is split, or fields are reorganized, the update is applied once in the semantic model definition. All dashboards, reports, AI agents and downstream applications continue to function without interruption. This prevented breakages, eliminated emergency fixes and ensured stability even when the schema evolves.

4. Enables Next-Generation Analytics

When meaning is consistently encoded in the semantic layer, advanced analytics become far more reliable. Natural Language Query (NLQ) systems can interpret user intent accurately, drag-and-drop BI tools generate correct and expressive queries, and external applications can access trustworthy metrics through a universal API. This foundation unlocks a new class of analytical and AI-driven capabilities without requiring every tool to understand the underlying data structures.

5. Explicit Business Meaning and Contracts

The semantic layer formalizes business definitions, turning what was once tacit, undocumented knowledge into explicit, governed contracts. It answers foundational questions such as: What defines a “customer”? What counts as an “active user”? How is Gross Merchandise Volume (GMV) calculated? By codifying these concepts, the semantic layer ensures that every team, tool, and workflow operates from the same authoritative definitions. In this way, semantic layers become the system of record for business meaning.

Conclusion: Meaning Before Measurement

The future of data is not SQL-first or dashboard-first – it is semantic-first. A strong data foundation must prioritize meaning, not mechanics.
Raw data without semantics is just storage. Data enriched with shared definitions, business logic, and consistent rules becomes trustworthy insight, operational intelligence, and scalable decision-making.

Semantics turn data into understanding, and understanding is what organizations ultimately depend on to move faster, align better, and make smarter decisions at scale.

Tomorrow’s article will be QAエンジニアがAIで日々の課題を解決した話 by Yuga Hashimoto.

Engineering The Semantic Layer: Principles for Data at Scale

A Fragmented Workflow

The Challenges of a SQL-First Exploration

Semantics Layer, Simplified

Raw Data vs. a Semantic Layer

The “What” vs. the “How”: Business Policies Encoded in the Semantic Layer

Engineering the Principles for Scale

Lessons from Designing the Semantic Layer at Mercari

Conclusion: Meaning Before Measurement

Related article

When Speed Wasn’t About Coding Faster: Our Journey to ‘One Person One Release’

Search Results Quality Monitoring with LLMs

Finally, Mercari Japan in English! Our Road to Cross-Platform i18n