The Semantic Layer: The Hidden Accelerator for AI-Ready Data Architectures

AI developers face many challenges that embedding a semantic layer in the data platform can address. For example, imagine a developer building a model to better predict customer churn. They reach out to the sales and finance departments for datasets containing customer sales history, along with information on payment status, contracts, and other financial metrics.

The model depends heavily on the concept of an “active_customer,” and the developer finds a column by that name in both the sales and financial datasets. A quick inspection makes it clear that the definition of “active_customer” in the sales dataset is inconsistent with the one used in the financial dataset. Which should the model use? Since assumptions can’t be made, it falls to the developer to uncover the business logic behind the “active_customer” column in both datasets to determine which definition is appropriate. It turns out that the column in the sales dataset is closer to “active_customer_or_prospect”: it considers whether customers have active contracts but also includes current prospects based on their sales phase. The financial dataset, in contrast, counts a customer as “active” only if there is an outstanding contract.
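
To make the mismatch concrete, here is a minimal sketch of the two definitions as they might look in SQL. The table and column names are illustrative assumptions, not a real schema:

```python
# Illustrative sketch of the conflicting definitions described above.
# Table and column names are assumptions for this example, not a real schema.

# Sales: "active" includes prospects in a late sales phase.
SALES_ACTIVE_CUSTOMER_SQL = """
SELECT customer_id
FROM sales.customers
WHERE has_active_contract = TRUE
   OR sales_phase IN ('proposal', 'negotiation')  -- prospects count too
"""

# Finance: "active" strictly means an outstanding contract.
FINANCE_ACTIVE_CUSTOMER_SQL = """
SELECT customer_id
FROM finance.customers
WHERE has_active_contract = TRUE
"""

# Same label, two different populations: a model trained on one definition
# will silently disagree with reports built on the other.
```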

A semantic layer can help here. It is a logical abstraction layer that sits between raw data and the end users of BI tools, translating complex technical data structures into meaningful business terms. By defining terms like “active_customer” once and maintaining a single, consistent definition in one place for all teams to use, it provides a unified view of the data, simplifies data access, and ensures that everyone works from the same definitions when analyzing data. Without a semantic layer, AI developers will struggle to establish a consistent system of business metric definitions.
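
A minimal sketch of the alternative: the definition lives in one shared, documented place that both the churn model and any report consume, instead of two divergent copies. The module and field names below are hypothetical:

```python
# semantic_layer/metrics.py -- hypothetical shared module: the single place
# where "active_customer" is defined, documented, and owned.

ACTIVE_CUSTOMER = {
    "name": "active_customer",
    "description": "A customer with at least one outstanding contract. "
                   "Prospects are explicitly excluded.",
    "sql": """
        SELECT customer_id
        FROM finance.customers
        WHERE has_active_contract = TRUE
    """,
    "owner": "finance-data-team",
}

def active_customer_sql() -> str:
    """Return the canonical SQL so every consumer runs the same logic."""
    return ACTIVE_CUSTOMER["sql"]
```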

The Challenge of Undefined Semantics

Developing AI without well-defined, consistent metrics invites several challenges, including:

  • Hard-Coded Business Logic – In the absence of a system where business logic is already defined and the data easily retrievable, an AI developer may hard-code the business logic inside AI training scripts or AI agent code. The problem arises when another AI developer on the same project needs the same metric. Even if the metric is coded correctly at first, any future change to its definition requires careful updates across multiple scripts and codebases, increasing the risk of definition drift over time (the sketch after this list illustrates this failure mode).
  • Hard to Explain – Even when the data is accurate, without central, well-accepted definitions for many of the models’ inputs it can be difficult to explain how different models work. This erodes stakeholder confidence in the models’ outputs.
  • Fragile Pipelines – Defining these metrics in downstream workloads makes them more fragile to upstream schema changes than definitions agreed upon and established further upstream. The result is tedious sessions spent tracing data lineage to find what changed and broke the model. With a semantic layer in place to define these elements and keep everyone aligned, the likelihood of such breakage drops significantly.
  • Onboarding Challenges – New data scientists and data analysts must spend time learning the different ways people define the same metrics, and in which contexts, because well-documented, agreed-upon definitions are missing. This delays the time it takes for new team members to start providing value and increases the risk of the pitfalls already mentioned.
  • Inconsistency Between AI Projects and BI Dashboards – When data scientists and data analysts define these metrics within their individual tools, they risk inconsistency with one another, and AI algorithms may tell the organization a different story than a BI dashboard because definitions diverge across teams and tooling.
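
As an illustration of the hard-coding risk above, here is a hedged sketch of the failure mode; the file names and logic are hypothetical:

```python
# churn_model.py -- developer A hard-codes the definition in a training script.
def is_active(row) -> bool:
    return row["has_active_contract"]

# upsell_agent.py -- developer B later hard-codes a slightly different copy.
def is_active(row) -> bool:
    # The business definition has since changed to include a 30-day grace
    # period, but only this copy was updated -- the two scripts now disagree.
    return row["has_active_contract"] or row["days_since_contract_end"] <= 30
```

A semantic layer removes this class of bug by giving both scripts one definition to import or query.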

How to Implement a Semantic Layer

At the end of the day, a semantic layer isn’t a specific technology, but rather the practice of creating canonical definitions of common business metrics that are easily discoverable and usable. A semantic layer can be implemented in many ways, some of which require additional tooling in the data platform and some of which do not. Here are five key methods to implement a semantic layer:

1) Using SQL Views: Many databases, data warehouses, and data lakehouse systems let organizations define views, which are not separate copies of the data but logical representations based on rules expressed in the SQL that defines the view. The benefit of this approach is that the logic executes with each query, so as the underlying data updates, the view returns updated results. However, repeatedly running the business logic can lead to high computational costs unless a caching layer is added. Keep in mind that while SQL views can create a consistent version of the data to access, they don’t solve the problem of stakeholders understanding the definitions, since views carry no inherent documentation in most systems.
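
Here is a minimal sketch of the approach, using the standard-library sqlite3 module purely for illustration; the schema is a hypothetical example, and any warehouse that supports views works the same way:

```python
import sqlite3

# Publish the canonical definition once as a view.
CREATE_VIEW_SQL = """
CREATE VIEW IF NOT EXISTS active_customers AS
SELECT customer_id, customer_name
FROM customers
WHERE has_active_contract = 1
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id, customer_name, has_active_contract)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme', 1), (2, 'Globex', 0)")
conn.execute(CREATE_VIEW_SQL)

# The view's logic runs at query time, so results track the underlying table.
print(conn.execute("SELECT * FROM active_customers").fetchall())  # [(1, 'Acme')]
```

Because the logic re-executes on every read, heavily queried views may warrant a materialized view or caching layer, as noted above.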

2) BI-Tool-Based Semantic Layers: Many BI tools, such as Tableau and Power BI, come with built-in semantic layer features for defining and documenting business metrics. These capabilities let analysts and business users create centralized definitions of key KPIs, dimensions, and measures directly within the BI environment, which improves consistency and self-service within that specific tool. The challenge is that while these features help the teams using that particular BI tool, they offer little benefit to anyone else. Data scientists, backend engineers, and teams using different analytics tools often lack access to, or integration with, these definitions. This creates silos of understanding, where the same metric, such as “monthly active users” or “customer lifetime value,” may be defined differently depending on the tool or team. As a result, larger organizations using multiple BI tools, or a combination of BI and AI platforms, are often forced to duplicate and redefine metrics across environments, increasing the risk of inconsistency, confusion, and misaligned reporting.

3) Headless BI Tools: This category of purpose-built semantic layer tools is not tied to any specific method of consuming the end data, allowing the same definitions to be used across both BI and AI. These tools typically offer ways to define business metrics across multiple data sources using SQL, Python, or other expressive options. They often include mechanisms to cache these business metrics for faster access and provide features for building documentation, such as integrated wikis. Tools in this category include Cube, Dremio, AtScale, and dbt, each approaching the problem in its own way. Headless BI tools often let you capture modeling that goes beyond a single column, metric, or feature, extending to whole tables and views on the data that can then be cached and used as needed for AI and BI.
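
The mechanics differ by tool, but the core idea can be sketched generically: one declarative definition, compiled on demand for whichever consumer asks. The schema below is illustrative and is not the actual format of Cube, dbt, AtScale, or any other product:

```python
# Generic sketch of the headless-BI idea: a tool-agnostic metric definition
# compiled for different consumers (BI dashboards, feature pipelines, agents).

MONTHLY_ACTIVE_CUSTOMERS = {
    "name": "monthly_active_customers",
    "measure": "COUNT(DISTINCT customer_id)",
    "source": "finance.customers",
    "filter": "has_active_contract = TRUE",
    "time_dimension": "activity_month",
}

def compile_to_sql(metric: dict) -> str:
    """Render the shared definition as SQL for any SQL-speaking consumer."""
    return (
        f"SELECT {metric['time_dimension']}, "
        f"{metric['measure']} AS {metric['name']} "
        f"FROM {metric['source']} "
        f"WHERE {metric['filter']} "
        f"GROUP BY {metric['time_dimension']}"
    )

print(compile_to_sql(MONTHLY_ACTIVE_CUSTOMERS))
```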

4) Metric Stores: Metric stores, or metrics-as-a-service providers, offer a centralized solution for defining, storing, and serving business metrics in a consistent, reusable way across teams and tools. These systems are purpose-built to manage metric definitions independently of any specific application, letting users query metrics via APIs or SDKs with built-in support for versioning, lineage, and governance. This decoupling of metric logic from downstream implementations helps ensure consistency across analytics, reporting, and AI models. Examples in this category include Transform, Metriql, and GoodData, which let teams establish a trusted source of truth for business metrics regardless of the front-end application consuming them. Where headless BI tools typically expose metrics through a JDBC/ODBC/Arrow Flight interface, metrics-as-a-service offerings usually make them available via REST/gRPC/GraphQL APIs.
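
To illustrate the interface difference, here is a hedged sketch of fetching a governed metric over REST. The endpoint, parameters, and response shape are invented for illustration and do not correspond to any vendor’s actual API:

```python
import requests

# Hypothetical metrics-as-a-service call -- URL, params, and response shape
# are assumptions for this sketch, not a real vendor API.
resp = requests.get(
    "https://metrics.example.com/api/v1/metrics/active_customers",
    params={"grain": "month", "start": "2024-01-01", "end": "2024-06-30"},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()

for point in resp.json()["data"]:  # assumed response structure
    print(point["period"], point["value"])
```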

5) Standardizing with Enterprise Data Catalogs: Platforms like Collibra, Atlan, and Acryl Data offer catalogs that organizations can use to document and manage datasets across the enterprise, while also enabling users to discover assets and request access in a governed manner. These platforms often serve as the backbone of data governance strategies, providing visibility into data lineage, ownership, and usage. A key feature of many of these catalogs is a business glossary, which lets teams define business metrics and KPIs in a canonical, standardized way. The glossary acts as a single source of truth for metric definitions, offering clarity and alignment across departments. Even when metrics are hardcoded into specific scripts or pipelines, or embedded in downstream tools like BI dashboards and AI models, the catalog provides a reference point to validate their meaning and intent. Over time, this supports data literacy and onboarding and helps ensure consistency as metrics evolve. In complex, multi-team environments, such catalogs reduce duplication of effort, prevent definition drift, and support auditability across regulatory and operational requirements.
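
As a rough illustration of what a business-glossary entry captures, here is a generic sketch; the fields are assumptions, not the schema of Collibra, Atlan, or any other catalog:

```python
# Generic business-glossary entry -- fields are illustrative only.
GLOSSARY_ENTRY = {
    "term": "active_customer",
    "definition": "A customer with at least one outstanding contract; "
                  "prospects are excluded.",
    "owner": "finance-data-team",
    "status": "approved",
    "related_assets": ["finance.customers", "dashboards/churn_overview"],
    "last_reviewed": "2024-05-01",
}
```

Even when the actual computation lives elsewhere, an entry like this gives every team a reference point for what the metric means and who owns it.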

Next Steps?

There is no one-size-fits-all solution when it comes to implementing a semantic layer. The right approach for any organization will depend on its data stack, team structure, use cases, and existing tooling. In many cases, a hybrid strategy that combines multiple approaches, such as SQL views for fast prototyping, headless BI tools for shared modeling, and enterprise catalogs for governance, may offer the best balance of flexibility, accessibility, and consistency.

To determine the best path forward, consider the following questions:

  • Who needs access to these metrics? Just analysts, or also data scientists, engineers, and business users?
  • How many tools and platforms are being used across the organization for analytics, AI, and reporting?
  • Are your teams already aligned on metric definitions, or do silos exist between departments?
  • How frequently do metric definitions change, and how well are those changes communicated?
  • Do you need APIs, SQL endpoints, or BI integrations to expose metrics consistently?
  • Is your priority speed and flexibility, governance and auditability, or a balance of both?
  • What resources do you have available to maintain documentation and metric definitions over time?

Answers to these questions can help guide whether AI developers should start with lightweight options like views and BI-defined metrics, invest in a headless BI or metric store for broader consistency, or integrate semantic modeling into a data catalog for enterprise-wide governance.

For example, if a small team is working within a single data warehouse and using SQL as a primary interface, defining views can be a fast and effective way to standardize business logic without needing additional infrastructure. If the organization is scaling and supporting both BI dashboards and machine learning models, a headless BI approach or metric store can help ensure metrics are accessible and consistent across different tools and programming languages, while also offering better performance through caching.

In more complex enterprise environments where multiple departments rely on different tools and data sources, embedding semantic definitions into a centralized data catalog can provide the governance, documentation, and access control needed to manage metrics at scale. Each approach has trade-offs, but what matters most is taking the first step toward making your data definitions explicit, discoverable, and trustworthy, because that is what truly accelerates AI readiness.
