Introduction
In Part 1 of Toward Data Science, I briefly explained what a feature store is and how it differs from a data warehouse. In this blog, I explain the technical details of how a feature store is built and why it is gaining so much popularity across MLOps platforms.

Let's first recap what a feature store is. A feature store is a centralized repository designed to manage and store preprocessed, transformed, and refined data variables, known as "features," which are widely used in machine learning (ML) and artificial intelligence (AI). It acts as a dedicated location for organizing and sharing these features across the various segments of an organization's ML pipelines. Its primary role is to hold a diverse range of features that have been processed and engineered from raw data, making them readily available for model development.

One of its key functions is ensuring consistency in feature engineering, allowing different teams to reuse standard features across models and projects. A feature store also maintains version control for features, enabling traceability and reproducibility of model results and experiments, and it fosters collaboration by providing standardized features that can be shared across multiple models, projects, and departments. With scalable and efficient storage, a feature store manages a high volume of features, streamlining the machine learning process and reducing redundancy, ultimately leading to faster model iteration, improved model quality, and enhanced productivity for data science teams.
Building Elements
Here I briefly explain what is required to build a feature store in Azure using Feast. Feast is an open-source feature store focused specifically on feature management; it provides a cross-platform solution that can be used with various cloud and ML platforms such as GCP, although I personally have never used it on GCP.
To build a feature store using Feast, three key elements need to be defined:
1. Define the Data Source:
a. Description: The data source refers to the raw underlying data, typically stored in a table in Azure SQL Database or Synapse SQL.
b. Feast Model: Feast employs a time-series data model to represent data, accounting for the temporal aspect of features.
2. Define the Feature View:
a. Description: A feature view is an abstraction representing a logical grouping of time-series feature data as it exists in a specified data source. Feature views organize and structure features, making them accessible for model training and serving.
b. Components:
Entities: Entities are collections of semantically related features. They are defined as part of feature views and are used to map to the domain of a specific use case.
Features: Features represent the individual characteristics or attributes within an entity. These are the elements used in machine learning models.
Data Source: Each feature view is associated with a data source, connecting it to the raw data.
3. Define Entities:
a. Description: Entities are semantically related collections of features. They are defined within feature views and play a crucial role in mapping to the domain of the use case.
b. Entity Key: Entities have keys (entity keys) that are used when looking up feature values from the online store and in point-in-time joins. These keys identify the primary key for storing and retrieving feature values.
c. Composite Entities: A feature view can include more than one entity object, forming a composite entity. This allows for flexibility and reuse of entities across different feature views.
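The entity-key lookup and point-in-time join mentioned above can be illustrated with a small, Feast-free sketch in plain Python. The `driver_id` key and `trips` feature below are invented for illustration; the point is that a lookup as of time T must only see feature rows whose event timestamp is at or before T:

```python
from datetime import datetime

# Toy offline store: feature rows keyed by an entity key, each with an
# event timestamp recording when the feature value became known.
feature_rows = [
    {"driver_id": 1001, "event_ts": datetime(2024, 1, 1), "trips": 10},
    {"driver_id": 1001, "event_ts": datetime(2024, 1, 3), "trips": 15},
    {"driver_id": 1002, "event_ts": datetime(2024, 1, 2), "trips": 7},
]

def point_in_time_lookup(entity_key, as_of, rows):
    """Return the latest feature row for entity_key with event_ts <= as_of."""
    candidates = [
        r for r in rows
        if r["driver_id"] == entity_key and r["event_ts"] <= as_of
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["event_ts"])

row = point_in_time_lookup(1001, datetime(2024, 1, 2), feature_rows)
print(row["trips"])  # 10: the Jan 3 row is excluded because it is "from the future"
```

Excluding future rows is exactly what prevents training-time data leakage when historical features are joined to a training set.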
Bear in mind that Feast does not generate feature values; it acts as the ingestion and serving system. The data sources described within feature views should reference feature values in their already computed form.
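As a concrete illustration, the three definitions might look roughly like this in Feast's Python API. The driver entity, the parquet path, and the field names are invented for this sketch, and the exact API surface varies between Feast versions, so treat this as a definition-file outline rather than copy-paste code:

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# 1. Data source: already-computed feature values with an event timestamp column.
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",  # illustrative path
    timestamp_field="event_timestamp",
)

# 2. Entity: the key used for online lookups and point-in-time joins.
driver = Entity(name="driver", join_keys=["driver_id"])

# 3. Feature view: a logical grouping of time-series features over the source.
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)
```

Note that the source points at precomputed values, consistent with Feast serving rather than generating features.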
Storage Type
The choice of a feature store hinges on how the features will be used and the overarching goal behind their implementation. While the primary purpose of a feature store is to facilitate model building, the ways models are trained and served can differ significantly across scenarios. Consider a model that requires frequent retraining, perhaps every few minutes, or one expected to serve predictions millions of times per hour. In such cases, opting for an online feature store becomes imperative, ensuring ultra-low latency, often on the order of milliseconds. This low latency is particularly crucial for applications involving streaming data and real-time predictions. Conversely, when dealing with models trained or consumed in batch, the focus shifts to high-latency features. Offline features in this context exhibit a range of latencies, spanning from roughly an hour to a day or more. Consulting this latency spectrum helps pinpoint the most suitable type of feature store for a given use case (Figure 1).
Feature Store in Big Picture
Let's step back and take a broader look to understand how and where Feature Stores play a crucial role in the overall process. I've created a simple diagram (Figure 2) to illustrate the utility of Feature Stores.
Step 0: Data Engineering Foundation
In a previous blog post, I highlighted the pivotal role of data engineering. In essence, data engineers construct pipelines to gather data from diverse sources and establish data lakes or data warehouses on preferred cloud platforms. This infrastructure becomes the accessible reservoir of data for data scientists to explore.
Step 1: Data Scientist's Exploration
With access to the data, data scientists embark on exploring potential opportunities for data science work. However, to maintain clear, explainable, and understandable data usage for data science work, it's crucial to comprehend how the required data for a model is prepared and identify the main features used in training. This is where the concept of a feature store proves invaluable. Data scientists, at this stage, read data from the sources built by data engineers, apply necessary transformations, and store it as a dataset. They then commence work on building feature stores, starting with the definition of data sources/datasets, feature views, and entities along with their keys.
Step 2: Feature Store Registration
Once all elements of the feature store are defined, it's time to apply Feature Store Registration. This process registers the feature store, similar to how we register models in the MLOps process. The registered feature store is stored in a binary format, typically on a data lake such as Blob Storage on Azure. This approach facilitates versioning of feature stores and allows for meticulous tracking of all changes. Implementing a CI/CD process ensures a controlled and transparent integration process.
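With Feast specifically, this registration step corresponds to applying the repository's definitions, either via the `feast apply` CLI or programmatically. The following is a sketch assuming a Feast repository with a `feature_store.yaml`; the object names are the illustrative driver-stats ones, not anything Feast provides:

```python
from feast import FeatureStore

# Load the repository configuration (feature_store.yaml) and register the
# definitions in the Feast registry: a binary file that can be kept on a
# data lake such as Azure Blob Storage, so feature definitions are
# versioned and every change can be tracked through CI/CD.
store = FeatureStore(repo_path=".")
store.apply([driver, driver_hourly_stats])  # illustrative objects defined in the repo
```

Running this inside a CI/CD pipeline is what gives the controlled, transparent integration process described above.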
Step 3: Feature Access for Experimentation
Data scientists can now access the registered feature stores for experimentation and model development. To do this, they simply fetch the feature store and load it into a data frame.
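With Feast, "fetching the feature store and loading it into a data frame" typically means building an entity dataframe (entity keys plus timestamps) and requesting point-in-time-correct historical features. This is a sketch assuming a registered repository; the feature-view and column names are the illustrative ones used earlier:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity dataframe: which entities we want features for, and as of when.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": [datetime(2024, 1, 2), datetime(2024, 1, 2)],
    }
)

# Point-in-time join against the offline store, returned as a pandas frame
# ready for experimentation and model training.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()
```

The resulting dataframe contains one row per entity/timestamp pair, with each requested feature joined in as a column.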
Step 4: Model Registration
Upon completing experiments and model development, data scientists can register the model in the same manner as they registered features. This enables the registration and versioning of both models and the features used to train those models.
Summary
A Feature Store in MLOps serves as a pivotal component that seamlessly integrates data science and data engineering efforts. It emerges at the intersection of data exploration and model development, allowing data scientists to meticulously curate, transform, and store datasets for efficient model training. The process involves defining data sources, feature views, and entities with keys, creating a structured foundation for subsequent steps. Feature Store Registration, akin to model registration in the MLOps process, encapsulates the registered feature store in a binary format, fostering versioning and controlled integration through CI/CD processes. This organized approach ensures transparency, traceability, and a systematic framework for accessing feature stores during experimentation and model development.

The choice between online and offline feature stores depends on the specific requirements of the model and its use case. In scenarios demanding real-time predictions, an online feature store with ultra-low latency becomes imperative, while batch-oriented models may leverage high-latency features with latencies ranging from an hour to a day or more. The Feature Store thus emerges as a dynamic tool, adapting to the diverse needs of model training and deployment in the intricate landscape of MLOps.
Coming up…
In the upcoming post, we will delve into the internal workings of MLOps, breaking down the key criteria that form the bedrock for devising the right strategy and architecture. By dissecting these internal mechanisms, we aim to provide a comprehensive understanding of the factors that influence decision-making in crafting robust and effective strategies for machine learning operations. Stay tuned.