Deployment of machine learning models requires repeating feature engineering steps on new data. In some cases, these steps need to be performed in near real-time. Featuretools has capabilities to ease the deployment of feature engineering.
First, let’s generate some training and test data in the same format. We pass a different random seed for the test set so that its data differs from the training data.
Features saved in one version of Featuretools are not guaranteed to load in another. This means the features might need to be re-created after upgrading Featuretools.
import featuretools as ft
es_train = ft.demo.load_mock_customer(return_entityset=True)
es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)
Now let’s build some feature definitions using DFS. Because we have categorical features, we also encode them with one-hot encoding based on the values in the training data.
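# NOTE: this call is cut off in the source. target_entity="customers" is an
# assumption, inferred from the customer_id index used when reading the CSV later.
feature_matrix, feature_defs = ft.dfs(entityset=es_train,
                                      target_entity="customers")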
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
5 rows × 121 columns
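To see why the encoding is tied to the training data, here is a small pure-Python sketch of the idea. It is an illustration only, not how `ft.encode_features` is implemented: the helper names `fit_categories` and `one_hot` and the device values are hypothetical, and mapping unseen values to an all-zero row is this sketch's own choice.

```python
# Illustrative only: a hand-rolled one-hot encoder, not Featuretools' implementation.

def fit_categories(train_values):
    """Record the distinct categories seen in training, in first-seen order."""
    return list(dict.fromkeys(train_values))


def one_hot(values, categories):
    """Encode each value as a 0/1 row over the training categories.

    A value that never appeared in training becomes an all-zero row, so the
    number of columns stays fixed.
    """
    return [[int(value == category) for category in categories] for value in values]


train_devices = ["desktop", "mobile", "tablet", "mobile"]  # made-up values
test_devices = ["mobile", "smartwatch"]  # "smartwatch" was not seen in training

categories = fit_categories(train_devices)
train_encoded = one_hot(train_devices, categories)
test_encoded = one_hot(test_devices, categories)
```

Because the column set is learned once from the training values, the test data is encoded into the same columns, which is what a trained model expects at deployment time.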
Now we can use featuretools.save_features to save a list of features to a JSON file, such as the feature_definitions.json file loaded below.
We can use featuretools.load_features to read in a list of saved features to calculate for our new entity set.
saved_features = ft.load_features('feature_definitions.json')
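Since saved features are not guaranteed to load across Featuretools versions, one possible safeguard is to record the version next to the saved file and compare it before loading. The sketch below uses only the standard library. The helpers `write_version_stamp` and `versions_match`, the stamp file name, and the version strings are all hypothetical and not part of Featuretools.

```python
import json
import tempfile
from pathlib import Path


def write_version_stamp(stamp_path, library_version):
    """Store the library version that produced the saved feature file."""
    Path(stamp_path).write_text(json.dumps({"featuretools_version": library_version}))


def versions_match(stamp_path, current_version):
    """Return True when the recorded version equals the running version."""
    recorded = json.loads(Path(stamp_path).read_text())["featuretools_version"]
    return recorded == current_version


# Placeholder version strings; in practice you would pass the installed version.
stamp_path = Path(tempfile.mkdtemp()) / "feature_definitions.version.json"
write_version_stamp(stamp_path, "0.0.1")

same_version = versions_match(stamp_path, "0.0.1")
after_upgrade = versions_match(stamp_path, "0.0.2")
```

When the check fails, the safer path is to re-create the features with DFS under the new version rather than load the old file.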
After we load the features back in, we can calculate the feature matrix.
feature_matrix = ft.calculate_feature_matrix(saved_features, es_test)
As you can see above, we have the exact same features as before, but calculated using the test data.
The feature matrix is a pandas DataFrame, so we can save it to disk, for example as feature_matrix.csv using its to_csv method.
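Saving could look like the minimal sketch below. It uses a small stand-in DataFrame with made-up column names and values, so it runs without Featuretools. The temporary directory is also an assumption of the sketch. Only the `feature_matrix.csv` file name and the `customer_id` index come from the text.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the real feature matrix: made-up values, indexed by customer_id.
feature_matrix = pd.DataFrame(
    {"COUNT(sessions)": [8, 7], "NUM_UNIQUE(sessions.device)": [3, 2]},
    index=pd.Index([1, 2], name="customer_id"),
)

# Write to a temporary directory so the sketch leaves the working directory untouched.
csv_path = Path(tempfile.mkdtemp()) / "feature_matrix.csv"
feature_matrix.to_csv(csv_path)  # the customer_id index is written as the first column
```

Because `to_csv` writes the index as an ordinary column, the index name has to be passed back as `index_col` when the file is read.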
We can also read it back in as follows:
import pandas as pd
saved_fm = pd.read_csv("feature_matrix.csv", index_col="customer_id")