NOTICE
The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:
pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip
For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.
Deployment of machine learning models requires repeating feature engineering steps on new data. In some cases, these steps need to be performed in near real-time. Featuretools has capabilities to ease the deployment of feature engineering.
First, let’s generate some training and test data in the same format. We pass a different random seed when loading the test data so that it differs from the training data.
Note
Features saved in one version of Featuretools are not guaranteed to load in another. This means the features might need to be re-created after upgrading Featuretools.
[1]:
import featuretools as ft

es_train = ft.demo.load_mock_customer(return_entityset=True)
es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)
Now let’s build some feature definitions using DFS. Because we have categorical features, we also one-hot encode them based on the values present in the training data.
[2]:
feature_matrix, feature_defs = ft.dfs(entityset=es_train, target_dataframe_name="customers")
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc
5 rows × 121 columns
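To see why the encoding is tied to the training values, here is a minimal analogy in plain pandas. It is not how Featuretools implements `encode_features`; the `zip_code` column and its values are made up for illustration. The point is only that the set of encoded columns is fixed by the training data, so a category that first appears in the test data does not create a new column.

```python
import pandas as pd

# Hypothetical data: "02139" appears only in the test set.
train = pd.DataFrame({"zip_code": ["60091", "13244", "60091"]})
test = pd.DataFrame({"zip_code": ["13244", "02139"]})

# Learn the encoded columns from the training data only.
train_enc = pd.get_dummies(train, columns=["zip_code"], dtype=int)
encoded_columns = list(train_enc.columns)

# Encode the test data, then align it to the training columns.
# Unseen categories are dropped; categories missing from the test data become 0.
test_enc = pd.get_dummies(test, columns=["zip_code"], dtype=int).reindex(
    columns=encoded_columns, fill_value=0
)

print(encoded_columns)
print(test_enc)
```

A model trained on `train_enc` can consume `test_enc` directly because both frames share the same columns in the same order.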
Now, we can use featuretools.save_features to save a list of features to a JSON file:
[3]:
ft.save_features(features_enc, "feature_definitions.json")
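Since saved features are not guaranteed to load in a different Featuretools version, one option is to record the version next to the saved file and check it before loading. The sketch below uses only the standard library; the helper names `write_version_sidecar` and `saved_with_same_version` and the `.meta.json` suffix are hypothetical conventions chosen for this example, not part of Featuretools.

```python
import json
from pathlib import Path


def _sidecar_path(features_path):
    # Hypothetical convention: store metadata next to the features file.
    return Path(str(features_path) + ".meta.json")


def write_version_sidecar(features_path, library_version):
    """Record the library version that produced the saved features."""
    sidecar = _sidecar_path(features_path)
    sidecar.write_text(json.dumps({"featuretools_version": library_version}))
    return sidecar


def saved_with_same_version(features_path, library_version):
    """Return True only if a sidecar exists and its version matches."""
    sidecar = _sidecar_path(features_path)
    if not sidecar.exists():
        return False
    recorded = json.loads(sidecar.read_text())
    return recorded.get("featuretools_version") == library_version
```

In this guide's setting, it could be called as `write_version_sidecar("feature_definitions.json", ft.__version__)` right after saving. If `saved_with_same_version` later returns `False`, re-creating the features with DFS is the safer path.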
We can use featuretools.load_features to read in a list of saved features to calculate for our new entity set.
[4]:
saved_features = ft.load_features('feature_definitions.json')
After we load the features back in, we can calculate the feature matrix.
[5]:
feature_matrix = ft.calculate_feature_matrix(saved_features, es_test)
feature_matrix
As you can see above, we have the exact same features as before, but calculated using the test data.
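Rather than comparing the two tables by eye, the match can be checked programmatically. The following is a small sketch; `column_mismatch` is a hypothetical helper written for this example and is not a Featuretools function.

```python
import pandas as pd


def column_mismatch(train_fm, test_fm):
    """Return (missing, extra): columns absent from test_fm, and columns only in test_fm."""
    train_cols = set(train_fm.columns)
    test_cols = set(test_fm.columns)
    missing = [c for c in train_fm.columns if c not in test_cols]
    extra = [c for c in test_fm.columns if c not in train_cols]
    return missing, extra


# Tiny illustrative frames with made-up column names.
train_fm = pd.DataFrame(columns=["COUNT(sessions)", "zip_code = 60091"])
test_fm = pd.DataFrame(columns=["COUNT(sessions)", "zip_code = 60091"])
print(column_mismatch(train_fm, test_fm))
```

With the variables from this guide, `column_mismatch(feature_matrix_enc, feature_matrix)` should return two empty lists if the test matrix has exactly the training features.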
The feature matrix is a pandas DataFrame, so we can save it to disk:
[6]:
feature_matrix.to_csv("feature_matrix.csv")
We can also read it back in as follows:
[7]:
import pandas as pd

saved_fm = pd.read_csv("feature_matrix.csv", index_col="customer_id")
saved_fm
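One caveat with a CSV round trip is that pandas infers column types again on load, so the reloaded frame may not have the same dtypes as the original. The sketch below illustrates this with a made-up two-row frame and an in-memory buffer instead of a file. The `zip_code` column is an assumption for illustration and may not appear in this form in the encoded feature matrix.

```python
import io

import pandas as pd

# Hypothetical feature matrix with a string column that looks numeric.
fm = pd.DataFrame(
    {"zip_code": ["02139", "60091"], "COUNT(sessions)": [3, 5]},
    index=pd.Index([1, 2], name="customer_id"),
)

buffer = io.StringIO()
fm.to_csv(buffer)

# Default load: types are inferred, so "02139" is parsed as the integer 2139.
buffer.seek(0)
inferred = pd.read_csv(buffer, index_col="customer_id")

# Explicit dtype: the column stays a string and keeps its leading zero.
buffer.seek(0)
typed = pd.read_csv(buffer, index_col="customer_id", dtype={"zip_code": str})

print(inferred["zip_code"].tolist())
print(typed["zip_code"].tolist())
```

If exact dtypes matter downstream, passing `dtype=` to `pd.read_csv` for the affected columns is one way to keep them stable.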