NOTICE

The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:

pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip

For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.


Deployment

Deployment of machine learning models requires repeating feature engineering steps on new data. In some cases, these steps need to be performed in near real-time. Featuretools has capabilities to ease the deployment of feature engineering.

Saving Features

First, let’s generate some training and test data in the same format. We use a different random seed to generate different data for the test set.

Note

Features saved in one version of Featuretools are not guaranteed to load in another. This means the features might need to be re-created after upgrading Featuretools.

[1]:
import featuretools as ft

es_train = ft.demo.load_mock_customer(return_entityset=True)
es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)

Now let’s build some feature definitions using DFS. Because we have categorical features, we also encode them with one-hot encoding based on the values in the training data.

[2]:
feature_matrix, feature_defs = ft.dfs(entityset=es_train,
                                      target_dataframe_name="customers")

feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc
[2]:
customer_id | COUNT(sessions) | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | NUM_UNIQUE(transactions.product_id) | SKEW(transactions.amount) | STD(transactions.amount) | SUM(transactions.amount) | ... | MODE(sessions.MODE(transactions.product_id)) is unknown | MODE(sessions.MONTH(session_start)) = 1 | MODE(sessions.MONTH(session_start)) is unknown | MODE(sessions.WEEKDAY(session_start)) = 2 | MODE(sessions.WEEKDAY(session_start)) is unknown | MODE(sessions.YEAR(session_start)) = 2014 | MODE(sessions.YEAR(session_start)) is unknown | MODE(transactions.sessions.device) = mobile | MODE(transactions.sessions.device) = desktop | MODE(transactions.sessions.device) is unknown
5 | 6 | 3 | 79 | 149.02 | 80.375443 | 7.55 | 5 | -0.025941 | 44.095630 | 6349.66 | ... | False | True | False | True | False | True | False | True | False | False
4 | 8 | 3 | 109 | 149.95 | 80.070459 | 5.73 | 5 | -0.036348 | 45.068765 | 8727.68 | ... | False | True | False | True | False | True | False | True | False | False
1 | 8 | 3 | 126 | 139.43 | 71.631905 | 5.81 | 5 | 0.019698 | 40.442059 | 9025.62 | ... | False | True | False | True | False | True | False | True | False | False
3 | 6 | 3 | 93 | 149.15 | 67.060430 | 5.89 | 5 | 0.418230 | 43.683296 | 6236.62 | ... | False | True | False | True | False | True | False | False | True | False
2 | 7 | 3 | 93 | 146.81 | 77.422366 | 8.73 | 5 | 0.098259 | 37.705178 | 7200.28 | ... | False | True | False | True | False | True | False | False | True | False

5 rows × 121 columns
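To illustrate why the encoding is fixed from the training data, here is a simplified, standard-library sketch of one-hot encoding against a fixed category list. It is not Featuretools' implementation; `fit_categories`, `one_hot`, and the sample device values are hypothetical and exist only for this illustration. The point is that a value never seen in training lands in the "is unknown" column instead of creating a new column.

```python
def fit_categories(train_values, top_n=10):
    """Pick the most frequent training values to become one-hot columns."""
    counts = {}
    for value in train_values:
        counts[value] = counts.get(value, 0) + 1
    # Sort by frequency (descending), then by value for a stable order.
    ranked = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return ranked[:top_n]


def one_hot(values, categories, name):
    """Encode values against fixed categories, with a catch-all 'is unknown' column."""
    rows = []
    for value in values:
        row = {f"{name} = {c}": value == c for c in categories}
        row[f"{name} is unknown"] = value not in categories
        rows.append(row)
    return rows


# Hypothetical sample data: "smartwatch" never appears in training.
train_devices = ["mobile", "desktop", "mobile", "tablet"]
test_devices = ["desktop", "smartwatch"]

categories = fit_categories(train_devices, top_n=2)
encoded_test = one_hot(test_devices, categories, "device")
```

Because the category list comes only from the training values, the encoded test rows always have the same columns as the encoded training rows, which is what a trained model expects.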

Now we can use featuretools.save_features to save a list of features to a JSON file.

[3]:
ft.save_features(features_enc, "feature_definitions.json")
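Since features saved in one version of Featuretools are not guaranteed to load in another, it can help to record the version that wrote the file. The following is a minimal sketch using only the standard library; `write_version_sidecar` and `saved_with_same_version` are hypothetical helpers, not part of Featuretools, and the `.meta.json` sidecar naming is an assumption made for this example.

```python
import json
from pathlib import Path


def write_version_sidecar(features_path, library_version):
    """Record the library version next to a saved feature file (hypothetical helper)."""
    sidecar = Path(str(features_path) + ".meta.json")
    sidecar.write_text(json.dumps({"featuretools_version": library_version}))
    return sidecar


def saved_with_same_version(features_path, current_version):
    """Return True if the sidecar records the same version as the running library."""
    sidecar = Path(str(features_path) + ".meta.json")
    if not sidecar.exists():
        # No record: treat the version as unknown so the caller can re-create the features.
        return False
    recorded = json.loads(sidecar.read_text())["featuretools_version"]
    return recorded == current_version
```

In this guide's setting, one might call `write_version_sidecar("feature_definitions.json", ft.__version__)` right after saving, and check `saved_with_same_version("feature_definitions.json", ft.__version__)` before loading. If the check fails, re-creating the features with DFS is the safer path.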

Calculating Feature Matrix for New Data

We can use featuretools.load_features to read in a list of saved features to calculate for our new EntitySet.

[4]:
saved_features = ft.load_features('feature_definitions.json')

After we load the features back in, we can calculate the feature matrix.

[5]:
feature_matrix = ft.calculate_feature_matrix(saved_features, es_test)
feature_matrix
[5]:
customer_id | zip_code = 60091 | zip_code = 13244 | zip_code is unknown | COUNT(sessions) | MODE(sessions.device) = mobile | MODE(sessions.device) = desktop | MODE(sessions.device) is unknown | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | ... | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.NUM_UNIQUE(transactions.product_id)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | MODE(transactions.sessions.device) = mobile | MODE(transactions.sessions.device) = desktop | MODE(transactions.sessions.device) is unknown | NUM_UNIQUE(transactions.sessions.device)
1 | True | False | False | 6 | False | True | False | 3 | 73 | 147.64 | ... | 834.08 | 524.919674 | 198.92 | 25.0 | -1.546156 | 217.064024 | True | False | False | 3
4 | False | True | False | 9 | False | True | False | 3 | 126 | 147.55 | ... | 1180.90 | 733.862898 | 193.08 | 43.0 | -1.797214 | 319.497611 | False | True | False | 3
3 | True | False | False | 5 | True | False | False | 2 | 64 | 148.09 | ... | 715.80 | 407.390549 | 108.69 | 23.0 | 0.353061 | 215.417211 | True | False | False | 2
2 | False | True | False | 8 | False | True | False | 3 | 129 | 148.34 | ... | 1100.82 | 615.714934 | 136.01 | 39.0 | -0.082021 | 315.817331 | False | True | False | 3
5 | True | False | False | 7 | False | True | False | 3 | 108 | 149.53 | ... | 997.48 | 584.302915 | 137.50 | 33.0 | -0.595128 | 261.535265 | False | True | False | 3

5 rows × 121 columns

As you can see above, we have the exact same features as before, but calculated using the test data.
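Before handing a new feature matrix to a trained model, it may be worth confirming that its columns match the training matrix. The sketch below is one way to do that with plain lists of column names; `compare_columns` is a hypothetical helper and the column names are a small sample taken from the output above.

```python
def compare_columns(train_columns, test_columns):
    """Report columns missing from, or unexpected in, the test feature matrix."""
    train_set, test_set = set(train_columns), set(test_columns)
    return {
        "missing": [c for c in train_columns if c not in test_set],
        "unexpected": [c for c in test_columns if c not in train_set],
        "same_order": list(train_columns) == list(test_columns),
    }


# Same features in both lists, but in a different order.
report = compare_columns(
    ["COUNT(sessions)", "zip_code = 60091", "zip_code is unknown"],
    ["zip_code = 60091", "zip_code is unknown", "COUNT(sessions)"],
)
```

With the matrices from this guide, the call could be `compare_columns(list(feature_matrix_enc.columns), list(feature_matrix.columns))`. If only the order differs, indexing the test matrix with the training columns, as in `feature_matrix[feature_matrix_enc.columns]`, puts the columns in the training order.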

Exporting Feature Matrix

Save as CSV

The feature matrix is a pandas DataFrame that we can save to disk:

[6]:
feature_matrix.to_csv("feature_matrix.csv")

We can also read it back in as follows:

[7]:
import pandas as pd

saved_fm = pd.read_csv("feature_matrix.csv", index_col="customer_id")
saved_fm
[7]:
customer_id | zip_code = 60091 | zip_code = 13244 | zip_code is unknown | COUNT(sessions) | MODE(sessions.device) = mobile | MODE(sessions.device) = desktop | MODE(sessions.device) is unknown | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | ... | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.NUM_UNIQUE(transactions.product_id)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | MODE(transactions.sessions.device) = mobile | MODE(transactions.sessions.device) = desktop | MODE(transactions.sessions.device) is unknown | NUM_UNIQUE(transactions.sessions.device)
1 | True | False | False | 6 | False | True | False | 3 | 73 | 147.64 | ... | 834.08 | 524.919674 | 198.92 | 25.0 | -1.546156 | 217.064024 | True | False | False | 3
4 | False | True | False | 9 | False | True | False | 3 | 126 | 147.55 | ... | 1180.90 | 733.862898 | 193.08 | 43.0 | -1.797214 | 319.497611 | False | True | False | 3
3 | True | False | False | 5 | True | False | False | 2 | 64 | 148.09 | ... | 715.80 | 407.390549 | 108.69 | 23.0 | 0.353061 | 215.417211 | True | False | False | 2
2 | False | True | False | 8 | False | True | False | 3 | 129 | 148.34 | ... | 1100.82 | 615.714934 | 136.01 | 39.0 | -0.082021 | 315.817331 | False | True | False | 3
5 | True | False | False | 7 | False | True | False | 3 | 108 | 149.53 | ... | 997.48 | 584.302915 | 137.50 | 33.0 | -0.595128 | 261.535265 | False | True | False | 3

5 rows × 121 columns