NOTICE

The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:

pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip

For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.

Deployment

Deployment of machine learning models requires repeating feature engineering steps on new data. In some cases, these steps need to be performed in near real-time. Featuretools has capabilities to ease the deployment of feature engineering.

Saving Features

First, let’s generate some training and test data in the same format. We pass a different random seed so that the test data differs from the training data.

Note

Features saved in one version of Featuretools are not guaranteed to load in another. This means the features might need to be re-created after upgrading Featuretools.

[1]:

import featuretools as ft

es_train = ft.demo.load_mock_customer(return_entityset=True)
es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)

Now let’s build some feature definitions using DFS. Because we have categorical features, we also encode them with one-hot encoding based on the values in the training data.

[2]:

feature_matrix, feature_defs = ft.dfs(entityset=es_train,
                                      target_dataframe_name="customers")

feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc

[2]:

	COUNT(sessions)	NUM_UNIQUE(sessions.device)	COUNT(transactions)	MAX(transactions.amount)	MEAN(transactions.amount)	MIN(transactions.amount)	NUM_UNIQUE(transactions.product_id)	SKEW(transactions.amount)	STD(transactions.amount)	SUM(transactions.amount)	...	MODE(sessions.MODE(transactions.product_id)) is unknown	MODE(sessions.MONTH(session_start)) = 1	MODE(sessions.MONTH(session_start)) is unknown	MODE(sessions.WEEKDAY(session_start)) = 2	MODE(sessions.WEEKDAY(session_start)) is unknown	MODE(sessions.YEAR(session_start)) = 2014	MODE(sessions.YEAR(session_start)) is unknown	MODE(transactions.sessions.device) = mobile	MODE(transactions.sessions.device) = desktop	MODE(transactions.sessions.device) is unknown
customer_id
5	6	3	79	149.02	80.375443	7.55	5	-0.025941	44.095630	6349.66	...	False	True	False	True	False	True	False	True	False	False
4	8	3	109	149.95	80.070459	5.73	5	-0.036348	45.068765	8727.68	...	False	True	False	True	False	True	False	True	False	False
1	8	3	126	139.43	71.631905	5.81	5	0.019698	40.442059	9025.62	...	False	True	False	True	False	True	False	True	False	False
3	6	3	93	149.15	67.060430	5.89	5	0.418230	43.683296	6236.62	...	False	True	False	True	False	True	False	False	True	False
2	7	3	93	146.81	77.422366	8.73	5	0.098259	37.705178	7200.28	...	False	True	False	True	False	True	False	False	True	False

5 rows × 121 columns
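To make "based on the values in the training data" concrete, here is a minimal pandas-only sketch of the idea. It is an illustration, not how ft.encode_features is implemented: the `device` column, the toy values, and the `encode` helper are all hypothetical. The point is that the indicator columns are fixed by the training data, so a value seen only in new data falls into the "is unknown" column rather than creating a new column.

```python
import pandas as pd

# Toy data (hypothetical): "tablet" appears only in the test set.
train = pd.DataFrame({"device": ["mobile", "desktop", "mobile"]})
test = pd.DataFrame({"device": ["tablet", "desktop"]})

# Learn the categories from the training data only.
train_categories = sorted(train["device"].unique())


def encode(df, categories):
    """One indicator column per training category, plus an "unknown" column."""
    out = pd.DataFrame(index=df.index)
    for cat in categories:
        out[f"device = {cat}"] = df["device"] == cat
    out["device is unknown"] = ~df["device"].isin(categories)
    return out


train_enc = encode(train, train_categories)
test_enc = encode(test, train_categories)

# Both frames share the same columns, regardless of what the test data contains.
print(list(test_enc.columns))
```

Because the column set depends only on the training categories, a model trained on the encoded training matrix can be applied to any later matrix encoded the same way.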

Now we can use featuretools.save_features to save a list of features to a JSON file.

[3]:

ft.save_features(features_enc, "feature_definitions.json")
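Since saved features are not guaranteed to load in a different version of Featuretools, it can help to record which version wrote the file. The sketch below is one possible approach using only the standard library. The sidecar file name and both helper functions are hypothetical and are not part of Featuretools; the version string is passed in by the caller, for example from `ft.__version__`.

```python
import json
from pathlib import Path


def write_version_stamp(features_path, library_version):
    """Write a sidecar file recording the version used to save the features."""
    # Hypothetical naming convention: "<features file>.version.json".
    stamp_path = Path(str(features_path) + ".version.json")
    stamp_path.write_text(json.dumps({"featuretools_version": library_version}))
    return stamp_path


def version_matches(features_path, current_version):
    """Return True if the sidecar file records the same version as current_version."""
    stamp_path = Path(str(features_path) + ".version.json")
    recorded = json.loads(stamp_path.read_text())["featuretools_version"]
    return recorded == current_version


# Example usage (assumes featuretools is imported as ft):
#   write_version_stamp("feature_definitions.json", ft.__version__)
#   if not version_matches("feature_definitions.json", ft.__version__):
#       ...  # re-create the features instead of loading them
```

An exact string comparison is deliberately strict. Whether a looser check would be safe depends on compatibility guarantees that this guide does not state.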

Calculating Feature Matrix for New Data

We can use featuretools.load_features to read in the list of saved features so that we can calculate them for our new EntitySet.

[4]:

saved_features = ft.load_features('feature_definitions.json')

After we load the features back in, we can calculate the feature matrix.

[5]:

feature_matrix = ft.calculate_feature_matrix(saved_features, es_test)
feature_matrix

[5]:

	zip_code = 60091	zip_code = 13244	zip_code is unknown	COUNT(sessions)	MODE(sessions.device) = mobile	MODE(sessions.device) = desktop	MODE(sessions.device) is unknown	NUM_UNIQUE(sessions.device)	COUNT(transactions)	MAX(transactions.amount)	...	SUM(sessions.MAX(transactions.amount))	SUM(sessions.MEAN(transactions.amount))	SUM(sessions.MIN(transactions.amount))	SUM(sessions.NUM_UNIQUE(transactions.product_id))	SUM(sessions.SKEW(transactions.amount))	SUM(sessions.STD(transactions.amount))	MODE(transactions.sessions.device) = mobile	MODE(transactions.sessions.device) = desktop	MODE(transactions.sessions.device) is unknown	NUM_UNIQUE(transactions.sessions.device)
customer_id
1	True	False	False	6	False	True	False	3	73	147.64	...	834.08	524.919674	198.92	25.0	-1.546156	217.064024	True	False	False	3
4	False	True	False	9	False	True	False	3	126	147.55	...	1180.90	733.862898	193.08	43.0	-1.797214	319.497611	False	True	False	3
3	True	False	False	5	True	False	False	2	64	148.09	...	715.80	407.390549	108.69	23.0	0.353061	215.417211	True	False	False	2
2	False	True	False	8	False	True	False	3	129	148.34	...	1100.82	615.714934	136.01	39.0	-0.082021	315.817331	False	True	False	3
5	True	False	False	7	False	True	False	3	108	149.53	...	997.48	584.302915	137.50	33.0	-0.595128	261.535265	False	True	False	3

5 rows × 121 columns

As you can see above, we have the exact same features as before, but calculated using the test data.
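The two displayed outputs appear to list the columns in a different order, which matters if a downstream model expects the training column order. The following pandas-only sketch checks that two feature matrices contain the same columns and reorders the new one to match. The helper name `align_to_training_columns` and the toy frames are hypothetical.

```python
import pandas as pd


def align_to_training_columns(train_fm, new_fm):
    """Return new_fm with its columns in the same order as train_fm.

    Raises ValueError if the two frames do not have the same set of columns.
    """
    missing = set(train_fm.columns) - set(new_fm.columns)
    extra = set(new_fm.columns) - set(train_fm.columns)
    if missing or extra:
        raise ValueError(
            f"column mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    return new_fm[list(train_fm.columns)]


# Toy frames with the same columns in a different order.
train_fm = pd.DataFrame({"COUNT(sessions)": [6, 8], "zip_code = 60091": [True, False]})
new_fm = pd.DataFrame({"zip_code = 60091": [True, True], "COUNT(sessions)": [7, 5]})

aligned = align_to_training_columns(train_fm, new_fm)
print(list(aligned.columns))
```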

Exporting Feature Matrix

Save as CSV

The feature matrix is a pandas DataFrame that we can save to disk:

[6]:

feature_matrix.to_csv("feature_matrix.csv")

We can also read it back in as follows:

[7]:

import pandas as pd

saved_fm = pd.read_csv("feature_matrix.csv", index_col="customer_id")
saved_fm

[7]:

	zip_code = 60091	zip_code = 13244	zip_code is unknown	COUNT(sessions)	MODE(sessions.device) = mobile	MODE(sessions.device) = desktop	MODE(sessions.device) is unknown	NUM_UNIQUE(sessions.device)	COUNT(transactions)	MAX(transactions.amount)	...	SUM(sessions.MAX(transactions.amount))	SUM(sessions.MEAN(transactions.amount))	SUM(sessions.MIN(transactions.amount))	SUM(sessions.NUM_UNIQUE(transactions.product_id))	SUM(sessions.SKEW(transactions.amount))	SUM(sessions.STD(transactions.amount))	MODE(transactions.sessions.device) = mobile	MODE(transactions.sessions.device) = desktop	MODE(transactions.sessions.device) is unknown	NUM_UNIQUE(transactions.sessions.device)
customer_id
1	True	False	False	6	False	True	False	3	73	147.64	...	834.08	524.919674	198.92	25.0	-1.546156	217.064024	True	False	False	3
4	False	True	False	9	False	True	False	3	126	147.55	...	1180.90	733.862898	193.08	43.0	-1.797214	319.497611	False	True	False	3
3	True	False	False	5	True	False	False	2	64	148.09	...	715.80	407.390549	108.69	23.0	0.353061	215.417211	True	False	False	2
2	False	True	False	8	False	True	False	3	129	148.34	...	1100.82	615.714934	136.01	39.0	-0.082021	315.817331	False	True	False	3
5	True	False	False	7	False	True	False	3	108	149.53	...	997.48	584.302915	137.50	33.0	-0.595128	261.535265	False	True	False	3

5 rows × 121 columns
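A CSV file does not store column types, so it is worth checking that a matrix read back from disk matches the original. The sketch below does this round trip on a small toy frame, using an in-memory buffer in place of a file. The three columns are a hypothetical subset chosen for illustration. Boolean, integer, and float columns like these are inferred back correctly by pandas.read_csv, whereas other types (for example datetimes or categoricals) would come back as plain strings unless parsed explicitly.

```python
import io

import pandas as pd

# Toy feature matrix (hypothetical subset of columns) indexed by customer_id.
fm = pd.DataFrame(
    {
        "zip_code = 60091": [True, False],
        "COUNT(sessions)": [6, 9],
        "MAX(transactions.amount)": [147.64, 147.55],
    },
    index=pd.Index([1, 4], name="customer_id"),
)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
fm.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="customer_id")

# Raises AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(fm, restored)
```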
