Deployment
Deployment of machine learning models requires repeating feature engineering steps on new data. In some cases, these steps need to be performed in near real-time. Featuretools has capabilities to ease the deployment of feature engineering.
Saving Features
First, let’s generate some training and test data in the same format. We pass a different random seed when loading the test data so that it differs from the training data.
Note
Features saved in one version of Featuretools are not guaranteed to load in another. This means the features might need to be re-created after upgrading Featuretools.
[1]:
import featuretools as ft
es_train = ft.demo.load_mock_customer(return_entityset=True)
es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)
Now let’s build some feature definitions using Deep Feature Synthesis (DFS). Because some of the features are categorical, we also one-hot encode them based on the values in the training data.
[2]:
feature_matrix, feature_defs = ft.dfs(entityset=es_train,
                                      target_dataframe_name="customers")
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc
[2]:
customer_id | COUNT(sessions) | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | NUM_UNIQUE(transactions.product_id) | SKEW(transactions.amount) | STD(transactions.amount) | SUM(transactions.amount) | ... | MODE(sessions.MODE(transactions.product_id)) is unknown | MODE(sessions.MONTH(session_start)) = 1 | MODE(sessions.MONTH(session_start)) is unknown | MODE(sessions.WEEKDAY(session_start)) = 2 | MODE(sessions.WEEKDAY(session_start)) is unknown | MODE(sessions.YEAR(session_start)) = 2014 | MODE(sessions.YEAR(session_start)) is unknown | MODE(transactions.sessions.device) = mobile | MODE(transactions.sessions.device) = desktop | MODE(transactions.sessions.device) is unknown |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 3 | 79 | 149.02 | 80.375443 | 7.55 | 5 | -0.025941 | 44.095630 | 6349.66 | ... | False | True | False | True | False | True | False | True | False | False |
4 | 8 | 3 | 109 | 149.95 | 80.070459 | 5.73 | 5 | -0.036348 | 45.068765 | 8727.68 | ... | False | True | False | True | False | True | False | True | False | False |
1 | 8 | 3 | 126 | 139.43 | 71.631905 | 5.81 | 5 | 0.019698 | 40.442059 | 9025.62 | ... | False | True | False | True | False | True | False | True | False | False |
3 | 6 | 3 | 93 | 149.15 | 67.060430 | 5.89 | 5 | 0.418230 | 43.683296 | 6236.62 | ... | False | True | False | True | False | True | False | False | True | False |
2 | 7 | 3 | 93 | 146.81 | 77.422366 | 8.73 | 5 | 0.098259 | 37.705178 | 7200.28 | ... | False | True | False | True | False | True | False | False | True | False |
5 rows × 121 columns
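To illustrate why the encoding is tied to the training data, here is a minimal plain-Python sketch of one-hot encoding against a fixed category list. It is not Featuretools' implementation; the helper names `fit_categories` and `one_hot` and the `top_n` parameter are hypothetical, chosen only to mirror the `= value` and `is unknown` column names shown above.

```python
from collections import Counter


def fit_categories(train_values, top_n=10):
    """Return the most frequent training values, most common first (hypothetical helper)."""
    return [value for value, _ in Counter(train_values).most_common(top_n)]


def one_hot(values, categories, name):
    """Encode values against a fixed category list, adding an 'is unknown' column."""
    rows = []
    for value in values:
        row = {f"{name} = {cat}": value == cat for cat in categories}
        # Values never seen in training fall into the catch-all column.
        row[f"{name} is unknown"] = value not in categories
        rows.append(row)
    return rows


train_devices = ["mobile", "desktop", "mobile", "desktop", "mobile"]
test_devices = ["desktop", "tablet"]  # "tablet" does not occur in training

categories = fit_categories(train_devices)
encoded_test = one_hot(test_devices, categories, "device")
for row in encoded_test:
    print(row)
```

Because the category list is fixed when the encoding is fit, the test rows get the same columns as the training rows, and the unseen value `"tablet"` is flagged as unknown rather than creating a new column.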
Now we can use featuretools.save_features to save the list of features to a JSON file.
[3]:
ft.save_features(features_enc, "feature_definitions.json")
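Given the note above that saved features may not load under a different Featuretools version, one option is to record the installed version next to the saved file and compare it before loading. The following is a sketch using only the standard library; the helper names and the `.meta.json` sidecar convention are assumptions made for illustration, not part of Featuretools.

```python
import json
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def write_version_sidecar(features_path, package="featuretools"):
    """Write the current package version to '<features_path>.meta.json' (hypothetical convention)."""
    sidecar_path = features_path + ".meta.json"
    with open(sidecar_path, "w") as f:
        json.dump({"package": package, "version": installed_version(package)}, f)
    return sidecar_path


def version_matches(features_path, package="featuretools"):
    """Return True if the recorded version equals the currently installed one."""
    with open(features_path + ".meta.json") as f:
        recorded = json.load(f)
    return recorded["version"] == installed_version(package)
```

In this notebook, that would mean calling `write_version_sidecar("feature_definitions.json")` after saving, and checking `version_matches("feature_definitions.json")` before loading in the deployment environment.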
Calculating Feature Matrix for New Data
We can use featuretools.load_features to read the saved list of features back in, so that they can be calculated for our new EntitySet.
[4]:
saved_features = ft.load_features('feature_definitions.json')
After we load the features back in, we can calculate the feature matrix.
[5]:
feature_matrix = ft.calculate_feature_matrix(saved_features, es_test)
feature_matrix
[5]:
customer_id | zip_code = 60091 | zip_code = 13244 | zip_code is unknown | COUNT(sessions) | MODE(sessions.device) = mobile | MODE(sessions.device) = desktop | MODE(sessions.device) is unknown | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | ... | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.NUM_UNIQUE(transactions.product_id)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | MODE(transactions.sessions.device) = mobile | MODE(transactions.sessions.device) = desktop | MODE(transactions.sessions.device) is unknown | NUM_UNIQUE(transactions.sessions.device) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | True | False | False | 6 | False | True | False | 3 | 73 | 147.64 | ... | 834.08 | 524.919674 | 198.92 | 25.0 | -1.546156 | 217.064024 | True | False | False | 3 |
4 | False | True | False | 9 | False | True | False | 3 | 126 | 147.55 | ... | 1180.90 | 733.862898 | 193.08 | 43.0 | -1.797214 | 319.497611 | False | True | False | 3 |
3 | True | False | False | 5 | True | False | False | 2 | 64 | 148.09 | ... | 715.80 | 407.390549 | 108.69 | 23.0 | 0.353061 | 215.417211 | True | False | False | 2 |
2 | False | True | False | 8 | False | True | False | 3 | 129 | 148.34 | ... | 1100.82 | 615.714934 | 136.01 | 39.0 | -0.082021 | 315.817331 | False | True | False | 3 |
5 | True | False | False | 7 | False | True | False | 3 | 108 | 149.53 | ... | 997.48 | 584.302915 | 137.50 | 33.0 | -0.595128 | 261.535265 | False | True | False | 3 |
5 rows × 121 columns
As you can see above, we have the exact same features as before, but calculated using the test data.
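In a deployment pipeline it can be useful to check this programmatically rather than by eye. The sketch below compares two lists of column names; `compare_columns` is a hypothetical helper, and the feature names are a small sample taken from the tables above.

```python
def compare_columns(train_columns, new_columns):
    """Report differences between two sequences of feature (column) names."""
    train, new = list(train_columns), list(new_columns)
    return {
        "missing": [c for c in train if c not in new],      # expected but absent
        "unexpected": [c for c in new if c not in train],   # present but not expected
        "same_order": train == new,
    }


train_cols = ["COUNT(sessions)", "zip_code = 60091", "zip_code is unknown"]
new_cols = ["zip_code = 60091", "zip_code is unknown", "COUNT(sessions)"]

report = compare_columns(train_cols, new_cols)
print(report)
```

With the DataFrames from this notebook, the same helper could be called on the `.columns` of the training matrix `feature_matrix_enc` and of the matrix calculated from `es_test`.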
Exporting Feature Matrix
Save as csv
The feature matrix is a pandas DataFrame that we can save to disk.
[6]:
feature_matrix.to_csv("feature_matrix.csv")
We can also read it back in as follows:
[7]:
import pandas as pd
saved_fm = pd.read_csv("feature_matrix.csv", index_col="customer_id")
saved_fm
[7]:
customer_id | zip_code = 60091 | zip_code = 13244 | zip_code is unknown | COUNT(sessions) | MODE(sessions.device) = mobile | MODE(sessions.device) = desktop | MODE(sessions.device) is unknown | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | ... | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.NUM_UNIQUE(transactions.product_id)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | MODE(transactions.sessions.device) = mobile | MODE(transactions.sessions.device) = desktop | MODE(transactions.sessions.device) is unknown | NUM_UNIQUE(transactions.sessions.device) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | True | False | False | 6 | False | True | False | 3 | 73 | 147.64 | ... | 834.08 | 524.919674 | 198.92 | 25.0 | -1.546156 | 217.064024 | True | False | False | 3 |
4 | False | True | False | 9 | False | True | False | 3 | 126 | 147.55 | ... | 1180.90 | 733.862898 | 193.08 | 43.0 | -1.797214 | 319.497611 | False | True | False | 3 |
3 | True | False | False | 5 | True | False | False | 2 | 64 | 148.09 | ... | 715.80 | 407.390549 | 108.69 | 23.0 | 0.353061 | 215.417211 | True | False | False | 2 |
2 | False | True | False | 8 | False | True | False | 3 | 129 | 148.34 | ... | 1100.82 | 615.714934 | 136.01 | 39.0 | -0.082021 | 315.817331 | False | True | False | 3 |
5 | True | False | False | 7 | False | True | False | 3 | 108 | 149.53 | ... | 997.48 | 584.302915 | 137.50 | 33.0 | -0.595128 | 261.535265 | False | True | False | 3 |
5 rows × 121 columns
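CSV files store no type information, so pandas infers column types again when reading the file back. The sketch below round-trips a tiny hand-made frame through an in-memory buffer to check that integer, float, and boolean columns survive. The values are a small sample copied from the table above, not the full feature matrix.

```python
import io

import pandas as pd

# A small stand-in for the feature matrix, indexed by customer_id.
fm = pd.DataFrame(
    {
        "COUNT(sessions)": [6, 9],
        "zip_code = 60091": [True, False],
        "MAX(transactions.amount)": [147.64, 147.55],
    },
    index=pd.Index([1, 4], name="customer_id"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
fm.to_csv(buffer)
buffer.seek(0)

restored = pd.read_csv(buffer, index_col="customer_id")

# Raises an AssertionError if values, dtypes, or labels differ.
pd.testing.assert_frame_equal(fm, restored)
print(restored.dtypes)
```

Columns with richer types, such as datetimes or categoricals, would generally need to be converted explicitly after `pd.read_csv`, since that information is not stored in the CSV.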