NOTICE
The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:
pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip
For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.
There are several parameters that can be tuned to change the output of DFS.
In [1]: import featuretools as ft

In [2]: es = ft.demo.load_mock_customer(return_entityset=True)

In [3]: es
Out[3]: 
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 4]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
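Before tuning DFS, it can help to look at the data behind one of these entities. A minimal sketch, assuming the pre-1.0 Entity API with the .variables and .df attributes:

# Inspect one entity from the mock customer EntitySet (pre-1.0 API assumed)
sessions = es["sessions"]

# List the variables (columns) Featuretools knows about for this entity
print(sessions.variables)

# Peek at the raw data stored for the entity
print(sessions.df.head())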
Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of them whenever it can.
By using seed features, we can include domain-specific knowledge in automated feature engineering.
In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125

In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["percent_true"],
   ...:                                       seed_features=[expensive_purchase])
   ...: 

In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]: 
             PERCENT_TRUE(transactions.amount > 125)
customer_id                                         
5                                            0.227848
4                                            0.220183
1                                            0.119048
3                                            0.182796
2                                            0.129032
We can now see that PERCENT_TRUE was automatically applied to the boolean values created by our seed feature.
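Multiple seed features can be passed at once, and DFS will stack primitives on each of them. A sketch, where cheap_purchase and its 10-dollar threshold are illustrative assumptions rather than part of the example above:

# Define a second, hypothetical seed feature using the same comparison syntax
cheap_purchase = ft.Feature(es["transactions"]["amount"]) < 10

# Pass both seed features to DFS so PERCENT_TRUE can stack on each of them
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      agg_primitives=["percent_true"],
                                      seed_features=[expensive_purchase, cheap_purchase])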
Sometimes we want to create features that are conditioned on a second value before they are calculated. We call this extra filter a “where clause”.
By default, where clauses are built using the interesting_values of a variable.
In [7]: es["sessions"]["device"].interesting_values = ["desktop", "mobile", "tablet"]
We then use the where_primitives parameter to specify which aggregation primitives should have where clauses built for them:
In [8]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["count", "avg_time_between"],
   ...:                                       where_primitives=["count", "avg_time_between"],
   ...:                                       trans_primitives=[])
   ...: 

In [9]: feature_matrix
Out[9]: 
             zip_code  AVG_TIME_BETWEEN(sessions.session_start)  COUNT(sessions)  \
customer_id
5               60091                                5577.000000                6
4               60091                                2516.428571                8
1               60091                                3305.714286                8
3               13244                                5096.000000                6
2               13244                                4907.500000                7

             AVG_TIME_BETWEEN(transactions.transaction_time)  COUNT(transactions)  \
customer_id
5                                                  363.333333                   79
4                                                  168.518519                  109
1                                                  192.920000                  126
3                                                  287.554348                   93
2                                                  328.532609                   93

             AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile)  \
customer_id
5                                                                      9685.0                                                     13942.500000
4                                                                      4127.5                                                      3336.666667
1                                                                      7150.0                                                     11570.000000
3                                                                      4745.0                                                              NaN
2                                                                      6890.0                                                      1690.000000

             AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)  COUNT(sessions WHERE device = desktop)  COUNT(sessions WHERE device = mobile)  \
customer_id
5                                                                        NaN                                       2                                      3
4                                                                        NaN                                       3                                      4
1                                                                     8807.5                                       2                                      3
3                                                                        NaN                                       4                                      1
2                                                                     5330.0                                       3                                      2

             COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(transactions.sessions.session_start)  \
customer_id
5                                                1                                              357.500000
4                                                1                                              163.101852
1                                                3                                              185.120000
3                                                1                                              276.956522
2                                                2                                              320.054348

             AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop)  AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet)  \
customer_id
5                                                                                        345.892857                                                                               0.000000
4                                                                                        223.108108                                                                               0.000000
1                                                                                        275.000000                                                                             419.404762
3                                                                                        233.360656                                                                               0.000000
2                                                                                        417.575758                                                                             197.407407

             AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile)  AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop)  \
customer_id
5                                                                                       796.714286                                                                        376.071429
4                                                                                       192.500000                                                                        238.918919
1                                                                                       420.727273                                                                        302.500000
3                                                                                         0.000000                                                                        251.475410
2                                                                                        56.333333                                                                        435.303030

             AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet)  AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile)  \
customer_id
5                                                                                  65.000000                                                                       809.714286
4                                                                                  65.000000                                                                       206.250000
1                                                                                 442.619048                                                                       438.454545
3                                                                                  65.000000                                                                        65.000000
2                                                                                 226.296296                                                                        82.333333

             COUNT(transactions WHERE sessions.device = desktop)  COUNT(transactions WHERE sessions.device = tablet)  COUNT(transactions WHERE sessions.device = mobile)
customer_id
5                                                              29                                                   14                                                  36
4                                                              38                                                   18                                                  53
1                                                              27                                                   43                                                  56
3                                                              62                                                   15                                                  16
2                                                              34                                                   28                                                  31
Now, we have several new potentially useful features. For example, the two features below tell us how many sessions a customer completed on a tablet, and the time between those sessions.
In [10]: feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]
Out[10]: 
             COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id
5                                                1                                                              NaN
4                                                1                                                              NaN
1                                                3                                                           8807.5
3                                                1                                                              NaN
2                                                2                                                           5330.0
We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions.
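If the downstream model cannot handle missing values, the feature matrix can be post-processed with ordinary pandas operations. A sketch, where the choice of fill value is an assumption and depends on what makes sense for the model:

# Replace missing values in the feature matrix before modeling.
# Filling with 0 is only one option; a sentinel value or imputation may fit better.
feature_matrix_filled = feature_matrix.fillna(0)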
Machine learning algorithms typically expect all numeric data. When Deep Feature Synthesis generates categorical features, we need to encode them.
In [11]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="customers",
   ....:                                       agg_primitives=["mode"],
   ....:                                       max_depth=1)
   ....: 

In [12]: feature_matrix
Out[12]: 
            zip_code MODE(sessions.device)  DAY(date_of_birth)  DAY(join_date)  MONTH(date_of_birth)  \
customer_id
5              60091                mobile                  28              17                     7
4              60091                mobile                  15               8                     8
1              60091                mobile                  18              17                     7
3              13244               desktop                  21              13                    11
2              13244               desktop                  18              15                     8

             MONTH(join_date)  WEEKDAY(date_of_birth)  WEEKDAY(join_date)  YEAR(date_of_birth)  YEAR(join_date)
customer_id
5                           7                       5                   5                 1984             2010
4                           4                       1                   4                 2006             2011
1                           4                       0                   6                 1994             2011
3                           8                       4                   5                 2003             2011
2                           4                       0                   6                 1986             2012
This feature matrix contains two categorical variables, zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.
In [13]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)

In [14]: feature_matrix_enc
Out[14]: 
             zip_code = 60091  zip_code = 13244  zip_code is unknown  \
customer_id
5                        True             False                False
4                        True             False                False
1                        True             False                False
3                       False              True                False
2                       False              True                False

             MODE(sessions.device) = mobile  MODE(sessions.device) = desktop  MODE(sessions.device) is unknown  \
customer_id
5                                      True                            False                             False
4                                      True                            False                             False
1                                      True                            False                             False
3                                     False                             True                             False
2                                     False                             True                             False

             DAY(date_of_birth) = 18  DAY(date_of_birth) = 28  DAY(date_of_birth) = 21  DAY(date_of_birth) = 15  DAY(date_of_birth) is unknown  \
customer_id
5                              False                     True                    False                    False                          False
4                              False                    False                    False                     True                          False
1                               True                    False                    False                    False                          False
3                              False                    False                     True                    False                          False
2                               True                    False                    False                    False                          False

             DAY(join_date) = 17  DAY(join_date) = 15  DAY(join_date) = 13  DAY(join_date) = 8  DAY(join_date) is unknown  \
customer_id
5                           True                False                False               False                      False
4                          False                False                False                True                      False
1                           True                False                False               False                      False
3                          False                False                 True               False                      False
2                          False                 True                False               False                      False

             MONTH(date_of_birth) = 8  MONTH(date_of_birth) = 7  MONTH(date_of_birth) = 11  MONTH(date_of_birth) is unknown  \
customer_id
5                               False                      True                      False                            False
4                                True                     False                      False                            False
1                               False                      True                      False                            False
3                               False                     False                       True                            False
2                                True                     False                      False                            False

             MONTH(join_date) = 4  MONTH(join_date) = 8  MONTH(join_date) = 7  MONTH(join_date) is unknown  \
customer_id
5                           False                 False                  True                        False
4                            True                 False                 False                        False
1                            True                 False                 False                        False
3                           False                  True                 False                        False
2                            True                 False                 False                        False

             WEEKDAY(date_of_birth) = 0  WEEKDAY(date_of_birth) = 5  WEEKDAY(date_of_birth) = 4  WEEKDAY(date_of_birth) = 1  WEEKDAY(date_of_birth) is unknown  \
customer_id
5                                 False                        True                       False                       False                              False
4                                 False                       False                       False                        True                              False
1                                  True                       False                       False                       False                              False
3                                 False                       False                        True                       False                              False
2                                  True                       False                       False                       False                              False

             WEEKDAY(join_date) = 6  WEEKDAY(join_date) = 5  WEEKDAY(join_date) = 4  WEEKDAY(join_date) is unknown  \
customer_id
5                             False                    True                   False                          False
4                             False                   False                    True                          False
1                              True                   False                   False                          False
3                             False                    True                   False                          False
2                              True                   False                   False                          False

             YEAR(date_of_birth) = 2006  YEAR(date_of_birth) = 2003  YEAR(date_of_birth) = 1994  YEAR(date_of_birth) = 1986  YEAR(date_of_birth) = 1984  YEAR(date_of_birth) is unknown  \
customer_id
5                                 False                       False                       False                       False                        True                           False
4                                  True                       False                       False                       False                       False                           False
1                                 False                       False                        True                       False                       False                           False
3                                 False                        True                       False                       False                       False                           False
2                                 False                       False                       False                        True                       False                           False

             YEAR(join_date) = 2011  YEAR(join_date) = 2012  YEAR(join_date) = 2010  YEAR(join_date) is unknown
customer_id
5                             False                   False                    True                       False
4                              True                   False                   False                       False
1                              True                   False                   False                       False
3                              True                   False                   False                       False
2                             False                    True                   False                       False
The returned feature matrix is now all numeric. Additionally, we get a new set of feature definitions that contain the encoded values.
In [15]: print(features_enc)
[<Feature: zip_code = 60091>, <Feature: zip_code = 13244>, <Feature: zip_code is unknown>, <Feature: MODE(sessions.device) = mobile>, <Feature: MODE(sessions.device) = desktop>, <Feature: MODE(sessions.device) is unknown>, <Feature: DAY(date_of_birth) = 18>, <Feature: DAY(date_of_birth) = 28>, <Feature: DAY(date_of_birth) = 21>, <Feature: DAY(date_of_birth) = 15>, <Feature: DAY(date_of_birth) is unknown>, <Feature: DAY(join_date) = 17>, <Feature: DAY(join_date) = 15>, <Feature: DAY(join_date) = 13>, <Feature: DAY(join_date) = 8>, <Feature: DAY(join_date) is unknown>, <Feature: MONTH(date_of_birth) = 8>, <Feature: MONTH(date_of_birth) = 7>, <Feature: MONTH(date_of_birth) = 11>, <Feature: MONTH(date_of_birth) is unknown>, <Feature: MONTH(join_date) = 4>, <Feature: MONTH(join_date) = 8>, <Feature: MONTH(join_date) = 7>, <Feature: MONTH(join_date) is unknown>, <Feature: WEEKDAY(date_of_birth) = 0>, <Feature: WEEKDAY(date_of_birth) = 5>, <Feature: WEEKDAY(date_of_birth) = 4>, <Feature: WEEKDAY(date_of_birth) = 1>, <Feature: WEEKDAY(date_of_birth) is unknown>, <Feature: WEEKDAY(join_date) = 6>, <Feature: WEEKDAY(join_date) = 5>, <Feature: WEEKDAY(join_date) = 4>, <Feature: WEEKDAY(join_date) is unknown>, <Feature: YEAR(date_of_birth) = 2006>, <Feature: YEAR(date_of_birth) = 2003>, <Feature: YEAR(date_of_birth) = 1994>, <Feature: YEAR(date_of_birth) = 1986>, <Feature: YEAR(date_of_birth) = 1984>, <Feature: YEAR(date_of_birth) is unknown>, <Feature: YEAR(join_date) = 2011>, <Feature: YEAR(join_date) = 2012>, <Feature: YEAR(join_date) = 2010>, <Feature: YEAR(join_date) is unknown>]
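encode_features also accepts options that control how many categories get encoded per feature. A sketch, assuming the top_n and include_unknown parameters of ft.encode_features (the values chosen here are illustrative):

# Keep only the 2 most frequent values per categorical feature and
# skip the "is unknown" columns
feature_matrix_enc_small, features_enc_small = ft.encode_features(feature_matrix,
                                                                  feature_defs,
                                                                  top_n=2,
                                                                  include_unknown=False)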
These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read Deployment.
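For example, the encoded feature definitions can be recalculated on a new EntitySet that follows the same schema. A sketch, where new_es is a hypothetical EntitySet built from new data:

# Recompute the same encoded features on new data.
# new_es is assumed to have the same entities and relationships as es.
feature_matrix_new = ft.calculate_feature_matrix(features=features_enc,
                                                 entityset=new_es)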