Tuning Deep Feature Synthesis
There are several parameters that can be tuned to change the output of DFS.
In [1]: import featuretools as ft
In [2]: es = ft.demo.load_mock_customer(return_entityset=True)
In [3]: es
Out[3]:
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 4]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
Using “Seed Features”
Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis then automatically stacks new features on top of these seed features wherever it can.
By using seed features, we can include domain-specific knowledge in feature engineering automation.
In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125
In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
...: target_entity="customers",
...: agg_primitives=["percent_true"],
...: seed_features=[expensive_purchase])
...:
In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]:
PERCENT_TRUE(transactions.amount > 125)
customer_id
5 0.227848
4 0.220183
1 0.119048
3 0.182796
2 0.129032
We can now see that PERCENT_TRUE was automatically applied to this boolean variable.
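To make the stacked feature concrete, here is a minimal pure-Python sketch of what PERCENT_TRUE(transactions.amount > 125) computes for each customer. It does not use Featuretools and is not how the library implements the primitive. The toy transaction rows and the `percent_true_by_customer` helper are hypothetical and only illustrate aggregating a boolean seed feature up to the parent entity.

```python
from collections import defaultdict

# Hypothetical toy data: (customer_id, amount) pairs, not the mock customer dataset.
transactions = [
    (1, 150.0), (1, 20.0), (1, 90.0), (1, 130.0),
    (2, 50.0), (2, 200.0),
    (3, 10.0),
]


def percent_true_by_customer(rows, threshold=125):
    """Return the fraction of each customer's transactions with amount > threshold."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # The seed feature: one boolean value per transaction.
        flags[customer_id].append(amount > threshold)
    # The aggregation stacked on top: the mean of the boolean flags per customer.
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}


result = percent_true_by_customer(transactions)
print(result)
```

In this sketch the seed feature is the per-transaction boolean, and the per-customer mean plays the role of the aggregation primitive that DFS stacks on top of it.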
Add “interesting” values to variables
Sometimes we want to create features that are conditioned on a second value before the aggregation is calculated. We call this extra filter a “where clause”.
By default, where clauses are built using the interesting_values of a variable.
In [7]: es["sessions"]["device"].interesting_values = ["desktop", "mobile", "tablet"]
We then use where_primitives to specify which aggregation primitives should have where clauses built for them.
In [8]: feature_matrix, feature_defs = ft.dfs(entityset=es,
...: target_entity="customers",
...: agg_primitives=["count", "avg_time_between"],
...: where_primitives=["count", "avg_time_between"],
...: trans_primitives=[])
...:
In [9]: feature_matrix
Out[9]:
zip_code COUNT(sessions) AVG_TIME_BETWEEN(sessions.session_start) COUNT(transactions) AVG_TIME_BETWEEN(transactions.transaction_time) COUNT(sessions WHERE device = mobile) COUNT(sessions WHERE device = tablet) COUNT(sessions WHERE device = desktop) AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile) AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet) AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop) AVG_TIME_BETWEEN(transactions.sessions.session_start) COUNT(transactions WHERE sessions.device = desktop) COUNT(transactions WHERE sessions.device = mobile) COUNT(transactions WHERE sessions.device = tablet) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet)
customer_id
5 60091 6 5577.000000 79 363.333333 3 1 2 13942.500000 NaN 9685.0 357.500000 29 36 14 345.892857 796.714286 0.000000 376.071429 809.714286 65.000000
4 60091 8 2516.428571 109 168.518519 4 1 3 3336.666667 NaN 4127.5 163.101852 38 53 18 223.108108 192.500000 0.000000 238.918919 206.250000 65.000000
1 60091 8 3305.714286 126 192.920000 3 3 2 11570.000000 8807.5 7150.0 185.120000 27 56 43 275.000000 420.727273 419.404762 302.500000 438.454545 442.619048
3 13244 6 5096.000000 93 287.554348 1 1 4 NaN NaN 4745.0 276.956522 62 16 15 233.360656 0.000000 0.000000 251.475410 65.000000 65.000000
2 13244 7 4907.500000 93 328.532609 2 2 3 1690.000000 5330.0 6890.0 320.054348 34 31 28 417.575758 56.333333 197.407407 435.303030 82.333333 226.296296
We now have several new, potentially useful features. For example, the two features below tell us how many sessions a customer completed on a tablet and the average time between those sessions.
In [10]: feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]
Out[10]:
COUNT(sessions WHERE device = tablet) AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id
5 1 NaN
4 1 NaN
1 3 8807.5
3 1 NaN
2 2 5330.0
We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions.
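The NaN values follow from how a conditioned aggregation works: rows are filtered first, and an average gap needs at least two rows to be defined. The following pure-Python sketch illustrates this under stated assumptions. The session rows and the helpers `count_where` and `avg_time_between_where` are hypothetical, and the gap is assumed to be the mean number of seconds between consecutive matching sessions, which may differ in detail from the Featuretools primitive.

```python
from datetime import datetime
from math import isnan

# Hypothetical sessions: (customer_id, device, session_start); not the demo data.
sessions = [
    (1, "tablet", datetime(2014, 1, 1, 0, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 1, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 3, 0, 0)),
    (1, "mobile", datetime(2014, 1, 1, 2, 0, 0)),
    (2, "tablet", datetime(2014, 1, 1, 0, 30, 0)),
    (2, "desktop", datetime(2014, 1, 1, 4, 0, 0)),
]


def count_where(rows, customer_id, device):
    """Count one customer's sessions, keeping only rows that match the device."""
    return sum(1 for cid, dev, _ in rows if cid == customer_id and dev == device)


def avg_time_between_where(rows, customer_id, device):
    """Mean gap in seconds between consecutive matching sessions; NaN if fewer than two."""
    starts = sorted(t for cid, dev, t in rows if cid == customer_id and dev == device)
    if len(starts) < 2:
        # With 0 or 1 matching sessions there is no gap to average.
        return float("nan")
    gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)


tablet_count_1 = count_where(sessions, 1, "tablet")
tablet_gap_1 = avg_time_between_where(sessions, 1, "tablet")
tablet_count_2 = count_where(sessions, 2, "tablet")
tablet_gap_2 = avg_time_between_where(sessions, 2, "tablet")

print(tablet_count_1, tablet_gap_1)
print(tablet_count_2, isnan(tablet_gap_2))
```

The where clause is applied before the aggregation, so a customer with a single tablet session still gets a count of 1 but has no defined average gap.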
Encoding categorical features
Machine learning algorithms typically expect all numeric data. When Deep Feature Synthesis generates categorical features, we need to encode them.
In [11]: feature_matrix, feature_defs = ft.dfs(entityset=es,
....: target_entity="customers",
....: agg_primitives=["mode"],
....: max_depth=1)
....:
In [12]: feature_matrix
Out[12]:
zip_code MODE(sessions.device) DAY(date_of_birth) DAY(join_date) YEAR(date_of_birth) YEAR(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date)
customer_id
5 60091 mobile 28 17 1984 2010 7 7 5 5
4 60091 mobile 15 8 2006 2011 8 4 1 4
1 60091 mobile 18 17 1994 2011 7 4 0 6
3 13244 desktop 21 13 2003 2011 11 8 4 5
2 13244 desktop 18 15 1986 2012 8 4 0 6
This feature matrix contains 2 categorical variables, zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.
In [13]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
In [14]: feature_matrix_enc
Out[14]:
zip_code = 60091 zip_code = 13244 zip_code is unknown MODE(sessions.device) = mobile MODE(sessions.device) = desktop MODE(sessions.device) is unknown DAY(date_of_birth) = 18 DAY(date_of_birth) = 28 DAY(date_of_birth) = 21 DAY(date_of_birth) = 15 DAY(date_of_birth) is unknown DAY(join_date) = 17 DAY(join_date) = 15 DAY(join_date) = 13 DAY(join_date) = 8 DAY(join_date) is unknown YEAR(date_of_birth) = 2006 YEAR(date_of_birth) = 2003 YEAR(date_of_birth) = 1994 YEAR(date_of_birth) = 1986 YEAR(date_of_birth) = 1984 YEAR(date_of_birth) is unknown YEAR(join_date) = 2011 YEAR(join_date) = 2012 YEAR(join_date) = 2010 YEAR(join_date) is unknown MONTH(date_of_birth) = 8 MONTH(date_of_birth) = 7 MONTH(date_of_birth) = 11 MONTH(date_of_birth) is unknown MONTH(join_date) = 4 MONTH(join_date) = 8 MONTH(join_date) = 7 MONTH(join_date) is unknown WEEKDAY(date_of_birth) = 0 WEEKDAY(date_of_birth) = 5 WEEKDAY(date_of_birth) = 4 WEEKDAY(date_of_birth) = 1 WEEKDAY(date_of_birth) is unknown WEEKDAY(join_date) = 6 WEEKDAY(join_date) = 5 WEEKDAY(join_date) = 4 WEEKDAY(join_date) is unknown
customer_id
5 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0
1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
3 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0
2 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
The returned feature matrix is now all numeric. Additionally, we get a new set of feature definitions that contain the encoded values.
In [15]: print(features_enc)
[<Feature: zip_code = 60091>, <Feature: zip_code = 13244>, <Feature: zip_code is unknown>, <Feature: MODE(sessions.device) = mobile>, <Feature: MODE(sessions.device) = desktop>, <Feature: MODE(sessions.device) is unknown>, <Feature: DAY(date_of_birth) = 18>, <Feature: DAY(date_of_birth) = 28>, <Feature: DAY(date_of_birth) = 21>, <Feature: DAY(date_of_birth) = 15>, <Feature: DAY(date_of_birth) is unknown>, <Feature: DAY(join_date) = 17>, <Feature: DAY(join_date) = 15>, <Feature: DAY(join_date) = 13>, <Feature: DAY(join_date) = 8>, <Feature: DAY(join_date) is unknown>, <Feature: YEAR(date_of_birth) = 2006>, <Feature: YEAR(date_of_birth) = 2003>, <Feature: YEAR(date_of_birth) = 1994>, <Feature: YEAR(date_of_birth) = 1986>, <Feature: YEAR(date_of_birth) = 1984>, <Feature: YEAR(date_of_birth) is unknown>, <Feature: YEAR(join_date) = 2011>, <Feature: YEAR(join_date) = 2012>, <Feature: YEAR(join_date) = 2010>, <Feature: YEAR(join_date) is unknown>, <Feature: MONTH(date_of_birth) = 8>, <Feature: MONTH(date_of_birth) = 7>, <Feature: MONTH(date_of_birth) = 11>, <Feature: MONTH(date_of_birth) is unknown>, <Feature: MONTH(join_date) = 4>, <Feature: MONTH(join_date) = 8>, <Feature: MONTH(join_date) = 7>, <Feature: MONTH(join_date) is unknown>, <Feature: WEEKDAY(date_of_birth) = 0>, <Feature: WEEKDAY(date_of_birth) = 5>, <Feature: WEEKDAY(date_of_birth) = 4>, <Feature: WEEKDAY(date_of_birth) = 1>, <Feature: WEEKDAY(date_of_birth) is unknown>, <Feature: WEEKDAY(join_date) = 6>, <Feature: WEEKDAY(join_date) = 5>, <Feature: WEEKDAY(join_date) = 4>, <Feature: WEEKDAY(join_date) is unknown>]
These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read Deployment.
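The idea of reusing an encoding on new data can be sketched in plain Python. This is an illustration only, not the Featuretools implementation: the helpers `fit_one_hot` and `apply_one_hot` are hypothetical, the categories are kept in first-seen order (an assumption), and the new values are made up. The training values are the MODE(sessions.device) column shown above.

```python
def fit_one_hot(values):
    """Record the distinct categories seen in the training column, in first-seen order."""
    categories = []
    for value in values:
        if value is not None and value not in categories:
            categories.append(value)
    return categories


def apply_one_hot(values, categories, name):
    """Encode values against a fixed category list, plus an 'is unknown' column."""
    columns = [f"{name} = {c}" for c in categories] + [f"{name} is unknown"]
    rows = []
    for value in values:
        row = [1 if value == c else 0 for c in categories]
        # Anything not seen during fitting falls into the 'is unknown' column.
        row.append(0 if value in categories else 1)
        rows.append(row)
    return columns, rows


# MODE(sessions.device) values from the feature matrix above.
train_devices = ["mobile", "mobile", "mobile", "desktop", "desktop"]
categories = fit_one_hot(train_devices)

# Hypothetical new data, including a value never seen during fitting.
new_devices = ["desktop", "tablet"]
columns, new_rows = apply_one_hot(new_devices, categories, "MODE(sessions.device)")

print(columns)
print(new_rows)
```

Because the category list is fixed at fitting time, new data always produces the same columns, and values that were not seen before are flagged as unknown instead of creating new columns.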