Tuning Deep Feature Synthesis
There are several parameters that can be tuned to change the output of DFS.
In [1]: import featuretools as ft
In [2]: es = ft.demo.load_mock_customer(return_entityset=True)
In [3]: es
Out[3]:
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 4]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
Using “Seed Features”
Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.
By using seed features, we can include domain-specific knowledge in feature engineering automation.
In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125
In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["percent_true"],
   ...:                                       seed_features=[expensive_purchase])
   ...:
In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]:
             PERCENT_TRUE(transactions.amount > 125)
customer_id
5                                           0.227848
4                                           0.220183
1                                           0.119048
3                                           0.182796
2                                           0.129032
We can now see that PERCENT_TRUE was automatically applied to this boolean variable.
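To make the stacking concrete, here is a minimal standard-library sketch of what a feature like PERCENT_TRUE(transactions.amount > 125) computes. It is not how Featuretools implements the primitive: the toy rows, the helper name percent_true_by_customer, and the direct customer_id column on each transaction are all assumptions made for illustration (in the mock dataset, transactions reach customers through sessions).

```python
from collections import defaultdict

# Hypothetical toy rows of (customer_id, amount); not the mock customer data.
transactions = [
    (1, 150.0), (1, 20.0), (1, 80.0), (1, 130.0),
    (2, 10.0), (2, 140.0),
    (3, 50.0),
]


def percent_true_by_customer(rows, threshold=125):
    """Return the fraction of each customer's transactions above threshold."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # The "seed feature": a boolean defined on every transaction.
        flags[customer_id].append(amount > threshold)
    # The aggregation stacked on top: the share of True values per customer.
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}


result = percent_true_by_customer(transactions)
print(result)  # {1: 0.5, 2: 0.5, 3: 0.0}
```

The seed feature carries the domain knowledge (the 125 threshold), while the aggregation step is generic and could be swapped for any primitive that accepts boolean input.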
Add “interesting” values to variables
Sometimes we want to create features that are computed only over rows that satisfy a condition on a second value. We call this extra filter a “where clause”.
By default, where clauses are built using the interesting_values of a variable.
In [7]: es["sessions"]["device"].interesting_values = ["desktop", "mobile", "tablet"]
We then use where_primitives to specify which aggregation primitives should be given where clauses:
In [8]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["count", "avg_time_between"],
   ...:                                       where_primitives=["count", "avg_time_between"],
   ...:                                       trans_primitives=[])
   ...:
In [9]: feature_matrix
Out[9]:
zip_code AVG_TIME_BETWEEN(sessions.session_start) COUNT(sessions) AVG_TIME_BETWEEN(transactions.transaction_time) COUNT(transactions) AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile) AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet) AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop) COUNT(sessions WHERE device = mobile) COUNT(sessions WHERE device = tablet) COUNT(sessions WHERE device = desktop) AVG_TIME_BETWEEN(transactions.sessions.session_start) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop) COUNT(transactions WHERE sessions.device = mobile) COUNT(transactions WHERE sessions.device = tablet) COUNT(transactions WHERE sessions.device = desktop)
customer_id
5 60091 5577.000000 6 363.333333 79 13942.500000 NaN 9685.0 3 1 2 357.500000 796.714286 0.000000 345.892857 809.714286 65.000000 376.071429 36 14 29
4 60091 2516.428571 8 168.518519 109 3336.666667 NaN 4127.5 4 1 3 163.101852 192.500000 0.000000 223.108108 206.250000 65.000000 238.918919 53 18 38
1 60091 3305.714286 8 192.920000 126 11570.000000 8807.5 7150.0 3 3 2 185.120000 420.727273 419.404762 275.000000 438.454545 442.619048 302.500000 56 43 27
3 13244 5096.000000 6 287.554348 93 NaN NaN 4745.0 1 1 4 276.956522 0.000000 0.000000 233.360656 65.000000 65.000000 251.475410 16 15 62
2 13244 4907.500000 7 328.532609 93 1690.000000 5330.0 6890.0 2 2 3 320.054348 56.333333 197.407407 417.575758 82.333333 226.296296 435.303030 31 28 34
We now have several new, potentially useful features. For example, the two features below tell us how many sessions a customer completed on a tablet and the average time between those sessions.
In [10]: feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]
Out[10]:
             COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id
5                                                1                                                             NaN
4                                                1                                                             NaN
1                                                3                                                          8807.5
3                                                1                                                             NaN
2                                                2                                                          5330.0
We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions, because at least two sessions are needed to measure a gap.
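The effect of a where clause can be sketched in plain Python: filter the child rows by the interesting value first, then aggregate what remains. The toy sessions and the helper names count_where and avg_time_between_where below are hypothetical, and the sketch only approximates the behaviour of the real primitives.

```python
import math
from datetime import datetime

# Hypothetical toy sessions of (customer_id, device, session_start).
sessions = [
    (1, "tablet", datetime(2014, 1, 1, 0, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 1, 0, 0)),
    (1, "mobile", datetime(2014, 1, 1, 2, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 3, 0, 0)),
    (2, "tablet", datetime(2014, 1, 1, 0, 30, 0)),
    (2, "desktop", datetime(2014, 1, 1, 4, 0, 0)),
]


def count_where(rows, customer_id, device):
    """Count a customer's sessions that match the where clause."""
    return sum(1 for cid, dev, _ in rows if cid == customer_id and dev == device)


def avg_time_between_where(rows, customer_id, device):
    """Mean gap in seconds between matching sessions; NaN with fewer than two."""
    starts = sorted(
        start for cid, dev, start in rows if cid == customer_id and dev == device
    )
    if len(starts) < 2:
        return math.nan  # no gap can be measured
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)


print(count_where(sessions, 1, "tablet"))             # 3
print(avg_time_between_where(sessions, 1, "tablet"))  # 5400.0
print(avg_time_between_where(sessions, 2, "tablet"))  # nan
```

Customer 2 has a single tablet session, so the sketch returns NaN for the gap, mirroring the NaN entries in the feature matrix above.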
Encoding categorical features
Machine learning algorithms typically expect all numeric data. When Deep Feature Synthesis generates categorical features, we need to encode them.
In [11]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="customers",
   ....:                                       agg_primitives=["mode"],
   ....:                                       max_depth=1)
   ....:
In [12]: feature_matrix
Out[12]:
zip_code MODE(sessions.device) DAY(date_of_birth) DAY(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date) YEAR(date_of_birth) YEAR(join_date)
customer_id
5 60091 mobile 28 17 7 7 5 5 1984 2010
4 60091 mobile 15 8 8 4 1 4 2006 2011
1 60091 mobile 18 17 7 4 0 6 1994 2011
3 13244 desktop 21 13 11 8 4 5 2003 2011
2 13244 desktop 18 15 8 4 0 6 1986 2012
This feature matrix contains 2 categorical variables, zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one hot encoding to the output of DFS.
In [13]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
In [14]: feature_matrix_enc
Out[14]:
zip_code = 60091 zip_code = 13244 zip_code is unknown MODE(sessions.device) = mobile MODE(sessions.device) = desktop MODE(sessions.device) is unknown DAY(date_of_birth) = 18 DAY(date_of_birth) = 28 DAY(date_of_birth) = 21 DAY(date_of_birth) = 15 DAY(date_of_birth) is unknown DAY(join_date) = 17 DAY(join_date) = 15 DAY(join_date) = 13 DAY(join_date) = 8 DAY(join_date) is unknown MONTH(date_of_birth) = 8 MONTH(date_of_birth) = 7 MONTH(date_of_birth) = 11 MONTH(date_of_birth) is unknown MONTH(join_date) = 4 MONTH(join_date) = 8 MONTH(join_date) = 7 MONTH(join_date) is unknown WEEKDAY(date_of_birth) = 0 WEEKDAY(date_of_birth) = 5 WEEKDAY(date_of_birth) = 4 WEEKDAY(date_of_birth) = 1 WEEKDAY(date_of_birth) is unknown WEEKDAY(join_date) = 6 WEEKDAY(join_date) = 5 WEEKDAY(join_date) = 4 WEEKDAY(join_date) is unknown YEAR(date_of_birth) = 2006 YEAR(date_of_birth) = 2003 YEAR(date_of_birth) = 1994 YEAR(date_of_birth) = 1986 YEAR(date_of_birth) = 1984 YEAR(date_of_birth) is unknown YEAR(join_date) = 2011 YEAR(join_date) = 2012 YEAR(join_date) = 2010 YEAR(join_date) is unknown
customer_id
5 True False False True False False False True False False False True False False False False False True False False False False True False False True False False False False True False False False False False False True False False False True False
4 True False False True False False False False False True False False False False True False True False False False True False False False False False False True False False False True False True False False False False False True False False False
1 True False False True False False True False False False False True False False False False False True False False True False False False True False False False False True False False False False False True False False False True False False False
3 False True False False True False False False True False False False False True False False False False True False False True False False False False True False False False True False False False True False False False False True False False False
2 False True False False True False True False False False False False True False False False True False False False True False False False True False False False False True False False False False False False True False False False True False False
The returned feature matrix now consists entirely of one-hot indicator columns, which can be used as numeric input. Additionally, we get a new set of feature definitions that contain the encoded values.
In [15]: print(features_enc)
[<Feature: zip_code = 60091>, <Feature: zip_code = 13244>, <Feature: zip_code is unknown>, <Feature: MODE(sessions.device) = mobile>, <Feature: MODE(sessions.device) = desktop>, <Feature: MODE(sessions.device) is unknown>, <Feature: DAY(date_of_birth) = 18>, <Feature: DAY(date_of_birth) = 28>, <Feature: DAY(date_of_birth) = 21>, <Feature: DAY(date_of_birth) = 15>, <Feature: DAY(date_of_birth) is unknown>, <Feature: DAY(join_date) = 17>, <Feature: DAY(join_date) = 15>, <Feature: DAY(join_date) = 13>, <Feature: DAY(join_date) = 8>, <Feature: DAY(join_date) is unknown>, <Feature: MONTH(date_of_birth) = 8>, <Feature: MONTH(date_of_birth) = 7>, <Feature: MONTH(date_of_birth) = 11>, <Feature: MONTH(date_of_birth) is unknown>, <Feature: MONTH(join_date) = 4>, <Feature: MONTH(join_date) = 8>, <Feature: MONTH(join_date) = 7>, <Feature: MONTH(join_date) is unknown>, <Feature: WEEKDAY(date_of_birth) = 0>, <Feature: WEEKDAY(date_of_birth) = 5>, <Feature: WEEKDAY(date_of_birth) = 4>, <Feature: WEEKDAY(date_of_birth) = 1>, <Feature: WEEKDAY(date_of_birth) is unknown>, <Feature: WEEKDAY(join_date) = 6>, <Feature: WEEKDAY(join_date) = 5>, <Feature: WEEKDAY(join_date) = 4>, <Feature: WEEKDAY(join_date) is unknown>, <Feature: YEAR(date_of_birth) = 2006>, <Feature: YEAR(date_of_birth) = 2003>, <Feature: YEAR(date_of_birth) = 1994>, <Feature: YEAR(date_of_birth) = 1986>, <Feature: YEAR(date_of_birth) = 1984>, <Feature: YEAR(date_of_birth) is unknown>, <Feature: YEAR(join_date) = 2011>, <Feature: YEAR(join_date) = 2012>, <Feature: YEAR(join_date) = 2010>, <Feature: YEAR(join_date) is unknown>]
These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read Deployment.
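The idea of reusing an encoding on new data can be illustrated with a small standard-library sketch. The categories are fixed from the original values, and any value not seen at that point falls into the "is unknown" column. The helper names fit_one_hot and apply_one_hot are hypothetical, and keeping categories in first-seen order is an assumption of this sketch rather than a statement about how ft.encode_features orders its columns.

```python
def fit_one_hot(values):
    """Record the distinct categories, in first-seen order."""
    return list(dict.fromkeys(values))


def apply_one_hot(name, value, categories):
    """Encode one value against a fixed category list, with an unknown column."""
    row = {f"{name} = {category}": value == category for category in categories}
    row[f"{name} is unknown"] = value not in categories
    return row


# MODE(sessions.device) values taken from the feature matrix shown earlier.
train_devices = ["mobile", "mobile", "mobile", "desktop", "desktop"]
categories = fit_one_hot(train_devices)  # ['mobile', 'desktop']

# New data: one known device, and one that was never seen when fitting.
encoded_new = [
    apply_one_hot("MODE(sessions.device)", device, categories)
    for device in ["desktop", "tablet"]
]
for row in encoded_new:
    print(row)
```

Because the category list is stored rather than recomputed, new data always produces the same columns as the original matrix, which is what a trained model needs at prediction time.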