There are several parameters that can be tuned to change the output of DFS.
In [1]: import featuretools as ft

In [2]: es = ft.demo.load_mock_customer(return_entityset=True)

In [3]: es
Out[3]:
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 4]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.
By using seed features, we can include domain-specific knowledge in feature engineering automation.
In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125

In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["percent_true"],
   ...:                                       seed_features=[expensive_purchase])
   ...:

In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]:
             PERCENT_TRUE(transactions.amount > 125)
customer_id
5                                           0.227848
4                                           0.220183
1                                           0.119048
3                                           0.182796
2                                           0.129032
We can now see that PERCENT_TRUE was automatically applied to this boolean variable.
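To make the stacking step concrete, here is a minimal pure-Python sketch of what PERCENT_TRUE computes once it is applied to a boolean seed feature. It does not call Featuretools: the toy transactions list and the percent_true_by_customer helper are hypothetical names introduced only for illustration, and the numbers are not taken from the mock customer dataset.

```python
from collections import defaultdict

# Hypothetical toy data: (customer_id, amount) pairs, not the mock customer dataset.
transactions = [
    (1, 150.0), (1, 40.0), (1, 90.0), (1, 130.0),
    (2, 20.0), (2, 200.0),
    (3, 10.0),
]


def percent_true_by_customer(rows, threshold=125):
    """Evaluate the seed feature (amount > threshold) for each transaction,
    then aggregate it per customer as the fraction of True values."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        flags[customer_id].append(amount > threshold)  # the boolean seed feature
    # sum() counts the True values because True == 1 in Python
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}


result = percent_true_by_customer(transactions)
print(result)  # {1: 0.5, 2: 0.5, 3: 0.0}
```

Each customer's value is the fraction of their transactions for which the seed feature is True, which is the kind of quantity reported in the PERCENT_TRUE column.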
Sometimes we want to calculate a feature only over the rows that satisfy a condition on a second variable. We call this extra filter a “where clause”.
By default, where clauses are built using the interesting_values of a variable.
In [7]: es["sessions"]["device"].interesting_values = ["desktop", "mobile", "tablet"]
We then use where_primitives to specify which aggregation primitives should have where clauses built for them.
In [8]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["count", "avg_time_between"],
   ...:                                       where_primitives=["count", "avg_time_between"],
   ...:                                       trans_primitives=[])
   ...:

In [9]: feature_matrix
Out[9]: (21 columns; shown transposed for readability, one row per feature and one column per customer_id)
customer_id                                                                                        5             4             1             3             2
zip_code                                                                                       60091         60091         60091         13244         13244
AVG_TIME_BETWEEN(sessions.session_start)                                                 5577.000000   2516.428571   3305.714286   5096.000000   4907.500000
COUNT(sessions)                                                                                    6             8             8             6             7
AVG_TIME_BETWEEN(transactions.transaction_time)                                           363.333333    168.518519    192.920000    287.554348    328.532609
COUNT(transactions)                                                                               79           109           126            93            93
AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile)                          13942.500000   3336.666667  11570.000000           NaN   1690.000000
AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop)                               9685.0        4127.5        7150.0        4745.0        6890.0
AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)                                   NaN           NaN        8807.5           NaN        5330.0
COUNT(sessions WHERE device = mobile)                                                              3             4             3             1             2
COUNT(sessions WHERE device = desktop)                                                             2             3             2             4             3
COUNT(sessions WHERE device = tablet)                                                              1             1             3             1             2
AVG_TIME_BETWEEN(transactions.sessions.session_start)                                     357.500000    163.101852    185.120000    276.956522    320.054348
AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile)      796.714286    192.500000    420.727273      0.000000     56.333333
AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet)        0.000000      0.000000    419.404762      0.000000    197.407407
AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop)     345.892857    223.108108    275.000000    233.360656    417.575758
AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile)            809.714286    206.250000    438.454545     65.000000     82.333333
AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet)             65.000000     65.000000    442.619048     65.000000    226.296296
AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop)           376.071429    238.918919    302.500000    251.475410    435.303030
COUNT(transactions WHERE sessions.device = mobile)                                                36            53            56            16            31
COUNT(transactions WHERE sessions.device = tablet)                                                14            18            43            15            28
COUNT(transactions WHERE sessions.device = desktop)                                               29            38            27            62            34
Now, we have several new potentially useful features. For example, the two features below tell us how many sessions a customer completed on a tablet, and the time between those sessions.
In [10]: feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]
Out[10]:
             COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id
5                                                1                                                             NaN
4                                                1                                                             NaN
1                                                3                                                          8807.5
3                                                1                                                             NaN
2                                                2                                                          5330.0
We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions, because at least two sessions are needed to measure a time between them.
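The following pure-Python sketch shows, under stated assumptions, how a where clause filters rows before aggregating and why those NaN values appear. It is not Featuretools code: the toy sessions list and both helper functions are hypothetical, and the sketch assumes the average time between sessions is the mean gap in seconds between consecutive session starts.

```python
from datetime import datetime

# Hypothetical toy sessions for a single customer: (device, session_start).
sessions = [
    ("tablet", datetime(2014, 1, 1, 0, 0, 0)),
    ("mobile", datetime(2014, 1, 1, 0, 10, 0)),
    ("mobile", datetime(2014, 1, 1, 0, 30, 0)),
    ("mobile", datetime(2014, 1, 1, 1, 0, 0)),
]


def count_where(rows, device):
    """COUNT restricted to rows whose device matches the where clause."""
    return sum(1 for d, _ in rows if d == device)


def avg_time_between_where(rows, device):
    """Mean gap in seconds between consecutive matching sessions.
    Returns NaN when fewer than two sessions match (assumed behaviour)."""
    times = sorted(t for d, t in rows if d == device)
    if len(times) < 2:
        return float("nan")
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


# One conditioned feature is computed per interesting value.
interesting_values = ["desktop", "mobile", "tablet"]
counts = {v: count_where(sessions, v) for v in interesting_values}
avg_gaps = {v: avg_time_between_where(sessions, v) for v in interesting_values}

print(counts)    # {'desktop': 0, 'mobile': 3, 'tablet': 1}
print(avg_gaps)  # {'desktop': nan, 'mobile': 1500.0, 'tablet': nan}
```

With zero or one matching session there is no gap to average, so the sketch returns NaN for desktop and tablet, mirroring the pattern in the feature matrix.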
Machine learning algorithms typically expect all numeric data. When Deep Feature Synthesis generates categorical features, we need to encode them.
In [11]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="customers",
   ....:                                       agg_primitives=["mode"],
   ....:                                       max_depth=1)
   ....:

In [12]: feature_matrix
Out[12]:
             zip_code  MODE(sessions.device)  DAY(date_of_birth)  DAY(join_date)  MONTH(date_of_birth)  MONTH(join_date)  WEEKDAY(date_of_birth)  WEEKDAY(join_date)  YEAR(date_of_birth)  YEAR(join_date)
customer_id
5               60091                 mobile                  28              17                     7                 7                       5                   5                 1984             2010
4               60091                 mobile                  15               8                     8                 4                       1                   4                 2006             2011
1               60091                 mobile                  18              17                     7                 4                       0                   6                 1994             2011
3               13244                desktop                  21              13                    11                 8                       4                   5                 2003             2011
2               13244                desktop                  18              15                     8                 4                       0                   6                 1986             2012
This feature matrix contains two categorical variables, zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.
In [13]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)

In [14]: feature_matrix_enc
Out[14]: (43 columns; shown transposed for readability, one row per encoded feature and one column per customer_id)
customer_id                              5      4      1      3      2
zip_code = 60091                      True   True   True  False  False
zip_code = 13244                     False  False  False   True   True
zip_code is unknown                  False  False  False  False  False
MODE(sessions.device) = mobile        True   True   True  False  False
MODE(sessions.device) = desktop      False  False  False   True   True
MODE(sessions.device) is unknown     False  False  False  False  False
DAY(date_of_birth) = 18              False  False   True  False   True
DAY(date_of_birth) = 28               True  False  False  False  False
DAY(date_of_birth) = 21              False  False  False   True  False
DAY(date_of_birth) = 15              False   True  False  False  False
DAY(date_of_birth) is unknown        False  False  False  False  False
DAY(join_date) = 17                   True  False   True  False  False
DAY(join_date) = 15                  False  False  False  False   True
DAY(join_date) = 13                  False  False  False   True  False
DAY(join_date) = 8                   False   True  False  False  False
DAY(join_date) is unknown            False  False  False  False  False
MONTH(date_of_birth) = 8             False   True  False  False   True
MONTH(date_of_birth) = 7              True  False   True  False  False
MONTH(date_of_birth) = 11            False  False  False   True  False
MONTH(date_of_birth) is unknown      False  False  False  False  False
MONTH(join_date) = 4                 False   True   True  False   True
MONTH(join_date) = 8                 False  False  False   True  False
MONTH(join_date) = 7                  True  False  False  False  False
MONTH(join_date) is unknown          False  False  False  False  False
WEEKDAY(date_of_birth) = 0           False  False   True  False   True
WEEKDAY(date_of_birth) = 5            True  False  False  False  False
WEEKDAY(date_of_birth) = 4           False  False  False   True  False
WEEKDAY(date_of_birth) = 1           False   True  False  False  False
WEEKDAY(date_of_birth) is unknown    False  False  False  False  False
WEEKDAY(join_date) = 6               False  False   True  False   True
WEEKDAY(join_date) = 5                True  False  False   True  False
WEEKDAY(join_date) = 4               False   True  False  False  False
WEEKDAY(join_date) is unknown        False  False  False  False  False
YEAR(date_of_birth) = 2006           False   True  False  False  False
YEAR(date_of_birth) = 2003           False  False  False   True  False
YEAR(date_of_birth) = 1994           False  False   True  False  False
YEAR(date_of_birth) = 1986           False  False  False  False   True
YEAR(date_of_birth) = 1984            True  False  False  False  False
YEAR(date_of_birth) is unknown       False  False  False  False  False
YEAR(join_date) = 2011               False   True   True   True  False
YEAR(join_date) = 2012               False  False  False  False   True
YEAR(join_date) = 2010                True  False  False  False  False
YEAR(join_date) is unknown           False  False  False  False  False
The returned feature matrix is now all numeric. Additionally, we get a new set of feature definitions that contain the encoded values.
In [15]: print(features_enc)
[<Feature: zip_code = 60091>, <Feature: zip_code = 13244>, <Feature: zip_code is unknown>,
 <Feature: MODE(sessions.device) = mobile>, <Feature: MODE(sessions.device) = desktop>, <Feature: MODE(sessions.device) is unknown>,
 <Feature: DAY(date_of_birth) = 18>, <Feature: DAY(date_of_birth) = 28>, <Feature: DAY(date_of_birth) = 21>, <Feature: DAY(date_of_birth) = 15>, <Feature: DAY(date_of_birth) is unknown>,
 <Feature: DAY(join_date) = 17>, <Feature: DAY(join_date) = 15>, <Feature: DAY(join_date) = 13>, <Feature: DAY(join_date) = 8>, <Feature: DAY(join_date) is unknown>,
 <Feature: MONTH(date_of_birth) = 8>, <Feature: MONTH(date_of_birth) = 7>, <Feature: MONTH(date_of_birth) = 11>, <Feature: MONTH(date_of_birth) is unknown>,
 <Feature: MONTH(join_date) = 4>, <Feature: MONTH(join_date) = 8>, <Feature: MONTH(join_date) = 7>, <Feature: MONTH(join_date) is unknown>,
 <Feature: WEEKDAY(date_of_birth) = 0>, <Feature: WEEKDAY(date_of_birth) = 5>, <Feature: WEEKDAY(date_of_birth) = 4>, <Feature: WEEKDAY(date_of_birth) = 1>, <Feature: WEEKDAY(date_of_birth) is unknown>,
 <Feature: WEEKDAY(join_date) = 6>, <Feature: WEEKDAY(join_date) = 5>, <Feature: WEEKDAY(join_date) = 4>, <Feature: WEEKDAY(join_date) is unknown>,
 <Feature: YEAR(date_of_birth) = 2006>, <Feature: YEAR(date_of_birth) = 2003>, <Feature: YEAR(date_of_birth) = 1994>, <Feature: YEAR(date_of_birth) = 1986>, <Feature: YEAR(date_of_birth) = 1984>, <Feature: YEAR(date_of_birth) is unknown>,
 <Feature: YEAR(join_date) = 2011>, <Feature: YEAR(join_date) = 2012>, <Feature: YEAR(join_date) = 2010>, <Feature: YEAR(join_date) is unknown>]
These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read Deployment.
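As a rough illustration of that idea, the sketch below one-hot encodes a categorical column in plain Python and then reuses the learned categories on new values. It is not the encode_features implementation: fit_one_hot and apply_one_hot are hypothetical helpers, and the sketch assumes the "is unknown" column flags any value that was not seen when the encoding was built. The training values are the MODE(sessions.device) column from the feature matrix shown earlier.

```python
def fit_one_hot(values):
    """Record the distinct categories of a training column, in first-seen order."""
    return list(dict.fromkeys(values))


def apply_one_hot(name, value, categories):
    """Encode a single value using the categories learned at fit time."""
    row = {f"{name} = {c}": value == c for c in categories}
    # Assumption: anything outside the learned categories is flagged as unknown.
    row[f"{name} is unknown"] = value not in categories
    return row


# MODE(sessions.device) for customers 5, 4, 1, 3, 2 in the matrix above.
train_devices = ["mobile", "mobile", "mobile", "desktop", "desktop"]
categories = fit_one_hot(train_devices)  # ['mobile', 'desktop']

# Reuse the same categories on new data, including a value never seen before.
known = apply_one_hot("MODE(sessions.device)", "desktop", categories)
unseen = apply_one_hot("MODE(sessions.device)", "tablet", categories)

print(known)
print(unseen)
```

Because the categories are fixed when the encoding is built, new data always yields the same set of columns, and a previously unseen value such as tablet lands in the "is unknown" column instead of creating a new one.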