NOTICE

The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:

pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip

For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.


Tuning Deep Feature Synthesis

There are several parameters that can be tuned to change the output of DFS.

In [1]: import featuretools as ft

In [2]: es = ft.demo.load_mock_customer(return_entityset=True)

In [3]: es
Out[3]: 
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 4]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Using “Seed Features”

Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis then automatically stacks new features on top of them whenever it can.

By using seed features, we can include domain-specific knowledge in feature engineering automation.

In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125

In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["percent_true"],
   ...:                                       seed_features=[expensive_purchase])
   ...: 

In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]: 
             PERCENT_TRUE(transactions.amount > 125)
customer_id                                         
5                                           0.227848
4                                           0.220183
1                                           0.119048
3                                           0.182796
2                                           0.129032

We can now see that PERCENT_TRUE was automatically applied to this boolean variable.
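To make the stacking concrete, the following is a minimal plain-Python sketch of what a feature like PERCENT_TRUE(transactions.amount > 125) represents: a boolean is computed per transaction (the seed feature), and the share of True values is then aggregated per customer. It is not how Featuretools computes the feature internally. The transaction data and the helper name percent_true_by_customer are made up for illustration.

```python
from collections import defaultdict

# Made-up transactions as (customer_id, amount) pairs.
# These are NOT rows from the mock customer dataset.
transactions = [
    (1, 150.0), (1, 20.0), (1, 99.5), (1, 130.0),
    (2, 10.0), (2, 300.0),
    (3, 50.0),
]


def percent_true_by_customer(rows, threshold=125):
    """Fraction of each customer's transactions with amount > threshold."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # Seed feature: one boolean per transaction.
        flags[customer_id].append(amount > threshold)
    # Aggregation stacked on the seed feature: share of True values.
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}


result = percent_true_by_customer(transactions)
print(result)
```

Customer 1 has two of four transactions above the threshold, so the value is 0.5; customer 3 has none, so the value is 0.0.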

Add “interesting” values to variables

Sometimes we want to compute a feature only over the rows that satisfy a condition on a second variable. We call this extra filter a “where clause”.

By default, where clauses are built using the interesting_values of a variable.

In [7]: es["sessions"]["device"].interesting_values = ["desktop", "mobile", "tablet"]

We then use where_primitives to specify which aggregation primitives to build where clauses for.

In [8]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["count", "avg_time_between"],
   ...:                                       where_primitives=["count", "avg_time_between"],
   ...:                                       trans_primitives=[])
   ...: 

In [9]: feature_matrix
Out[9]: 
            zip_code  AVG_TIME_BETWEEN(sessions.session_start)  COUNT(sessions)  AVG_TIME_BETWEEN(transactions.transaction_time)  COUNT(transactions)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)  COUNT(sessions WHERE device = desktop)  COUNT(sessions WHERE device = mobile)  COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(transactions.sessions.session_start)  AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop)  AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet)  AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile)  AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop)  AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet)  AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile)  COUNT(transactions WHERE sessions.device = desktop)  COUNT(transactions WHERE sessions.device = tablet)  COUNT(transactions WHERE sessions.device = mobile)
customer_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
5              60091                               5577.000000                6                                       363.333333                   79                                             9685.0                                                     13942.500000                                                             NaN                                                    2                                      3                                      1                                         357.500000                                             345.892857                                                                               0.000000                                                                            796.714286                                                                            376.071429                                                                        65.000000                                                                      809.714286                                                                              29                                                   14                                                  36 
4              60091                               2516.428571                8                                       168.518519                  109                                             4127.5                                                      3336.666667                                                             NaN                                                    3                                      4                                      1                                         163.101852                                             223.108108                                                                               0.000000                                                                            192.500000                                                                            238.918919                                                                        65.000000                                                                      206.250000                                                                              38                                                   18                                                  53 
1              60091                               3305.714286                8                                       192.920000                  126                                             7150.0                                                     11570.000000                                                          8807.5                                                    2                                      3                                      3                                         185.120000                                             275.000000                                                                             419.404762                                                                            420.727273                                                                            302.500000                                                                       442.619048                                                                      438.454545                                                                              27                                                   43                                                  56 
3              13244                               5096.000000                6                                       287.554348                   93                                             4745.0                                                              NaN                                                             NaN                                                    4                                      1                                      1                                         276.956522                                             233.360656                                                                               0.000000                                                                              0.000000                                                                            251.475410                                                                        65.000000                                                                       65.000000                                                                              62                                                   15                                                  16 
2              13244                               4907.500000                7                                       328.532609                   93                                             6890.0                                                      1690.000000                                                          5330.0                                                    3                                      2                                      2                                         320.054348                                             417.575758                                                                             197.407407                                                                             56.333333                                                                            435.303030                                                                       226.296296                                                                       82.333333                                                                              34                                                   28                                                  31 

Now we have several new, potentially useful features. For example, the two features below tell us how many sessions a customer completed on a tablet and the average time between those sessions.

In [10]: feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]
Out[10]: 
             COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id                                                                                                       
5                                                1                                                NaN             
4                                                1                                                NaN             
1                                                3                                             8807.5             
3                                                1                                                NaN             
2                                                2                                             5330.0             

We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions.
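The NaN values follow from the definition: an average gap needs at least two timestamps. The sketch below illustrates this in plain Python, assuming that "average time between" means the mean gap in seconds between consecutive session starts. The session data and the helper name where_features are hypothetical, and this is not the library's implementation.

```python
import math
from datetime import datetime

# Made-up sessions as (customer_id, device, session_start) tuples.
sessions = [
    (1, "tablet", datetime(2014, 1, 1, 0, 0, 0)),
    (1, "desktop", datetime(2014, 1, 1, 0, 30, 0)),
    (1, "tablet", datetime(2014, 1, 1, 1, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 3, 0, 0)),
    (2, "tablet", datetime(2014, 1, 1, 2, 0, 0)),
    (2, "mobile", datetime(2014, 1, 1, 4, 0, 0)),
]


def where_features(rows, customer_id, device):
    """Count and mean gap (seconds) over one customer's sessions on one device."""
    # The "where clause": keep only the rows matching the device.
    starts = sorted(s for c, d, s in rows if c == customer_id and d == device)
    count = len(starts)
    if count < 2:
        # No gap exists with fewer than two sessions.
        return count, math.nan
    gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
    return count, sum(gaps) / len(gaps)


print(where_features(sessions, 1, "tablet"))  # three tablet sessions
print(where_features(sessions, 2, "tablet"))  # one tablet session, so the gap is NaN
```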

Encoding categorical features

Machine learning algorithms typically expect all numeric data. When Deep Feature Synthesis generates categorical features, we need to encode them.

In [11]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="customers",
   ....:                                       agg_primitives=["mode"],
   ....:                                       max_depth=1)
   ....: 

In [12]: feature_matrix
Out[12]: 
            zip_code MODE(sessions.device)  DAY(date_of_birth)  DAY(join_date)  MONTH(date_of_birth)  MONTH(join_date)  WEEKDAY(date_of_birth)  WEEKDAY(join_date)  YEAR(date_of_birth)  YEAR(join_date)
customer_id                                                                                                                                                                                             
5              60091                mobile                  28              17                     7                 7                       5                   5                 1984             2010
4              60091                mobile                  15               8                     8                 4                       1                   4                 2006             2011
1              60091                mobile                  18              17                     7                 4                       0                   6                 1994             2011
3              13244               desktop                  21              13                    11                 8                       4                   5                 2003             2011
2              13244               desktop                  18              15                     8                 4                       0                   6                 1986             2012

This feature matrix contains categorical variables, such as zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.

In [13]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)

In [14]: feature_matrix_enc
Out[14]: 
             zip_code = 60091  zip_code = 13244  zip_code is unknown  MODE(sessions.device) = mobile  MODE(sessions.device) = desktop  MODE(sessions.device) is unknown  DAY(date_of_birth) = 18  DAY(date_of_birth) = 28  DAY(date_of_birth) = 21  DAY(date_of_birth) = 15  DAY(date_of_birth) is unknown  DAY(join_date) = 17  DAY(join_date) = 15  DAY(join_date) = 13  DAY(join_date) = 8  DAY(join_date) is unknown  MONTH(date_of_birth) = 8  MONTH(date_of_birth) = 7  MONTH(date_of_birth) = 11  MONTH(date_of_birth) is unknown  MONTH(join_date) = 4  MONTH(join_date) = 8  MONTH(join_date) = 7  MONTH(join_date) is unknown  WEEKDAY(date_of_birth) = 0  WEEKDAY(date_of_birth) = 5  WEEKDAY(date_of_birth) = 4  WEEKDAY(date_of_birth) = 1  WEEKDAY(date_of_birth) is unknown  WEEKDAY(join_date) = 6  WEEKDAY(join_date) = 5  WEEKDAY(join_date) = 4  WEEKDAY(join_date) is unknown  YEAR(date_of_birth) = 2006  YEAR(date_of_birth) = 2003  YEAR(date_of_birth) = 1994  YEAR(date_of_birth) = 1986  YEAR(date_of_birth) = 1984  YEAR(date_of_birth) is unknown  YEAR(join_date) = 2011  YEAR(join_date) = 2012  YEAR(join_date) = 2010  YEAR(join_date) is unknown
customer_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
5                        True             False                False                            True                            False                             False                    False                     True                    False                    False                          False                 True                False                False               False                      False                     False                      True                      False                            False                 False                 False                  True                        False                       False                        True                       False                       False                              False                   False                    True                   False                          False                       False                       False                       False                       False                        True                           False                   False                   False                    True                       False
4                        True             False                False                            True                            False                             False                    False                    False                    False                     True                          False                False                False                False                True                      False                      True                     False                      False                            False                  True                 False                 False                        False                       False                       False                       False                        True                              False                   False                   False                    True                          False                        True                       False                       False                       False                       False                           False                    True                   False                   False                       False
1                        True             False                False                            True                            False                             False                     True                    False                    False                    False                          False                 True                False                False               False                      False                     False                      True                      False                            False                  True                 False                 False                        False                        True                       False                       False                       False                              False                    True                   False                   False                          False                       False                       False                        True                       False                       False                           False                    True                   False                   False                       False
3                       False              True                False                           False                             True                             False                    False                    False                     True                    False                          False                False                False                 True               False                      False                     False                     False                       True                            False                 False                  True                 False                        False                       False                       False                        True                       False                              False                   False                    True                   False                          False                       False                        True                       False                       False                       False                           False                    True                   False                   False                       False
2                       False              True                False                           False                             True                             False                     True                    False                    False                    False                          False                False                 True                False               False                      False                      True                     False                      False                            False                  True                 False                 False                        False                        True                       False                       False                       False                              False                    True                   False                   False                          False                       False                       False                       False                        True                       False                           False                   False                    True                   False                       False

Every column in the returned feature matrix is now a boolean indicator that can be treated as numeric. Additionally, we get a new set of feature definitions that contain the encoded values.

In [15]: print(features_enc)
[<Feature: zip_code = 60091>, <Feature: zip_code = 13244>, <Feature: zip_code is unknown>, <Feature: MODE(sessions.device) = mobile>, <Feature: MODE(sessions.device) = desktop>, <Feature: MODE(sessions.device) is unknown>, <Feature: DAY(date_of_birth) = 18>, <Feature: DAY(date_of_birth) = 28>, <Feature: DAY(date_of_birth) = 21>, <Feature: DAY(date_of_birth) = 15>, <Feature: DAY(date_of_birth) is unknown>, <Feature: DAY(join_date) = 17>, <Feature: DAY(join_date) = 15>, <Feature: DAY(join_date) = 13>, <Feature: DAY(join_date) = 8>, <Feature: DAY(join_date) is unknown>, <Feature: MONTH(date_of_birth) = 8>, <Feature: MONTH(date_of_birth) = 7>, <Feature: MONTH(date_of_birth) = 11>, <Feature: MONTH(date_of_birth) is unknown>, <Feature: MONTH(join_date) = 4>, <Feature: MONTH(join_date) = 8>, <Feature: MONTH(join_date) = 7>, <Feature: MONTH(join_date) is unknown>, <Feature: WEEKDAY(date_of_birth) = 0>, <Feature: WEEKDAY(date_of_birth) = 5>, <Feature: WEEKDAY(date_of_birth) = 4>, <Feature: WEEKDAY(date_of_birth) = 1>, <Feature: WEEKDAY(date_of_birth) is unknown>, <Feature: WEEKDAY(join_date) = 6>, <Feature: WEEKDAY(join_date) = 5>, <Feature: WEEKDAY(join_date) = 4>, <Feature: WEEKDAY(join_date) is unknown>, <Feature: YEAR(date_of_birth) = 2006>, <Feature: YEAR(date_of_birth) = 2003>, <Feature: YEAR(date_of_birth) = 1994>, <Feature: YEAR(date_of_birth) = 1986>, <Feature: YEAR(date_of_birth) = 1984>, <Feature: YEAR(date_of_birth) is unknown>, <Feature: YEAR(join_date) = 2011>, <Feature: YEAR(join_date) = 2012>, <Feature: YEAR(join_date) = 2010>, <Feature: YEAR(join_date) is unknown>]

These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read Deployment.
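The idea of reusing an encoding on new data can be sketched in plain Python: the categories are fixed from the original data, and any value not seen before falls into the "is unknown" column. This is a simplified illustration rather than the behavior of ft.encode_features itself; the helpers fit_encoding and encode are hypothetical, and categories are kept in first-seen order here for simplicity.

```python
def fit_encoding(values):
    """Remember the distinct categories seen in the original data, in first-seen order."""
    return list(dict.fromkeys(values))


def encode(name, values, categories):
    """Build one boolean column per known category, plus an 'is unknown' column."""
    columns = {
        f"{name} = {cat}": [v == cat for v in values] for cat in categories
    }
    # Values outside the remembered categories are flagged as unknown.
    columns[f"{name} is unknown"] = [v not in categories for v in values]
    return columns


# Made-up values standing in for MODE(sessions.device).
train = ["mobile", "mobile", "desktop"]
categories = fit_encoding(train)

# New data contains "tablet", which was not present when the encoding was fixed.
new_data = ["desktop", "tablet"]
encoded = encode("MODE(sessions.device)", new_data, categories)
for column, values in encoded.items():
    print(column, values)
```

Because the column set is fixed by the original categories, the encoded new data has the same columns as the encoded original data, which is what a trained model expects.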