Tuning Deep Feature Synthesis

There are several parameters that can be tuned to change the output of DFS.

In [1]: import featuretools as ft

In [2]: es = ft.demo.load_mock_customer(return_entityset=True)

In [3]: es
Out[3]: 
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 4]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Using “Seed Features”

Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these seed features when it can.

By using seed features, we can include domain-specific knowledge in feature engineering automation.

In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125

In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["percent_true"],
   ...:                                       seed_features=[expensive_purchase])
   ...: 

In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]: 
             PERCENT_TRUE(transactions.amount > 125)
customer_id                                         
5                                           0.227848
4                                           0.220183
1                                           0.119048
3                                           0.182796
2                                           0.129032

We can now see that PERCENT_TRUE was automatically applied to this boolean variable.
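
To make the stacking concrete, here is a minimal plain-Python sketch of what a PERCENT_TRUE feature built on a boolean seed feature computes. The amounts are made-up illustrative data, not the mock customer dataset, and the helper percent_true is a hypothetical stand-in, not the Featuretools primitive itself.

```python
import math

# Hypothetical transaction amounts grouped by customer (illustrative data only).
amounts_by_customer = {
    1: [50.0, 130.0, 20.0, 200.0],
    2: [10.0, 15.0],
}


def percent_true(flags):
    """Fraction of True values in a list of booleans; NaN for an empty list."""
    if not flags:
        return math.nan
    return sum(flags) / len(flags)


# Seed feature: one boolean per transaction (amount > 125).
# Stacked aggregation: one number per customer.
percent_expensive = {
    customer_id: percent_true([amount > 125 for amount in amounts])
    for customer_id, amounts in amounts_by_customer.items()
}

print(percent_expensive)
```

The seed feature lives at the transaction level, while the aggregation collapses it to one value per customer. This is the same shape as the feature matrix column above.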

Add “interesting” values to variables

Sometimes we want to create features that are calculated only on the rows where a second variable takes a particular value. We call this extra filter a “where clause”.

By default, where clauses are built using the interesting_values of a variable.

In [7]: es["sessions"]["device"].interesting_values = ["desktop", "mobile", "tablet"]

We then use where_primitives to specify which aggregation primitives to build where clauses for.

In [8]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["count", "avg_time_between"],
   ...:                                       where_primitives=["count", "avg_time_between"],
   ...:                                       trans_primitives=[])
   ...: 

In [9]: feature_matrix
Out[9]: 
            zip_code  COUNT(sessions)  AVG_TIME_BETWEEN(sessions.session_start)  COUNT(transactions)  AVG_TIME_BETWEEN(transactions.transaction_time)  COUNT(sessions WHERE device = tablet)  COUNT(sessions WHERE device = mobile)  COUNT(sessions WHERE device = desktop)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop)  AVG_TIME_BETWEEN(transactions.sessions.session_start)  COUNT(transactions WHERE sessions.device = mobile)  COUNT(transactions WHERE sessions.device = tablet)  COUNT(transactions WHERE sessions.device = desktop)  AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile)  AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet)  AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop)  AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile)  AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet)  AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop)
customer_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
5              60091                6                               5577.000000                   79                                       363.333333                                      1                                      3                                       2                                                NaN                                                    13942.500000                                                          9685.0                                                       357.500000                                                     36                                                  14                                                  29                                           809.714286                                                                       65.000000                                                                      376.071429                                                                       796.714286                                                                              0.000000                                                                            345.892857                                    
4              60091                8                               2516.428571                  109                                       168.518519                                      1                                      4                                       3                                                NaN                                                     3336.666667                                                          4127.5                                                       163.101852                                                     53                                                  18                                                  38                                           206.250000                                                                       65.000000                                                                      238.918919                                                                       192.500000                                                                              0.000000                                                                            223.108108                                    
1              60091                8                               3305.714286                  126                                       192.920000                                      3                                      3                                       2                                             8807.5                                                    11570.000000                                                          7150.0                                                       185.120000                                                     56                                                  43                                                  27                                           438.454545                                                                      442.619048                                                                      302.500000                                                                       420.727273                                                                            419.404762                                                                            275.000000                                    
3              13244                6                               5096.000000                   93                                       287.554348                                      1                                      1                                       4                                                NaN                                                             NaN                                                          4745.0                                                       276.956522                                                     16                                                  15                                                  62                                            65.000000                                                                       65.000000                                                                      251.475410                                                                         0.000000                                                                              0.000000                                                                            233.360656                                    
2              13244                7                               4907.500000                   93                                       328.532609                                      2                                      2                                       3                                             5330.0                                                     1690.000000                                                          6890.0                                                       320.054348                                                     31                                                  28                                                  34                                            82.333333                                                                      226.296296                                                                      435.303030                                                                        56.333333                                                                            197.407407                                                                            417.575758                                    

Now we have several new, potentially useful features. For example, the two features below tell us how many sessions a customer completed on a tablet and the average time between those sessions.

In [10]: feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]
Out[10]: 
             COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id                                                                                                       
5                                                1                                                NaN             
4                                                1                                                NaN             
1                                                3                                             8807.5             
3                                                1                                                NaN             
2                                                2                                             5330.0             

We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions.
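
The following plain-Python sketch illustrates why a where clause can produce NaN. It filters a small, made-up list of sessions by customer and device, then computes a count and an average gap between consecutive start times. The data and the helpers avg_time_between and where_features are hypothetical illustrations, not Featuretools internals, and the sketch assumes the gap is measured in seconds.

```python
import math
from datetime import datetime

# Hypothetical sessions: (customer_id, device, session_start).
sessions = [
    (1, "tablet", datetime(2014, 1, 1, 0, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 1, 0, 0)),
    (1, "mobile", datetime(2014, 1, 1, 2, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 3, 0, 0)),
    (2, "tablet", datetime(2014, 1, 1, 0, 30, 0)),
    (2, "desktop", datetime(2014, 1, 1, 4, 0, 0)),
]


def avg_time_between(times):
    """Mean gap in seconds between consecutive timestamps; NaN if fewer than two."""
    if len(times) < 2:
        return math.nan
    ordered = sorted(times)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def where_features(customer_id, device):
    """Count and average gap for one customer's sessions on one device."""
    starts = [
        start
        for cid, dev, start in sessions
        if cid == customer_id and dev == device
    ]
    return len(starts), avg_time_between(starts)


tablet_1 = where_features(1, "tablet")  # three tablet sessions
tablet_2 = where_features(2, "tablet")  # a single tablet session

print(tablet_1)
print(tablet_2)
```

With fewer than two matching sessions there is no gap to average, so the sketch returns NaN instead of inventing a value such as 0.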

Encoding categorical features

Machine learning algorithms typically expect all input data to be numeric. When Deep Feature Synthesis generates categorical features, we need to encode them.

In [11]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="customers",
   ....:                                       agg_primitives=["mode"],
   ....:                                       max_depth=1)
   ....: 

In [12]: feature_matrix
Out[12]: 
            zip_code MODE(sessions.device)  DAY(date_of_birth)  DAY(join_date)  YEAR(date_of_birth)  YEAR(join_date)  MONTH(date_of_birth)  MONTH(join_date)  WEEKDAY(date_of_birth)  WEEKDAY(join_date)
customer_id                                                                                                                                                                                             
5              60091                mobile                  28              17                 1984             2010                     7                 7                       5                   5
4              60091                mobile                  15               8                 2006             2011                     8                 4                       1                   4
1              60091                mobile                  18              17                 1994             2011                     7                 4                       0                   6
3              13244               desktop                  21              13                 2003             2011                    11                 8                       4                   5
2              13244               desktop                  18              15                 1986             2012                     8                 4                       0                   6

This feature matrix contains two categorical variables, zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.

In [13]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)

In [14]: feature_matrix_enc
Out[14]: 
             zip_code = 60091  zip_code = 13244  zip_code is unknown  MODE(sessions.device) = mobile  MODE(sessions.device) = desktop  MODE(sessions.device) is unknown  DAY(date_of_birth) = 18  DAY(date_of_birth) = 28  DAY(date_of_birth) = 21  DAY(date_of_birth) = 15  DAY(date_of_birth) is unknown  DAY(join_date) = 17  DAY(join_date) = 15  DAY(join_date) = 13  DAY(join_date) = 8  DAY(join_date) is unknown  YEAR(date_of_birth) = 2006  YEAR(date_of_birth) = 2003  YEAR(date_of_birth) = 1994  YEAR(date_of_birth) = 1986  YEAR(date_of_birth) = 1984  YEAR(date_of_birth) is unknown  YEAR(join_date) = 2011  YEAR(join_date) = 2012  YEAR(join_date) = 2010  YEAR(join_date) is unknown  MONTH(date_of_birth) = 8  MONTH(date_of_birth) = 7  MONTH(date_of_birth) = 11  MONTH(date_of_birth) is unknown  MONTH(join_date) = 4  MONTH(join_date) = 8  MONTH(join_date) = 7  MONTH(join_date) is unknown  WEEKDAY(date_of_birth) = 0  WEEKDAY(date_of_birth) = 5  WEEKDAY(date_of_birth) = 4  WEEKDAY(date_of_birth) = 1  WEEKDAY(date_of_birth) is unknown  WEEKDAY(join_date) = 6  WEEKDAY(join_date) = 5  WEEKDAY(join_date) = 4  WEEKDAY(join_date) is unknown
customer_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
5                           1                 0                    0                               1                                0                                 0                        0                        1                        0                        0                              0                    1                    0                    0                   0                          0                           0                           0                           0                           0                           1                               0                       0                       0                       1                           0                         0                         1                          0                                0                     0                     0                     1                            0                           0                           1                           0                           0                                  0                       0                       1                       0                              0
4                           1                 0                    0                               1                                0                                 0                        0                        0                        0                        1                              0                    0                    0                    0                   1                          0                           1                           0                           0                           0                           0                               0                       1                       0                       0                           0                         1                         0                          0                                0                     1                     0                     0                            0                           0                           0                           0                           1                                  0                       0                       0                       1                              0
1                           1                 0                    0                               1                                0                                 0                        1                        0                        0                        0                              0                    1                    0                    0                   0                          0                           0                           0                           1                           0                           0                               0                       1                       0                       0                           0                         0                         1                          0                                0                     1                     0                     0                            0                           1                           0                           0                           0                                  0                       1                       0                       0                              0
3                           0                 1                    0                               0                                1                                 0                        0                        0                        1                        0                              0                    0                    0                    1                   0                          0                           0                           1                           0                           0                           0                               0                       1                       0                       0                           0                         0                         0                          1                                0                     0                     1                     0                            0                           0                           0                           1                           0                                  0                       0                       1                       0                              0
2                           0                 1                    0                               0                                1                                 0                        1                        0                        0                        0                              0                    0                    1                    0                   0                          0                           0                           0                           0                           1                           0                               0                       0                       1                       0                           0                         1                         0                          0                                0                     1                     0                     0                            0                           1                           0                           0                           0                                  0                       1                       0                       0                              0

The returned feature matrix is now all numeric. Additionally, we get a new set of feature definitions that contain the encoded values.

In [15]: print(features_enc)
[<Feature: zip_code = 60091>, <Feature: zip_code = 13244>, <Feature: zip_code is unknown>, <Feature: MODE(sessions.device) = mobile>, <Feature: MODE(sessions.device) = desktop>, <Feature: MODE(sessions.device) is unknown>, <Feature: DAY(date_of_birth) = 18>, <Feature: DAY(date_of_birth) = 28>, <Feature: DAY(date_of_birth) = 21>, <Feature: DAY(date_of_birth) = 15>, <Feature: DAY(date_of_birth) is unknown>, <Feature: DAY(join_date) = 17>, <Feature: DAY(join_date) = 15>, <Feature: DAY(join_date) = 13>, <Feature: DAY(join_date) = 8>, <Feature: DAY(join_date) is unknown>, <Feature: YEAR(date_of_birth) = 2006>, <Feature: YEAR(date_of_birth) = 2003>, <Feature: YEAR(date_of_birth) = 1994>, <Feature: YEAR(date_of_birth) = 1986>, <Feature: YEAR(date_of_birth) = 1984>, <Feature: YEAR(date_of_birth) is unknown>, <Feature: YEAR(join_date) = 2011>, <Feature: YEAR(join_date) = 2012>, <Feature: YEAR(join_date) = 2010>, <Feature: YEAR(join_date) is unknown>, <Feature: MONTH(date_of_birth) = 8>, <Feature: MONTH(date_of_birth) = 7>, <Feature: MONTH(date_of_birth) = 11>, <Feature: MONTH(date_of_birth) is unknown>, <Feature: MONTH(join_date) = 4>, <Feature: MONTH(join_date) = 8>, <Feature: MONTH(join_date) = 7>, <Feature: MONTH(join_date) is unknown>, <Feature: WEEKDAY(date_of_birth) = 0>, <Feature: WEEKDAY(date_of_birth) = 5>, <Feature: WEEKDAY(date_of_birth) = 4>, <Feature: WEEKDAY(date_of_birth) = 1>, <Feature: WEEKDAY(date_of_birth) is unknown>, <Feature: WEEKDAY(join_date) = 6>, <Feature: WEEKDAY(join_date) = 5>, <Feature: WEEKDAY(join_date) = 4>, <Feature: WEEKDAY(join_date) is unknown>]

These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read Deployment.
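
As a rough illustration of why reusing the encoded feature definitions matters, the plain-Python sketch below learns a set of categories from training values and then applies the same columns to new values. The helpers fit_one_hot and apply_one_hot are hypothetical and are not part of Featuretools. The sketch assumes that a value not seen during fitting is flagged in the "is unknown" column, and it orders categories by first appearance for simplicity.

```python
def fit_one_hot(values):
    """Record the distinct categories in a training column, in first-seen order."""
    categories = []
    for value in values:
        if value not in categories:
            categories.append(value)
    return categories


def apply_one_hot(name, categories, values):
    """Encode values using fixed categories, plus an 'is unknown' indicator."""
    rows = []
    for value in values:
        row = {f"{name} = {category}": int(value == category) for category in categories}
        row[f"{name} is unknown"] = int(value not in categories)
        rows.append(row)
    return rows


# Categories are learned once, from the training data only.
train_devices = ["mobile", "mobile", "mobile", "desktop", "desktop"]
categories = fit_one_hot(train_devices)

# New data may contain a value ("tablet") that was absent during training.
new_rows = apply_one_hot("MODE(sessions.device)", categories, ["desktop", "tablet"])

for row in new_rows:
    print(row)
```

Because the column set is fixed at fitting time, the new data ends up with exactly the same columns as the training matrix, which is what a trained model expects.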