Tuning Deep Feature Synthesis#

There are several parameters that can be tuned to change the output of DFS. We’ll explore these parameters using the following transactions EntitySet.

[1]:

import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
es

[1]:

Entityset: transactions
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 5]
    customers [Rows: 5, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Using “Seed Features”#

Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.

By using seed features, we can include domain-specific knowledge in feature engineering automation. For the seed feature below, the domain knowledge may be that, for a specific retailer, a transaction above $125 would be considered an expensive purchase.

[2]:

expensive_purchase = ft.Feature(es["transactions"].ww["amount"]) > 125

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["percent_true"],
    seed_features=[expensive_purchase],
)
feature_matrix[["PERCENT_TRUE(transactions.amount > 125)"]]

[2]:

	PERCENT_TRUE(transactions.amount > 125)
customer_id
5	0.227848
4	0.220183
1	0.119048
3	0.182796
2	0.129032

We can now see that the PERCENT_TRUE primitive was automatically applied to the boolean expensive_purchase feature from the transactions table. The resulting feature can be understood as the share of a customer's transactions that are considered expensive, expressed as a fraction between 0 and 1.
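To make the aggregation concrete, here is a minimal plain-Python sketch of the idea behind stacking PERCENT_TRUE on a boolean seed feature. It is not the Featuretools implementation; the amounts_by_customer data and the percent_true helper are hypothetical and exist only for illustration.

```python
import math

# Hypothetical transaction amounts per customer (not taken from the mock dataset).
amounts_by_customer = {
    1: [50.0, 130.0, 20.0, 200.0],
    2: [10.0, 15.0],
}

THRESHOLD = 125  # same cutoff as the seed feature above


def percent_true(flags):
    """Return the fraction of True values in an iterable of booleans."""
    flags = list(flags)
    # Returning NaN for an empty group is an assumption of this sketch.
    return sum(flags) / len(flags) if flags else math.nan


# Evaluate the boolean "seed" per transaction, then aggregate per customer.
percent_expensive = {
    customer_id: percent_true(amount > THRESHOLD for amount in amounts)
    for customer_id, amounts in amounts_by_customer.items()
}
print(percent_expensive)
```

In this toy data, two of customer 1's four transactions exceed the threshold, so the value is 0.5; none of customer 2's do, so the value is 0.0.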

Add “interesting” values to columns#

Sometimes we want to create features that are conditioned on a second value before calculations are performed. We call this extra filter a “where clause”. Where clauses are used in Deep Feature Synthesis by including primitives in the where_primitives parameter to DFS.

By default, where clauses are built using the interesting_values of a column.

Interesting values can be automatically determined and added for each DataFrame in a pandas EntitySet by calling es.add_interesting_values().

Note that Dask and Spark EntitySets cannot have interesting values determined automatically for their DataFrames. For those EntitySets, or when interesting values are already known for columns, the dataframe_name and values parameters can be used to set interesting values for individual columns in a DataFrame in an EntitySet.

[3]:

values_dict = {"device": ["desktop", "mobile", "tablet"]}
es.add_interesting_values(dataframe_name="sessions", values=values_dict)

Interesting values are stored in the DataFrame’s Woodwork typing information.

[4]:

es["sessions"].ww.columns["device"].metadata

[4]:

{'dataframe_name': 'sessions',
 'entityset_id': 'transactions',
 'interesting_values': ['desktop', 'mobile', 'tablet']}

Now that interesting values are set for the device column in the sessions table, we can specify the aggregation primitives for which we want where clauses using the where_primitives parameter to DFS.

[5]:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "avg_time_between"],
    where_primitives=["count", "avg_time_between"],
    trans_primitives=[],
)
feature_matrix

[5]:

	zip_code	AVG_TIME_BETWEEN(sessions.session_start)	COUNT(sessions)	AVG_TIME_BETWEEN(transactions.transaction_time)	COUNT(transactions)	AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)	AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop)	AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile)	COUNT(sessions WHERE device = tablet)	COUNT(sessions WHERE device = desktop)	...	AVG_TIME_BETWEEN(transactions.sessions.session_start)	AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop)	AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile)	AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet)	AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop)	AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile)	AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet)	COUNT(transactions WHERE sessions.device = desktop)	COUNT(transactions WHERE sessions.device = mobile)	COUNT(transactions WHERE sessions.device = tablet)
customer_id
5	60091	5577.000000	6	363.333333	79	NaN	9685.0	13942.500000	1	2	...	357.500000	345.892857	796.714286	0.000000	376.071429	809.714286	65.000000	29	36	14
4	60091	2516.428571	8	168.518519	109	NaN	4127.5	3336.666667	1	3	...	163.101852	223.108108	192.500000	0.000000	238.918919	206.250000	65.000000	38	53	18
1	60091	3305.714286	8	192.920000	126	8807.5	7150.0	11570.000000	3	2	...	185.120000	275.000000	420.727273	419.404762	302.500000	438.454545	442.619048	27	56	43
3	13244	5096.000000	6	287.554348	93	NaN	4745.0	NaN	1	4	...	276.956522	233.360656	0.000000	0.000000	251.475410	65.000000	65.000000	62	16	15
2	13244	4907.500000	7	328.532609	93	5330.0	6890.0	1690.000000	2	3	...	320.054348	417.575758	56.333333	197.407407	435.303030	82.333333	226.296296	34	31	28

5 rows × 21 columns

Now, we have several new potentially useful features. Here are two of them that are built off of the where clause “where the device used was a tablet”:

[6]:

feature_matrix[
    [
        "COUNT(sessions WHERE device = tablet)",
        "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)",
    ]
]

[6]:

	COUNT(sessions WHERE device = tablet)	AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id
5	1	NaN
4	1	NaN
1	3	8807.5
3	1	NaN
2	2	5330.0

The first feature, COUNT(sessions WHERE device = tablet), can be understood as indicating how many sessions a customer completed on a tablet.

The second feature, AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet), calculates the average time between those sessions.

We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions, since at least two sessions are needed to measure a gap.
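The effect of a where clause can be sketched in plain Python as "filter first, then aggregate". The example below is only an illustration under assumed data: the sessions list and both helper functions are hypothetical, and reporting the gap in seconds is an assumption of this sketch rather than a statement about Featuretools internals.

```python
import math
from datetime import datetime

# Hypothetical sessions for a single customer: (session_start, device).
sessions = [
    (datetime(2014, 1, 1, 0, 0), "tablet"),
    (datetime(2014, 1, 1, 1, 0), "desktop"),
    (datetime(2014, 1, 1, 3, 0), "tablet"),
    (datetime(2014, 1, 1, 4, 0), "mobile"),
]


def count_where(rows, device):
    """Count only the rows that satisfy the where clause."""
    return sum(1 for _, row_device in rows if row_device == device)


def avg_time_between_where(rows, device):
    """Average gap in seconds between filtered sessions; NaN if fewer than two."""
    times = sorted(start for start, row_device in rows if row_device == device)
    if len(times) < 2:
        return math.nan
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


print(count_where(sessions, "tablet"), avg_time_between_where(sessions, "tablet"))
print(count_where(sessions, "mobile"), avg_time_between_where(sessions, "mobile"))
```

With this data there are two tablet sessions three hours apart, while the single mobile session leaves no gap to average, which mirrors the NaN pattern in the table above.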

Encoding categorical features#

Machine learning algorithms typically expect numeric data, or data with a defined numeric representation, such as boolean values corresponding to 0 and 1. When Deep Feature Synthesis generates categorical features, we can encode them using Featuretools.

[7]:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["time_since"],
    max_depth=1,
)

feature_matrix

[7]:

	zip_code	MODE(sessions.device)	TIME_SINCE(birthday)	TIME_SINCE(join_date)
customer_id
5	60091	mobile	1.215383e+09	3.958597e+08
4	60091	mobile	5.196041e+08	3.729108e+08
1	60091	mobile	9.007145e+08	3.721668e+08
3	13244	desktop	6.058313e+08	3.619540e+08
2	13244	desktop	1.150497e+09	3.406715e+08

This feature matrix contains 2 columns that are categorical in nature, zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values into boolean values. Featuretools offers functionality to apply one hot encoding to the output of DFS.

[8]:

feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc

[8]:

	TIME_SINCE(birthday)	TIME_SINCE(join_date)	zip_code = 60091	zip_code = 13244	zip_code is unknown	MODE(sessions.device) = mobile	MODE(sessions.device) = desktop	MODE(sessions.device) is unknown
customer_id
5	1.215383e+09	3.958597e+08	True	False	False	True	False	False
4	5.196041e+08	3.729108e+08	True	False	False	True	False	False
1	9.007145e+08	3.721668e+08	True	False	False	True	False	False
3	6.058313e+08	3.619540e+08	False	True	False	False	True	False
2	1.150497e+09	3.406715e+08	False	True	False	False	True	False

The returned feature matrix is now encoded in a way that is interpretable to machine learning algorithms. Notice how the columns that did not need encoding are still included. Additionally, we get a new set of feature definitions that contain the encoded values.
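For readers unfamiliar with one-hot encoding, the following plain-Python sketch mimics the shape of the output above, including an "is unknown" column for values outside the known categories. It is a simplified illustration, not how encode_features is implemented; feature_rows, new_rows, and one_hot_encode are hypothetical names.

```python
# Hypothetical rows standing in for a small feature matrix.
feature_rows = [
    {"customer_id": 5, "zip_code": "60091", "MODE(sessions.device)": "mobile"},
    {"customer_id": 3, "zip_code": "13244", "MODE(sessions.device)": "desktop"},
]


def one_hot_encode(rows, column, categories):
    """Replace `column` with one boolean column per category plus an unknown flag."""
    encoded = []
    for row in rows:
        # Columns that do not need encoding are carried over unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column} = {category}"] = row[column] == category
        new_row[f"{column} is unknown"] = row[column] not in categories
        encoded.append(new_row)
    return encoded


encoded = one_hot_encode(feature_rows, "zip_code", ["60091", "13244"])
encoded = one_hot_encode(encoded, "MODE(sessions.device)", ["mobile", "desktop"])

# Reusing the same category lists on new data keeps the columns consistent;
# a value not seen before is flagged as unknown.
new_rows = [{"customer_id": 9, "zip_code": "60091", "MODE(sessions.device)": "tablet"}]
new_encoded = one_hot_encode(new_rows, "MODE(sessions.device)", ["mobile", "desktop"])
print(new_encoded[0])
```

Fixing the category list up front is the design choice that matters here: it guarantees that training data and later data produce the same set of columns.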

[9]:

features_enc

[9]:

[<Feature: zip_code = 60091>,
 <Feature: zip_code = 13244>,
 <Feature: zip_code is unknown>,
 <Feature: MODE(sessions.device) = mobile>,
 <Feature: MODE(sessions.device) = desktop>,
 <Feature: MODE(sessions.device) is unknown>,
 <Feature: TIME_SINCE(birthday)>,
 <Feature: TIME_SINCE(join_date)>]

These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read the Deployment guide.
