Specifying Primitive Options

By default, DFS will apply primitives across all dataframes and columns. This behavior can be altered through a few different parameters. Dataframes and columns can be optionally ignored or included for an entire DFS run or on a per-primitive basis, enabling greater control over features and less run time overhead.

[1]:
import featuretools as ft
from featuretools.tests.testing_utils import make_ecommerce_entityset

es = make_ecommerce_entityset()

features_list = ft.dfs(entityset=es,
                       target_dataframe_name='customers',
                       agg_primitives=['mode'],
                       trans_primitives=['weekday'],
                       features_only=True)
features_list
[1]:
[<Feature: age>,
 <Feature: région_id>,
 <Feature: cohort>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: MODE(sessions.device_name)>,
 <Feature: MODE(sessions.device_type)>,
 <Feature: MODE(log.countrycode)>,
 <Feature: MODE(log.priority_level)>,
 <Feature: MODE(log.product_id)>,
 <Feature: MODE(log.subregioncode)>,
 <Feature: MODE(log.zipcode)>,
 <Feature: WEEKDAY(cancel_date)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: WEEKDAY(signup_date)>,
 <Feature: WEEKDAY(upgrade_date)>,
 <Feature: cohorts.cohort_name>,
 <Feature: régions.language>,
 <Feature: MODE(sessions.MODE(log.countrycode))>,
 <Feature: MODE(sessions.MODE(log.priority_level))>,
 <Feature: MODE(sessions.MODE(log.product_id))>,
 <Feature: MODE(sessions.MODE(log.subregioncode))>,
 <Feature: MODE(sessions.MODE(log.zipcode))>,
 <Feature: MODE(log.sessions.device_name)>,
 <Feature: MODE(log.sessions.device_type)>,
 <Feature: cohorts.MODE(customers.cancel_reason)>,
 <Feature: cohorts.MODE(customers.engagement_level)>,
 <Feature: cohorts.MODE(customers.région_id)>,
 <Feature: cohorts.MODE(sessions.device_name)>,
 <Feature: cohorts.MODE(sessions.device_type)>,
 <Feature: cohorts.MODE(log.countrycode)>,
 <Feature: cohorts.MODE(log.priority_level)>,
 <Feature: cohorts.MODE(log.product_id)>,
 <Feature: cohorts.MODE(log.subregioncode)>,
 <Feature: cohorts.MODE(log.zipcode)>,
 <Feature: cohorts.WEEKDAY(cohort_end)>,
 <Feature: régions.MODE(customers.cancel_reason)>,
 <Feature: régions.MODE(customers.engagement_level)>,
 <Feature: régions.MODE(sessions.device_name)>,
 <Feature: régions.MODE(sessions.device_type)>,
 <Feature: régions.MODE(log.countrycode)>,
 <Feature: régions.MODE(log.priority_level)>,
 <Feature: régions.MODE(log.product_id)>,
 <Feature: régions.MODE(log.subregioncode)>,
 <Feature: régions.MODE(log.zipcode)>]

Specifying Options for an Entire Run

The ignore_dataframes and ignore_columns parameters of DFS control dataframes and columns that should be ignored for all primitives. This is useful for ignoring columns or dataframes that don’t relate to the problem or otherwise shouldn’t be included in the DFS run.

[2]:
# ignore the 'log' and 'cohorts' dataframes entirely
# ignore the 'date_of_birth' column in 'customers' and the 'device_name' column in 'sessions'
features_list = ft.dfs(entityset=es,
                       target_dataframe_name='customers',
                       agg_primitives=['mode'],
                       trans_primitives=['weekday'],
                       ignore_dataframes=['log', 'cohorts'],
                       ignore_columns={'sessions': ['device_name'],
                                       'customers': ['date_of_birth']},
                       features_only=True)
features_list
[2]:
[<Feature: age>,
 <Feature: région_id>,
 <Feature: cohort>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: MODE(sessions.device_type)>,
 <Feature: WEEKDAY(cancel_date)>,
 <Feature: WEEKDAY(signup_date)>,
 <Feature: WEEKDAY(upgrade_date)>,
 <Feature: régions.language>,
 <Feature: régions.MODE(customers.cancel_reason)>,
 <Feature: régions.MODE(customers.engagement_level)>,
 <Feature: régions.MODE(sessions.device_type)>]

DFS completely ignores the log and cohorts dataframes when creating features. It also ignores the columns device_name and date_of_birth in sessions and customers respectively. However, both of these options can be overridden by individual primitive options in the primitive_options parameter.

Specifying Options for Individual Primitives

Options for individual primitives or groups of primitives are set by the primitive_options parameter of DFS. This parameter maps any desired options to specific primitives. In the case of conflicting options, options set at this level will override options set at the entire DFS run level, and the include options will always take priority over their ignore counterparts.

Using the string primitive name or the primitive type as the key applies the options to all primitives of the same name. You can also set options for a specific instance of a primitive by using the primitive instance as a key in the primitive_options dictionary. Note, however, that an instance with its own options ignores any options set for the primitive in general, that is, options keyed by the primitive name or class.
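The precedence rules described above can be summarized with a small plain-Python model. This is only a simplified sketch, not Featuretools' actual implementation: the helper dataframe_is_used is hypothetical, and it assumes that a primitive-level ignore_dataframes list adds to the run-level list rather than replacing it.

```python
def dataframe_is_used(name, run_ignore=(), primitive_options=None):
    """Return True if a primitive may draw inputs from dataframe `name`.

    Simplified model only; `dataframe_is_used` is a hypothetical helper.
    """
    options = primitive_options or {}
    include = options.get("include_dataframes")
    if include is not None:
        # An include list takes priority over every ignore list.
        return name in include
    # Assumption: primitive-level ignores add to the run-level ignores.
    ignored = set(run_ignore) | set(options.get("ignore_dataframes", ()))
    return name not in ignored


run_ignore = ["log", "cohorts"]

# No primitive-level options: the run-level setting applies.
print(dataframe_is_used("log", run_ignore))  # False

# A primitive-level include overrides the run-level ignore.
print(dataframe_is_used("log", run_ignore, {"include_dataframes": ["log"]}))  # True
```

In this model, an include list also excludes every dataframe it does not name, so the last call would return False for 'customers'.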

Specifying Dataframes for Individual Primitives

Which dataframes to include/ignore can also be specified for a single primitive or a group of primitives. Dataframes can be ignored using the ignore_dataframes option in primitive_options, while dataframes to explicitly include are set by the include_dataframes option. When include_dataframes is given, all dataframes not listed are ignored by the primitive. No columns from any excluded dataframe will be used to generate features with the given primitive.

[3]:
# ignore the 'cohorts' and 'log' dataframes, but only for the primitive 'mode'
# include only the 'customers' dataframe for the primitives 'weekday' and 'day'
features_list = ft.dfs(entityset=es,
                       target_dataframe_name='customers',
                       agg_primitives=['mode'],
                       trans_primitives=['weekday', 'day'],
                       primitive_options={
                           'mode': {'ignore_dataframes': ['cohorts', 'log']},
                           ('weekday', 'day'): {'include_dataframes': ['customers']}
                       },
                       features_only=True)
features_list
[3]:
[<Feature: age>,
 <Feature: région_id>,
 <Feature: cohort>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: MODE(sessions.device_name)>,
 <Feature: MODE(sessions.device_type)>,
 <Feature: DAY(cancel_date)>,
 <Feature: DAY(date_of_birth)>,
 <Feature: DAY(signup_date)>,
 <Feature: DAY(upgrade_date)>,
 <Feature: WEEKDAY(cancel_date)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: WEEKDAY(signup_date)>,
 <Feature: WEEKDAY(upgrade_date)>,
 <Feature: cohorts.cohort_name>,
 <Feature: régions.language>,
 <Feature: cohorts.MODE(customers.cancel_reason)>,
 <Feature: cohorts.MODE(customers.engagement_level)>,
 <Feature: cohorts.MODE(customers.région_id)>,
 <Feature: cohorts.MODE(sessions.device_name)>,
 <Feature: cohorts.MODE(sessions.device_type)>,
 <Feature: régions.MODE(customers.cancel_reason)>,
 <Feature: régions.MODE(customers.engagement_level)>,
 <Feature: régions.MODE(sessions.device_name)>,
 <Feature: régions.MODE(sessions.device_type)>]

In this example, DFS would only use the customers dataframe for both weekday and day, and would use all dataframes except cohorts and log for mode.
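The tuple key ('weekday', 'day') in the example applies one option dictionary to a group of primitives. A minimal plain-Python sketch of that grouping follows. The helper expand_primitive_options is hypothetical and not part of Featuretools, and raising an error on duplicate entries is an assumption of this sketch.

```python
def expand_primitive_options(primitive_options):
    """Flatten tuple keys so that every primitive has its own entry.

    Hypothetical helper for illustration; not part of Featuretools.
    """
    expanded = {}
    for key, options in primitive_options.items():
        # A tuple key groups several primitives under the same options.
        primitives = key if isinstance(key, tuple) else (key,)
        for primitive in primitives:
            if primitive in expanded:
                # Assumption of this sketch: duplicates are treated as an error.
                raise KeyError(f"multiple options given for {primitive!r}")
            expanded[primitive] = options
    return expanded


options = {
    "mode": {"ignore_dataframes": ["cohorts", "log"]},
    ("weekday", "day"): {"include_dataframes": ["customers"]},
}
for primitive, opts in expand_primitive_options(options).items():
    print(primitive, opts)
```

Writing separate 'weekday' and 'day' entries with identical dictionaries would express the same intent; the tuple form just avoids the repetition.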

Specifying Columns for Individual Primitives

Specific columns can also be explicitly included or ignored for a primitive or group of primitives. Columns to ignore are set by the ignore_columns option, while columns to include are set by include_columns. When the include_columns option is set, no other columns from that dataframe will be used to make features with the given primitive.

[4]:
# For 'mode', include only 'product_id' and 'zipcode' from 'log',
# 'device_type' from 'sessions', and 'cancel_reason' from 'customers'
# Ignore the columns 'signup_date' and 'cancel_date' for 'weekday'
features_list = ft.dfs(entityset=es,
                       target_dataframe_name='customers',
                       agg_primitives=['mode'],
                       trans_primitives=['weekday'],
                       primitive_options={
                          'mode': {'include_columns': {'log': ['product_id', 'zipcode'],
                                                       'sessions': ['device_type'],
                                                       'customers': ['cancel_reason']}},
                          'weekday': {'ignore_columns': {'customers': ['signup_date', 'cancel_date']}}
                       },
                       features_only=True)
features_list
[4]:
[<Feature: age>,
 <Feature: région_id>,
 <Feature: cohort>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: MODE(sessions.device_type)>,
 <Feature: MODE(log.product_id)>,
 <Feature: MODE(log.zipcode)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: WEEKDAY(upgrade_date)>,
 <Feature: cohorts.cohort_name>,
 <Feature: régions.language>,
 <Feature: MODE(sessions.MODE(log.product_id))>,
 <Feature: MODE(sessions.MODE(log.zipcode))>,
 <Feature: MODE(log.sessions.device_type)>,
 <Feature: cohorts.MODE(customers.cancel_reason)>,
 <Feature: cohorts.MODE(sessions.device_type)>,
 <Feature: cohorts.MODE(log.product_id)>,
 <Feature: cohorts.MODE(log.zipcode)>,
 <Feature: cohorts.WEEKDAY(cohort_end)>,
 <Feature: régions.MODE(customers.cancel_reason)>,
 <Feature: régions.MODE(sessions.device_type)>,
 <Feature: régions.MODE(log.product_id)>,
 <Feature: régions.MODE(log.zipcode)>]

Here, mode will only use the columns product_id and zipcode from the dataframe log, device_type from the dataframe sessions, and cancel_reason from customers. For any other dataframe, mode will use all columns. The weekday primitive will use all columns in all dataframes except for signup_date and cancel_date from the customers dataframe.
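The per-dataframe behavior of the column options can be modeled in a few lines of plain Python. This is a simplified sketch under stated assumptions: usable_columns is a hypothetical helper, the column lists are abbreviated, and the type checks that DFS performs (for example, weekday only applying to datetime columns) are left out.

```python
def usable_columns(dataframe, columns, options):
    """Return the columns of `dataframe` that a primitive may use as inputs.

    Hypothetical helper; a simplified model of the column options.
    """
    include = options.get("include_columns", {})
    ignore = options.get("ignore_columns", {})
    if dataframe in include:
        # Only the listed columns of this dataframe remain available.
        return [c for c in columns if c in include[dataframe]]
    # Dataframes without an include list keep all columns except ignored ones.
    return [c for c in columns if c not in ignore.get(dataframe, [])]


mode_options = {"include_columns": {"log": ["product_id", "zipcode"],
                                    "sessions": ["device_type"],
                                    "customers": ["cancel_reason"]}}
weekday_options = {"ignore_columns": {"customers": ["signup_date", "cancel_date"]}}

print(usable_columns("sessions", ["device_name", "device_type"], mode_options))
# ['device_type']
print(usable_columns("customers",
                     ["signup_date", "cancel_date", "upgrade_date", "date_of_birth"],
                     weekday_options))
# ['upgrade_date', 'date_of_birth']
```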

Specifying GroupBy Options

GroupBy transform primitives accept four additional options: include_groupby_dataframes, ignore_groupby_dataframes, include_groupby_columns, and ignore_groupby_columns. These options specify which dataframes and columns may be used to group the primitive’s inputs. By default, DFS only groups by foreign key columns. Specifying include_groupby_columns overrides this default, so DFS groups only by the columns given. In contrast, ignore_groupby_columns keeps the foreign key default and removes any listed foreign key columns from it. Note that any non-foreign key column included as a groupby must be a categorical column.

[5]:
features_list = ft.dfs(entityset=es,
                       target_dataframe_name='log',
                       agg_primitives=[],
                       trans_primitives=[],
                       groupby_trans_primitives=['cum_sum', 'cum_count'],
                       primitive_options={
                           'cum_sum': {'ignore_groupby_columns': {'log': ['product_id']}},
                           'cum_count': {'include_groupby_columns': {'log': ['product_id',
                                                                             'priority_level']},
                                         'ignore_groupby_dataframes': ['sessions']}
                       },
                       features_only=True)
features_list
[5]:
[<Feature: session_id>,
 <Feature: product_id>,
 <Feature: value>,
 <Feature: value_2>,
 <Feature: zipcode>,
 <Feature: countrycode>,
 <Feature: subregioncode>,
 <Feature: value_many_nans>,
 <Feature: priority_level>,
 <Feature: purchased>,
 <Feature: CUM_COUNT(product_id) by priority_level>,
 <Feature: CUM_COUNT(product_id) by product_id>,
 <Feature: CUM_SUM(value) by session_id>,
 <Feature: CUM_SUM(value_2) by session_id>,
 <Feature: CUM_SUM(value_many_nans) by session_id>,
 <Feature: sessions.customer_id>,
 <Feature: sessions.device_type>,
 <Feature: sessions.device_name>,
 <Feature: products.department>,
 <Feature: products.rating>,
 <Feature: sessions.customers.age>,
 <Feature: sessions.customers.région_id>,
 <Feature: sessions.customers.cohort>,
 <Feature: sessions.customers.loves_ice_cream>,
 <Feature: sessions.customers.cancel_reason>,
 <Feature: sessions.customers.engagement_level>,
 <Feature: CUM_SUM(products.rating) by session_id>,
 <Feature: CUM_SUM(products.rating) by sessions.customer_id>]

Here, cum_sum ignores product_id as a groupby but still groups by any other foreign key column in log or any other dataframe. For cum_count, only product_id and priority_level are used as groupbys. Because priority_level is not a foreign key column, cum_sum does not group by it, while cum_count does because the column is explicitly included. Finally, note that groupby options don’t affect which columns the primitive is applied to. For example, cum_count ignores the dataframe sessions for groupbys, but a feature such as <Feature: CUM_COUNT(sessions.device_name) by product_id> can still be made: the groupby comes from the target dataframe log, so the feature is valid given the associated options. To keep cum_count from using inputs from the sessions dataframe, the ignore_dataframes option for cum_count would need to include sessions.
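The choice of groupby columns for a single dataframe can be sketched in plain Python. This is a simplified model rather than Featuretools' implementation: groupby_columns is a hypothetical helper, the options are flattened to plain lists for one dataframe instead of being keyed by dataframe name, and the foreign keys of log are assumed to be session_id and product_id, as the output above suggests.

```python
def groupby_columns(foreign_keys, options):
    """Pick the columns a groupby transform primitive may group by.

    Hypothetical helper; a simplified model for a single dataframe.
    """
    include = options.get("include_groupby_columns")
    if include is not None:
        # An explicit include list replaces the foreign-key default.
        return list(include)
    ignore = options.get("ignore_groupby_columns", [])
    # Default: group by foreign keys, minus any that are ignored.
    return [c for c in foreign_keys if c not in ignore]


# Assumed foreign keys of the 'log' dataframe.
log_foreign_keys = ["session_id", "product_id"]

print(groupby_columns(log_foreign_keys, {"ignore_groupby_columns": ["product_id"]}))
# ['session_id']
print(groupby_columns(log_foreign_keys,
                      {"include_groupby_columns": ["product_id", "priority_level"]}))
# ['product_id', 'priority_level']
```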

Specifying Options for Each Input of Multiple-Input Primitives

For primitives that take multiple columns as input, such as Trend, the above options can be specified for each input by passing them in as a list. If only one option dictionary is given, it is used for all inputs. The length of the list provided must match the number of inputs the primitive takes.

[6]:
features_list = ft.dfs(entityset=es,
                       target_dataframe_name='customers',
                       agg_primitives=['trend'],
                       trans_primitives=[],
                       primitive_options={
                           'trend': [{'ignore_columns': {'log': ['value_many_nans']}},
                                     {'include_columns': {'customers': ['signup_date'],
                                                          'log': ['datetime']}}]
                       },
                       features_only=True)
features_list
[6]:
[<Feature: age>,
 <Feature: région_id>,
 <Feature: cohort>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: TREND(log.value, datetime)>,
 <Feature: TREND(log.value_2, datetime)>,
 <Feature: cohorts.cohort_name>,
 <Feature: régions.language>,
 <Feature: cohorts.TREND(customers.age, signup_date)>,
 <Feature: cohorts.TREND(log.value, datetime)>,
 <Feature: cohorts.TREND(log.value_2, datetime)>,
 <Feature: régions.TREND(customers.age, signup_date)>,
 <Feature: régions.TREND(log.value, datetime)>,
 <Feature: régions.TREND(log.value_2, datetime)>]

Here, we pass in a list of primitive options for trend. For the first input, we ignore the column value_many_nans from log. For the second input, we include only the columns signup_date from customers and datetime from log.
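The rule for matching options to inputs can be sketched as a small plain-Python helper. The function options_per_input is hypothetical and only mirrors the behavior described in this section: a single dictionary is shared by all inputs, and a list must contain one dictionary per input. The exact error Featuretools raises on a length mismatch is not shown in this guide, so the ValueError here is an assumption.

```python
def options_per_input(options, n_inputs):
    """Return one option dictionary per primitive input.

    Hypothetical helper mirroring the rule described above.
    """
    if isinstance(options, dict):
        # A single dictionary is shared by every input.
        return [options] * n_inputs
    if len(options) != n_inputs:
        # Assumption: a length mismatch is reported as a ValueError.
        raise ValueError(
            f"expected {n_inputs} option dictionaries, got {len(options)}"
        )
    return list(options)


trend_options = [
    {"ignore_columns": {"log": ["value_many_nans"]}},
    {"include_columns": {"customers": ["signup_date"], "log": ["datetime"]}},
]

# Trend takes two inputs: a numeric column and a datetime column.
value_options, time_options = options_per_input(trend_options, n_inputs=2)
print(value_options)
print(time_options)
```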