Specifying Primitive Options

By default, DFS applies primitives across all entities and columns. This behavior can be altered through a few different parameters. Entities and variables can optionally be ignored or included for an entire DFS run or on a per-primitive basis, giving greater control over the generated features and reducing runtime overhead.

In [1]: import featuretools as ft
   ...: from featuretools.tests.testing_utils import make_ecommerce_entityset

In [2]: es = make_ecommerce_entityset()

In [3]: feature_matrix, features_list = ft.dfs(entityset=es,
   ...:                                        target_entity='customers',
   ...:                                        agg_primitives=['mode'],
   ...:                                        trans_primitives=['weekday'])
   ...: 

In [4]: features_list
Out[4]: 
[<Feature: cohort>,
 <Feature: age>,
 <Feature: région_id>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: MODE(sessions.device_type)>,
 <Feature: MODE(sessions.device_name)>,
 <Feature: MODE(log.zipcode)>,
 <Feature: MODE(log.priority_level)>,
 <Feature: MODE(log.product_id)>,
 <Feature: MODE(log.countrycode)>,
 <Feature: MODE(log.subregioncode)>,
 <Feature: WEEKDAY(signup_date)>,
 <Feature: WEEKDAY(cancel_date)>,
 <Feature: WEEKDAY(upgrade_date)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: cohorts.cohort_name>,
 <Feature: régions.language>,
 <Feature: MODE(sessions.MODE(log.product_id))>,
 <Feature: MODE(sessions.MODE(log.zipcode))>,
 <Feature: MODE(sessions.MODE(log.countrycode))>,
 <Feature: MODE(sessions.MODE(log.priority_level))>,
 <Feature: MODE(sessions.MODE(log.subregioncode))>,
 <Feature: MODE(log.sessions.device_type)>,
 <Feature: MODE(log.sessions.customer_id)>,
 <Feature: MODE(log.sessions.device_name)>,
 <Feature: cohorts.MODE(customers.cancel_reason)>,
 <Feature: cohorts.MODE(customers.région_id)>,
 <Feature: cohorts.MODE(customers.engagement_level)>,
 <Feature: cohorts.MODE(sessions.device_type)>,
 <Feature: cohorts.MODE(sessions.device_name)>,
 <Feature: cohorts.MODE(log.zipcode)>,
 <Feature: cohorts.MODE(log.priority_level)>,
 <Feature: cohorts.MODE(log.product_id)>,
 <Feature: cohorts.MODE(log.countrycode)>,
 <Feature: cohorts.MODE(log.subregioncode)>,
 <Feature: cohorts.WEEKDAY(cohort_end)>,
 <Feature: régions.MODE(customers.cancel_reason)>,
 <Feature: régions.MODE(customers.cohort)>,
 <Feature: régions.MODE(customers.engagement_level)>,
 <Feature: régions.MODE(sessions.device_type)>,
 <Feature: régions.MODE(sessions.device_name)>,
 <Feature: régions.MODE(log.zipcode)>,
 <Feature: régions.MODE(log.priority_level)>,
 <Feature: régions.MODE(log.product_id)>,
 <Feature: régions.MODE(log.countrycode)>,
 <Feature: régions.MODE(log.subregioncode)>]

Specifying Options for an Entire Run

The ignore_entities and ignore_variables parameters of DFS control entities and variables (columns) that should be ignored for all primitives. This is useful for ignoring columns or entities that don’t relate to the problem or otherwise shouldn’t be included in the DFS run.

# ignore the 'log' and 'cohorts' entities entirely
# ignore the 'date_of_birth' variable in 'customers' and the 'device_name' variable in 'sessions'
In [5]: feature_matrix, features_list = ft.dfs(entityset=es,
   ...:                                        target_entity='customers',
   ...:                                        agg_primitives=['mode'],
   ...:                                        trans_primitives=['weekday'],
   ...:                                        ignore_entities=['log', 'cohorts'],
   ...:                                        ignore_variables={
   ...:                                            'sessions': ['device_name'],
   ...:                                            'customers': ['date_of_birth']})
   ...: 

In [6]: features_list
Out[6]: 
[<Feature: cohort>,
 <Feature: age>,
 <Feature: région_id>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: MODE(sessions.device_type)>,
 <Feature: WEEKDAY(signup_date)>,
 <Feature: WEEKDAY(cancel_date)>,
 <Feature: WEEKDAY(upgrade_date)>,
 <Feature: régions.language>,
 <Feature: régions.MODE(customers.cancel_reason)>,
 <Feature: régions.MODE(customers.cohort)>,
 <Feature: régions.MODE(customers.engagement_level)>,
 <Feature: régions.MODE(sessions.device_type)>]

DFS completely ignores the 'log' and 'cohorts' entities when creating features. It also ignores the variables 'device_name' and 'date_of_birth' in 'sessions' and 'customers' respectively. However, both of these options can be overridden by individual primitive options in the primitive_options parameter.
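
As a rough illustration of what the run-level options do, the following toy model filters a plain dictionary of entities and columns. It is only a sketch: the function name `apply_run_level_ignores` and the `schema` dictionary are hypothetical, the column lists are an illustrative subset, and none of this is how Featuretools implements the options internally.

```python
def apply_run_level_ignores(schema, ignore_entities=(), ignore_variables=None):
    """Return the entity -> variables mapping left after run-level ignores.

    Hypothetical helper for illustration only, not part of Featuretools.
    """
    ignore_variables = ignore_variables or {}
    ignored_entities = set(ignore_entities)
    remaining = {}
    for entity, variables in schema.items():
        if entity in ignored_entities:
            # The whole entity is dropped, so none of its variables survive.
            continue
        skipped = set(ignore_variables.get(entity, []))
        remaining[entity] = [v for v in variables if v not in skipped]
    return remaining


# Illustrative subset of the columns seen in the example above (assumption).
schema = {
    "customers": ["signup_date", "cancel_date", "upgrade_date", "date_of_birth"],
    "sessions": ["device_type", "device_name"],
    "log": ["zipcode", "product_id"],
    "cohorts": ["cohort_end"],
}

remaining = apply_run_level_ignores(
    schema,
    ignore_entities=["log", "cohorts"],
    ignore_variables={"sessions": ["device_name"],
                      "customers": ["date_of_birth"]},
)
print(remaining)
```

In this model, only 'customers' (without 'date_of_birth') and 'sessions' (without 'device_name') remain available to every primitive.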

Specifying Options for Individual Primitives

Options for individual primitives or groups of primitives are set through the primitive_options parameter of DFS, which maps primitives to the options that should apply to them. When options conflict, those set at this level override options set for the entire DFS run, and include options always take priority over their ignore counterparts. Using the string primitive name or the primitive type as a key applies the options to all primitives of that name. To set options for one specific instance of a primitive, use the primitive instance itself as the key in the primitive_options dictionary. Note, however, that an instance with its own options ignores any options set for the generic primitive under the primitive name or class.
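
The lookup order described above can be sketched with a small toy model. Everything here is an assumption made for illustration: the classes `Mode` and `Day` are bare stand-ins rather than the real primitives, and `expand_keys` and `options_for` are hypothetical helpers that only mirror the stated rule that an instance key wins over a name or class key.

```python
class Mode:
    """Stand-in for a primitive class; only the name attribute matters here."""
    name = "mode"


class Day:
    """Second stand-in primitive, used to show tuple keys."""
    name = "day"


def expand_keys(primitive_options):
    """Flatten tuple keys such as ('weekday', 'day') into one entry each."""
    flat = {}
    for key, options in primitive_options.items():
        keys = key if isinstance(key, tuple) else (key,)
        for single_key in keys:
            flat[single_key] = options
    return flat


def options_for(primitive, primitive_options):
    """Pick the options that apply to one primitive instance (toy model)."""
    flat = expand_keys(primitive_options)
    if primitive in flat:
        # Instance-specific options replace the generic ones entirely.
        return flat[primitive]
    for key in (type(primitive), primitive.name):
        if key in flat:
            return flat[key]
    return {}


generic_mode = Mode()
special_mode = Mode()
primitive_options = {
    "mode": {"ignore_entities": ["log"]},
    special_mode: {"include_entities": ["sessions"]},
    ("weekday", "day"): {"include_entities": ["customers"]},
}

print(options_for(generic_mode, primitive_options))
print(options_for(special_mode, primitive_options))
print(options_for(Day(), primitive_options))
```

In this sketch, `special_mode` never sees the 'ignore_entities' setting stored under 'mode', which matches the note that instance-level options do not inherit from the generic entry.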

Specifying Entities for Individual Primitives

Which entities to include/ignore can also be specified for a single primitive or a group of primitives. Entities can be ignored using the ignore_entities option in primitive_options, while entities to explicitly include are set by the include_entities option. When include_entities is given, all entities not listed are ignored by the primitive. No variables from any excluded entity will be used to generate features with the given primitive.

# ignore the 'cohorts' and 'log' entities, but only for the primitive 'mode'
# include only the 'customers' entity for the primitives 'weekday' and 'day'
In [7]: feature_matrix, features_list = ft.dfs(entityset=es,
   ...:                                        target_entity='customers',
   ...:                                        agg_primitives=['mode'],
   ...:                                        trans_primitives=['weekday', 'day'],
   ...:                                        primitive_options={
   ...:                                            'mode': {'ignore_entities': ['cohorts', 'log']},
   ...:                                            ('weekday', 'day'): {'include_entities': ['customers']}
   ...:                                        })
   ...: 

In [8]: features_list
Out[8]: 
[<Feature: cohort>,
 <Feature: age>,
 <Feature: région_id>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: MODE(sessions.device_type)>,
 <Feature: MODE(sessions.device_name)>,
 <Feature: WEEKDAY(signup_date)>,
 <Feature: WEEKDAY(cancel_date)>,
 <Feature: WEEKDAY(upgrade_date)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: DAY(signup_date)>,
 <Feature: DAY(cancel_date)>,
 <Feature: DAY(upgrade_date)>,
 <Feature: DAY(date_of_birth)>,
 <Feature: cohorts.cohort_name>,
 <Feature: régions.language>,
 <Feature: cohorts.MODE(customers.cancel_reason)>,
 <Feature: cohorts.MODE(customers.région_id)>,
 <Feature: cohorts.MODE(customers.engagement_level)>,
 <Feature: cohorts.MODE(sessions.device_type)>,
 <Feature: cohorts.MODE(sessions.device_name)>,
 <Feature: régions.MODE(customers.cancel_reason)>,
 <Feature: régions.MODE(customers.cohort)>,
 <Feature: régions.MODE(customers.engagement_level)>,
 <Feature: régions.MODE(sessions.device_type)>,
 <Feature: régions.MODE(sessions.device_name)>]

In this example, DFS would only use the 'customers' entity for both weekday and day, and would use all entities except 'cohorts' and 'log' for mode.
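
The interaction between include_entities and ignore_entities for a single primitive can be modeled in a few lines. This is a simplified sketch under stated assumptions: `entities_for_primitive` is a hypothetical helper, the entity list is an illustrative subset, and the interaction with run-level ignores is deliberately left out.

```python
def entities_for_primitive(all_entities, options):
    """Entities a primitive may draw variables from (toy model).

    include_entities takes priority over ignore_entities when both are set.
    """
    if "include_entities" in options:
        allowed = set(options["include_entities"])
    else:
        allowed = set(all_entities) - set(options.get("ignore_entities", []))
    # Preserve the original ordering of the entities.
    return [entity for entity in all_entities if entity in allowed]


# Illustrative subset of the entities in the example entity set (assumption).
ALL_ENTITIES = ["customers", "sessions", "log", "cohorts", "products"]

mode_entities = entities_for_primitive(
    ALL_ENTITIES, {"ignore_entities": ["cohorts", "log"]})
weekday_entities = entities_for_primitive(
    ALL_ENTITIES, {"include_entities": ["customers"]})

print(mode_entities)     # ['customers', 'sessions', 'products']
print(weekday_entities)  # ['customers']
```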

Specifying Columns for Individual Primitives

Specific variables (columns) can also be explicitly included or ignored for a primitive or group of primitives. Variables to ignore are set by the ignore_variables option, while variables to include are set by include_variables. When the include_variables option is set for an entity, no other variables from that entity will be used to make features with the given primitive.

# Include only the variables 'product_id' and 'zipcode' from 'log', 'device_type' from 'sessions', and 'cancel_reason' from 'customers' for 'mode'
# Ignore the variables 'signup_date' and 'cancel_date' in 'customers' for 'weekday'
In [9]: feature_matrix, features_list = ft.dfs(entityset=es,
   ...:                                        target_entity='customers',
   ...:                                        agg_primitives=['mode'],
   ...:                                        trans_primitives=['weekday'],
   ...:                                        primitive_options={
   ...:                                            'mode': {'include_variables': {'log': ['product_id', 'zipcode'],
   ...:                                                                           'sessions': ['device_type'],
   ...:                                                                           'customers': ['cancel_reason']}},
   ...:                                            'weekday': {'ignore_variables': {'customers':
   ...:                                                                                 ['signup_date',
   ...:                                                                                  'cancel_date']}}})
   ...: 

In [10]: features_list
Out[10]: 
[<Feature: cohort>,
 <Feature: age>,
 <Feature: région_id>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: MODE(sessions.device_type)>,
 <Feature: MODE(log.zipcode)>,
 <Feature: MODE(log.product_id)>,
 <Feature: WEEKDAY(upgrade_date)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: cohorts.cohort_name>,
 <Feature: régions.language>,
 <Feature: MODE(sessions.MODE(log.product_id))>,
 <Feature: MODE(sessions.MODE(log.zipcode))>,
 <Feature: MODE(log.sessions.device_type)>,
 <Feature: cohorts.MODE(customers.cancel_reason)>,
 <Feature: cohorts.MODE(sessions.device_type)>,
 <Feature: cohorts.MODE(log.zipcode)>,
 <Feature: cohorts.MODE(log.product_id)>,
 <Feature: cohorts.WEEKDAY(cohort_end)>,
 <Feature: régions.MODE(customers.cancel_reason)>,
 <Feature: régions.MODE(sessions.device_type)>,
 <Feature: régions.MODE(log.zipcode)>,
 <Feature: régions.MODE(log.product_id)>]

Here, mode will only use the variables 'product_id' and 'zipcode' from the entity 'log', 'device_type' from the entity 'sessions', and 'cancel_reason' from 'customers'. For any other entity, mode will use all variables. The weekday primitive will use all variables in all entities except for 'signup_date' and 'cancel_date' from the 'customers' entity.
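
The per-entity behavior described here (listed entities are restricted, unlisted entities are left alone) can be sketched as follows. The helper `variables_for_primitive` and the `schema` dictionary are hypothetical and only model the rule; they are not the library's implementation.

```python
def variables_for_primitive(schema, options):
    """Variables a primitive may use in each entity (toy model)."""
    include = options.get("include_variables", {})
    ignore = options.get("ignore_variables", {})
    result = {}
    for entity, variables in schema.items():
        if entity in include:
            # Only the listed variables of this entity are kept.
            wanted = set(include[entity])
            result[entity] = [v for v in variables if v in wanted]
        else:
            # Entities without an include list keep everything not ignored.
            skipped = set(ignore.get(entity, []))
            result[entity] = [v for v in variables if v not in skipped]
    return result


# Illustrative subset of datetime columns per entity (assumption).
schema = {
    "customers": ["signup_date", "cancel_date", "upgrade_date", "date_of_birth"],
    "cohorts": ["cohort_end"],
}

weekday_variables = variables_for_primitive(
    schema,
    {"ignore_variables": {"customers": ["signup_date", "cancel_date"]}},
)
print(weekday_variables)
```

The result keeps 'upgrade_date' and 'date_of_birth' in 'customers' and leaves 'cohorts' untouched, in line with the WEEKDAY features listed in the output above.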

Specifying GroupBy Options

Groupby transform primitives also accept the additional options include_groupby_entities, ignore_groupby_entities, include_groupby_variables, and ignore_groupby_variables. These options specify which entities and columns to include or ignore as groupings for inputs. By default, DFS only groups by ID columns. Specifying include_groupby_variables overrides this default, so that DFS groups only by the variables given. In contrast, with ignore_groupby_variables DFS still groups only by ID columns, but skips any of the listed variables that are ID columns. Note that any non-ID columns included as groupings must be of a discrete type.

In [11]: feature_matrix, features_list = ft.dfs(entityset=es,
   ....:                                        target_entity='log',
   ....:                                        agg_primitives=[],
   ....:                                        trans_primitives=[],
   ....:                                        groupby_trans_primitives=['cum_sum',
   ....:                                                                  'cum_count'],
   ....:                                        primitive_options={
   ....:                                              'cum_sum': {'ignore_groupby_variables': {'log': ['product_id']}},
   ....:                                              'cum_count': {'include_groupby_variables': {'log': ['product_id',
   ....:                                                                                                  'priority_level']},
   ....:                                                            'ignore_groupby_entities': ['sessions']}})
   ....: 

In [12]: features_list
Out[12]: 
[<Feature: session_id>,
 <Feature: product_id>,
 <Feature: value>,
 <Feature: value_2>,
 <Feature: zipcode>,
 <Feature: countrycode>,
 <Feature: subregioncode>,
 <Feature: value_many_nans>,
 <Feature: priority_level>,
 <Feature: purchased>,
 <Feature: CUM_SUM(value_many_nans) by session_id>,
 <Feature: CUM_SUM(value) by session_id>,
 <Feature: CUM_SUM(value_2) by session_id>,
 <Feature: CUM_COUNT(session_id) by priority_level>,
 <Feature: CUM_COUNT(session_id) by product_id>,
 <Feature: CUM_COUNT(product_id) by priority_level>,
 <Feature: CUM_COUNT(product_id) by product_id>,
 <Feature: sessions.device_name>,
 <Feature: sessions.customer_id>,
 <Feature: sessions.device_type>,
 <Feature: products.rating>,
 <Feature: products.department>,
 <Feature: sessions.customers.cohort>,
 <Feature: sessions.customers.age>,
 <Feature: sessions.customers.région_id>,
 <Feature: sessions.customers.loves_ice_cream>,
 <Feature: sessions.customers.cancel_reason>,
 <Feature: sessions.customers.engagement_level>,
 <Feature: CUM_SUM(products.rating) by session_id>,
 <Feature: CUM_SUM(products.rating) by sessions.customer_id>,
 <Feature: CUM_COUNT(sessions.customer_id) by priority_level>,
 <Feature: CUM_COUNT(sessions.customer_id) by products.department>,
 <Feature: CUM_COUNT(sessions.customer_id) by product_id>]

We ignore 'product_id' as a groupby for cum_sum but still use any other ID columns in that or any other entity. For cum_count, we use only 'product_id' and 'priority_level' as groupbys from 'log'. Note that cum_sum doesn't use 'priority_level' because it's not an ID column, but we explicitly include it for cum_count.

Finally, note that specifying groupby options doesn't affect which features the primitive is applied to. For example, cum_count ignores the entity 'sessions' for groupbys, but the feature <Feature: CUM_COUNT(sessions.customer_id) by product_id> is still made. The groupby is from the target entity 'log', so the feature is valid given the associated options. To ignore the 'sessions' entity for cum_count, the ignore_entities option for cum_count would need to include 'sessions'.
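
The groupby rules for a single entity can be sketched as a toy model. The helper `groupby_candidates` is hypothetical, the column classification for 'log' is an assumption based on the output above, and raising an error for non-discrete columns is a choice made for this sketch rather than documented library behavior. The arguments are the per-entity lists, that is, the values stored under the 'log' key in the options.

```python
def groupby_candidates(id_columns, discrete_columns, options):
    """Columns of one entity that may be used as groupings (toy model)."""
    include = options.get("include_groupby_variables")
    if include is not None:
        # An explicit include list replaces the "ID columns only" default.
        allowed = set(id_columns) | set(discrete_columns)
        invalid = [c for c in include if c not in allowed]
        if invalid:
            raise ValueError(f"not usable as groupby columns: {invalid}")
        return list(include)
    # Otherwise only ID columns are used, minus any that are ignored.
    ignored = set(options.get("ignore_groupby_variables", []))
    return [c for c in id_columns if c not in ignored]


# Assumed classification of some 'log' columns, for illustration only.
LOG_ID_COLUMNS = ["session_id", "product_id"]
LOG_DISCRETE_COLUMNS = ["priority_level"]

cum_sum_groupbys = groupby_candidates(
    LOG_ID_COLUMNS, LOG_DISCRETE_COLUMNS,
    {"ignore_groupby_variables": ["product_id"]})
cum_count_groupbys = groupby_candidates(
    LOG_ID_COLUMNS, LOG_DISCRETE_COLUMNS,
    {"include_groupby_variables": ["product_id", "priority_level"]})

print(cum_sum_groupbys)    # ['session_id']
print(cum_count_groupbys)  # ['product_id', 'priority_level']
```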

Specifying Options for Each Input of Multiple Input Primitives

For primitives that take multiple columns as input, such as Trend, the above options can be specified for each input by passing them in as a list. If only one option dictionary is given, it is used for all inputs. The length of the list provided must match the number of inputs the primitive takes.

In [13]: feature_matrix, features_list = ft.dfs(entityset=es,
   ....:                                        target_entity='customers',
   ....:                                        agg_primitives=['trend'],
   ....:                                        trans_primitives=[],
   ....:                                        primitive_options={
   ....:                                              'trend': [{'ignore_variables': {'log': ['value_many_nans']}},
   ....:                                                        {'include_variables': {'customers': ['signup_date'],
   ....:                                                                               'log': ['datetime']}}]})
   ....: 

In [14]: features_list
Out[14]: 
[<Feature: cohort>,
 <Feature: age>,
 <Feature: région_id>,
 <Feature: loves_ice_cream>,
 <Feature: cancel_reason>,
 <Feature: engagement_level>,
 <Feature: TREND(log.value_2, datetime)>,
 <Feature: TREND(log.value, datetime)>,
 <Feature: cohorts.cohort_name>,
 <Feature: régions.language>,
 <Feature: cohorts.TREND(customers.age, signup_date)>,
 <Feature: cohorts.TREND(log.value_2, datetime)>,
 <Feature: cohorts.TREND(log.value, datetime)>,
 <Feature: régions.TREND(customers.age, signup_date)>,
 <Feature: régions.TREND(log.value_2, datetime)>,
 <Feature: régions.TREND(log.value, datetime)>]

Here, we pass in a list of primitive options for trend. We ignore the variable 'value_many_nans' from 'log' for the first input to trend, and for the second input we include only 'signup_date' from 'customers' and 'datetime' from 'log'.
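
The rule that a single option dictionary applies to every input, while a list must supply one dictionary per input, can be sketched as below. The helper `options_per_input` is hypothetical and only models the rule stated in this section; the exact error Featuretools raises on a length mismatch is not shown here.

```python
def options_per_input(options, n_inputs):
    """Expand primitive options into one dictionary per input (toy model)."""
    if isinstance(options, dict):
        # A single dictionary is reused for every input.
        return [options for _ in range(n_inputs)]
    if len(options) != n_inputs:
        raise ValueError(
            f"expected {n_inputs} option dictionaries, got {len(options)}")
    return list(options)


# Trend takes two inputs: the value column and the time column.
trend_options = [
    {"ignore_variables": {"log": ["value_many_nans"]}},
    {"include_variables": {"customers": ["signup_date"],
                           "log": ["datetime"]}},
]

per_input = options_per_input(trend_options, n_inputs=2)
shared = options_per_input({"ignore_entities": ["cohorts"]}, n_inputs=2)

print(per_input[1])
print(shared)
```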