Generating Feature Descriptions

As features become more complicated, their names can become harder to understand. Both the featuretools.describe_feature() function and the featuretools.graph_feature() function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the describe_feature function can be augmented by providing custom definitions and templates to improve the resulting descriptions.

By default, describe_feature uses the existing variable and entity names and the default primitive description templates to generate feature descriptions.

In [1]: feature_defs[8]
Out[1]: <Feature: HOUR(join_date)>

In [2]: ft.describe_feature(feature_defs[8])
Out[2]: 'The hour value of the "join_date".'
In [3]: feature_defs[12]
Out[3]: <Feature: MEAN(sessions.SUM(transactions.amount))>

In [4]: ft.describe_feature(feature_defs[12])
Out[4]: 'The average of the sum of the "amount" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'

Improving Descriptions

While the default descriptions can be helpful, they can also be further improved by providing custom definitions of variables and features, and by providing alternative templates for primitive descriptions.

Feature Descriptions

Custom feature definitions will get used in the description in place of the automatically generated description. This can be used to better explain what a variable or feature is, or to provide descriptions that take advantage of a user’s existing knowledge about the data or domain.

In [5]: feature_descriptions = {'customers: join_date': 'the date the customer joined'}

In [6]: ft.describe_feature(feature_defs[8], feature_descriptions=feature_descriptions)
Out[6]: 'The hour value of the date the customer joined.'

For example, the above replaces the variable name "join_date" with a more descriptive definition of what that variable represents in the dataset. Variable descriptions can also be set directly on the variable through the description attribute:

In [7]: es['customers']['join_date'].description = 'the date the customer joined'

In [8]: feature = ft.TransformFeature(es['customers']['join_date'], ft.primitives.Hour)

In [9]: feature
Out[9]: <Feature: HOUR(join_date)>

In [10]: ft.describe_feature(feature)
Out[10]: 'The hour value of the date the customer joined.'

Variable descriptions must be set on the variable before the feature is created in order for descriptions to propagate. Note that if a description is set directly on a variable and a description is passed to describe_feature with feature_descriptions, describe_feature will use the description found in feature_descriptions. Feature descriptions can also be provided for generated features.

In [11]: feature_descriptions = {
   ....:     'sessions: SUM(transactions.amount)': 'the total transaction amount for a session'}
   ....: 

In [12]: ft.describe_feature(feature_defs[12], feature_descriptions=feature_descriptions)
Out[12]: 'The average of the total transaction amount for a session of all instances of "sessions" for each "customer_id" in "customers".'

Here, we create and pass in a custom description of the intermediate feature SUM(transactions.amount). The description for MEAN(sessions.SUM(transactions.amount)), which is built on top of SUM(transactions.amount), uses the custom description in place of the automatically generated one. Feature descriptions can be passed in as a dictionary that maps the custom descriptions to either the feature object itself or the unique feature name in the form "[entity_name]: [feature_name]", as shown above.

Primitive Templates

Primitives descriptions are generated using primitive templates. By default, these are defined using the description_template attribute on the primitive. Primitives without a template default to using the name attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take input feature descriptions as the positional arguments. These can be overwritten by mapping primitive instances or primitive names to custom templates and passing them into describe_feature through the primitive_templates argument.

In [13]: primitive_templates = {'sum': 'the total of {}'}

In [14]: feature_defs[6]
Out[14]: <Feature: SUM(transactions.amount)>

In [15]: ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)
Out[15]: 'The total of the "amount" of all instances of "transactions" for each "customer_id" in "customers".'

In this example, we override the default template of 'the sum of {}' with our custom template 'the total of {}'. The description uses our custom template instead of the default.

Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first primitive template is always the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number converted to the “nth” form is available through the nth_slice keyword.

In [16]: feature = feature_defs[5]

In [17]: feature
Out[17]: <Feature: N_MOST_COMMON(transactions.product_id)>

In [18]: primitive_templates = {
   ....:     'n_most_common': [
   ....:         'the 3 most common elements of {}', # generic multi-output feature
   ....:         'the {nth_slice} most common element of {}']} # template for each slice
   ....: 

In [19]: ft.describe_feature(feature, primitive_templates=primitive_templates)
Out[19]: 'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

Notice how the multi-output feature uses the first template for its description. Each slice of this feature will use the second slice template:

In [20]: ft.describe_feature(feature[0], primitive_templates=primitive_templates)
Out[20]: 'The 1st most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

In [21]: ft.describe_feature(feature[1], primitive_templates=primitive_templates)
Out[21]: 'The 2nd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

In [22]: ft.describe_feature(feature[2], primitive_templates=primitive_templates)
Out[22]: 'The 3rd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

Alternatively, instead of supplying a single template for all slices, templates can be provided for each slice to further customize the output. Note that in this case, each slice must get its own template.

In [23]: primitive_templates = {
   ....:     'n_most_common': [
   ....:         'the 3 most common elements of {}',
   ....:         'the most common element of {}',
   ....:         'the second most common element of {}',
   ....:         'the third most common element of {}']}
   ....: 

In [24]: ft.describe_feature(feature, primitive_templates=primitive_templates)
Out[24]: 'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

In [25]: ft.describe_feature(feature[0], primitive_templates=primitive_templates)
Out[25]: 'The most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

In [26]: ft.describe_feature(feature[1], primitive_templates=primitive_templates)
Out[26]: 'The second most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

In [27]: ft.describe_feature(feature[2], primitive_templates=primitive_templates)
Out[27]: 'The third most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

Custom feature descriptions and primitive templates can also be seperately defined in a JSON file and passed to the describe_feature function using the metadata_file keyword argument. Descriptions passed in directly through the feature_descriptions and primitive_templates keyword arguments will take precedence over any descriptions provided in the JSON metadata file.