Generating Feature Descriptions

As features become more complicated, their names can become harder to understand. Both the describe_feature function and the graph_feature function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the describe_feature function can be augmented by providing custom definitions and templates to improve the resulting descriptions.

By default, describe_feature uses the existing column and DataFrame names and the default primitive description templates to generate feature descriptions.

[2]:
feature_defs[9]
[2]:
<Feature: MONTH(birthday)>
[3]:
ft.describe_feature(feature_defs[9])
[3]:
'The month of the "birthday".'
[4]:
feature_defs[14]
[4]:
<Feature: MODE(sessions.MODE(transactions.product_id))>
[5]:
ft.describe_feature(feature_defs[14])
[5]:
'The most frequently occurring value of the most frequently occurring value of the "product_id" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'

Improving Descriptions

While the default descriptions can be helpful, they can also be further improved by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions.

Feature Descriptions

Custom feature descriptions are used in place of the automatically generated descriptions. They can better explain what a column or feature is, or provide descriptions that take advantage of a user's existing knowledge about the data or domain.

[6]:
feature_descriptions = {'customers: join_date': 'the date the customer joined'}

ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)
[6]:
'The month of the "birthday".'

For example, the dictionary above supplies a more descriptive definition for the column "join_date", which replaces the column name in the description of any feature built on that column. (The feature described here, MONTH(birthday), does not use "join_date", so its description is unchanged.) Descriptions can also be set directly on a column in a DataFrame by going through the Woodwork typing information to access the description attribute present on each ColumnSchema:

[7]:
join_date_column_schema = es['customers'].ww.columns['join_date']
join_date_column_schema.description = 'the date the customer joined'

es['customers'].ww.columns['join_date'].description
[7]:
'the date the customer joined'
[8]:
feature = ft.TransformFeature(es['customers'].ww['join_date'], ft.primitives.Hour)
feature
[8]:
<Feature: HOUR(join_date)>
[9]:
ft.describe_feature(feature)
[9]:
'The hour value of the date the customer joined.'
Note: When setting a description on a column in a DataFrame as described above, avoid setting it via df.ww[col_name].ww.description. Using df.ww[col_name] creates an entirely new Series object that is not related to the EntitySet from which feature descriptions are built. Setting the description in any way other than through the columns attribute will therefore not propagate to the feature description.

Descriptions must be set for a column in a DataFrame before the feature is created in order for them to propagate. Note that if a description is both set directly on a column and passed to describe_feature with feature_descriptions, the description in the feature_descriptions parameter takes precedence.
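The precedence rule above can be sketched in plain Python. This is only an illustration of the lookup order, not Featuretools code: the function resolve_column_description and its parameters are hypothetical names, and the quoted-name fallback simply mirrors the default outputs shown earlier.

```python
from typing import Mapping, Optional


def resolve_column_description(
    dataframe_name: str,
    column_name: str,
    column_description: Optional[str] = None,
    feature_descriptions: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: pick the description used for one column."""
    feature_descriptions = feature_descriptions or {}
    key = f"{dataframe_name}: {column_name}"

    # 1. A description passed through feature_descriptions wins.
    if key in feature_descriptions:
        return feature_descriptions[key]
    # 2. Otherwise use the description set on the column itself.
    if column_description is not None:
        return column_description
    # 3. Otherwise fall back to the quoted column name.
    return f'the "{column_name}"'


default = resolve_column_description("customers", "join_date")
from_column = resolve_column_description(
    "customers", "join_date", column_description="the date the customer joined"
)
from_argument = resolve_column_description(
    "customers",
    "join_date",
    column_description="the date the customer joined",
    feature_descriptions={"customers: join_date": "the signup date"},
)
```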

Feature descriptions can also be provided for generated features.

[10]:
feature_descriptions = {
    'sessions: SUM(transactions.amount)': 'the total transaction amount for a session'}

feature_defs[14]
[10]:
<Feature: MODE(sessions.MODE(transactions.product_id))>
[11]:
ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)
[11]:
'The most frequently occurring value of the most frequently occurring value of the "product_id" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'

Here, we create and pass in a custom description of the intermediate feature SUM(transactions.amount). The description for a feature built on top of it, such as MEAN(sessions.SUM(transactions.amount)), uses the custom description in place of the automatically generated one. (The feature described above, MODE(sessions.MODE(transactions.product_id)), is not built on SUM(transactions.amount), so its description is unchanged.) Feature descriptions can be passed in as a dictionary that maps either the feature object itself or the unique feature name, in the form "[dataframe_name]: [feature_name]", to the custom description, as shown above.
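To make the substitution of an intermediate feature's description concrete, here is a minimal sketch in plain Python. ToyFeature and describe are hypothetical stand-ins, not Featuretools classes or functions, and the templates are simplified (the real aggregation descriptions also mention the DataFrames involved).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyFeature:
    """Hypothetical stand-in for a feature; not a Featuretools class."""
    dataframe_name: str
    name: str
    template: str = ""                      # e.g. "the sum of {}"
    inputs: Tuple["ToyFeature", ...] = ()   # features this one is built on

    def unique_name(self) -> str:
        return f"{self.dataframe_name}: {self.name}"


def describe(feature: ToyFeature, feature_descriptions: dict) -> str:
    # A custom description may be keyed by the feature object or its unique name.
    if feature in feature_descriptions:
        return feature_descriptions[feature]
    if feature.unique_name() in feature_descriptions:
        return feature_descriptions[feature.unique_name()]
    # A plain column falls back to its quoted name.
    if not feature.inputs:
        return f'the "{feature.name}"'
    # Otherwise describe the inputs first, then fill in the template.
    parts = [describe(f, feature_descriptions) for f in feature.inputs]
    return feature.template.format(*parts)


amount = ToyFeature("transactions", "amount")
session_sum = ToyFeature(
    "sessions", "SUM(transactions.amount)", "the sum of {}", (amount,)
)
customer_mean = ToyFeature(
    "customers", "MEAN(sessions.SUM(transactions.amount))",
    "the average of {}", (session_sum,),
)

custom = {"sessions: SUM(transactions.amount)": "the total transaction amount for a session"}
default_text = describe(customer_mean, {})
custom_text = describe(customer_mean, custom)
by_object_text = describe(customer_mean, {session_sum: "the session total"})
```

Because the lookup happens before recursing into the inputs, a custom description for an intermediate feature replaces everything that would otherwise have been generated beneath it.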

Primitive Templates

Primitive descriptions are generated using primitive templates. By default, these are defined using the description_template attribute on the primitive. Primitives without a template default to using the name attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take input feature descriptions as positional arguments. They can be overridden by mapping primitive instances or primitive names to custom templates and passing the mapping into describe_feature through the primitive_templates argument.

[12]:
primitive_templates = {'sum': 'the total of {}'}

feature_defs[6]
[12]:
<Feature: SUM(transactions.amount)>
[13]:
ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)
[13]:
'The total of the "amount" of all instances of "transactions" for each "customer_id" in "customers".'

In this example, we override the default template of 'the sum of {}' with our custom template 'the total of {}'. The description uses our custom template instead of the default.
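The override can be pictured as an ordinary string-formatting step. The sketch below is plain Python, not the library's implementation: DEFAULT_TEMPLATES and render_description are hypothetical names, and the capitalization and trailing period are inferred from the outputs shown on this page.

```python
from typing import Mapping, Optional, Sequence

# Assumed default, taken from the text above ('the sum of {}').
DEFAULT_TEMPLATES = {"sum": "the sum of {}"}


def render_description(
    primitive_name: str,
    input_descriptions: Sequence[str],
    primitive_templates: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: fill a primitive template with input descriptions."""
    # Custom templates take priority over the defaults.
    templates = {**DEFAULT_TEMPLATES, **(primitive_templates or {})}
    # Input descriptions are passed as positional arguments to the template.
    body = templates[primitive_name].format(*input_descriptions)
    # Capitalize the first letter and end with a period.
    return body[:1].upper() + body[1:] + "."


default_sentence = render_description("sum", ['the "amount"'])
custom_sentence = render_description(
    "sum", ['the "amount"'], primitive_templates={"sum": "the total of {}"}
)
```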

Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first template in the list always describes the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number, converted to its "nth" form (1st, 2nd, 3rd, and so on), is available to the template through the nth_slice keyword.

[14]:
feature = feature_defs[5]
feature
[14]:
<Feature: N_MOST_COMMON(transactions.product_id)>
[15]:
primitive_templates = {
    'n_most_common': [
        'the 3 most common elements of {}', # generic multi-output feature
        'the {nth_slice} most common element of {}']} # template for each slice

ft.describe_feature(feature, primitive_templates=primitive_templates)
[15]:
'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

Notice how the multi-output feature uses the first template for its description. Each slice of this feature uses the second template:

[16]:
ft.describe_feature(feature[0], primitive_templates=primitive_templates)
[16]:
'The 1st most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[17]:
ft.describe_feature(feature[1], primitive_templates=primitive_templates)
[17]:
'The 2nd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[18]:
ft.describe_feature(feature[2], primitive_templates=primitive_templates)
[18]:
'The 3rd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'

Alternatively, instead of supplying a single template for all slices, templates can be provided for each slice to further customize the output. Note that in this case, each slice must get its own template.

[19]:
primitive_templates = {
    'n_most_common': [
        'the 3 most common elements of {}',
        'the most common element of {}',
        'the second most common element of {}',
        'the third most common element of {}']}

ft.describe_feature(feature, primitive_templates=primitive_templates)
[19]:
'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[20]:
ft.describe_feature(feature[0], primitive_templates=primitive_templates)
[20]:
'The most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[21]:
ft.describe_feature(feature[1], primitive_templates=primitive_templates)
[21]:
'The second most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[22]:
ft.describe_feature(feature[2], primitive_templates=primitive_templates)
[22]:
'The third most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
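The template-selection rules for multi-output primitives can be summarized in a short plain-Python sketch. The helpers to_nth and describe_output are hypothetical and only mimic the behavior described above; they are not Featuretools functions.

```python
from typing import Optional, Sequence


def to_nth(n: int) -> str:
    """Convert 1 -> '1st', 2 -> '2nd', 3 -> '3rd', 11 -> '11th', and so on."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def describe_output(
    templates: Sequence[str],
    input_description: str,
    slice_num: Optional[int] = None,
) -> str:
    """Hypothetical helper: pick and fill the template for a feature or slice."""
    if slice_num is None:
        # The first template describes the whole multi-output feature.
        template = templates[0]
    elif len(templates) == 2:
        # A single slice template is shared by every slice.
        template = templates[1]
    else:
        # Otherwise each slice needs its own template (IndexError if missing).
        template = templates[slice_num + 1]
    # nth_slice is offered as a keyword; templates that ignore it still work.
    nth = to_nth(slice_num + 1) if slice_num is not None else ""
    return template.format(input_description, nth_slice=nth)


shared = ["the 3 most common elements of {}",
          "the {nth_slice} most common element of {}"]
per_slice = ["the 3 most common elements of {}",
             "the most common element of {}",
             "the second most common element of {}",
             "the third most common element of {}"]

overall = describe_output(shared, 'the "product_id"')
second_shared = describe_output(shared, 'the "product_id"', slice_num=1)
second_custom = describe_output(per_slice, 'the "product_id"', slice_num=1)
```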

Custom feature descriptions and primitive templates can also be defined separately in a JSON file and passed to the describe_feature function using the metadata_file keyword argument. Descriptions passed in directly through the feature_descriptions and primitive_templates keyword arguments take precedence over any descriptions provided in the JSON metadata file.
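As a rough sketch of how such a file might be prepared and how the precedence works, the following plain-Python example writes a metadata file and merges its contents with directly passed values. The file layout (top-level keys feature_descriptions and primitive_templates, mirroring the keyword arguments) is an assumption to verify against the Featuretools documentation, and load_metadata and merge_with_overrides are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: one top-level key per keyword argument.
metadata = {
    "feature_descriptions": {
        "customers: join_date": "the date the customer joined",
    },
    "primitive_templates": {
        "sum": "the total of {}",
    },
}


def load_metadata(path):
    """Hypothetical helper: read both mappings from a JSON metadata file."""
    with open(path) as f:
        data = json.load(f)
    return data.get("feature_descriptions", {}), data.get("primitive_templates", {})


def merge_with_overrides(from_file, passed_directly=None):
    """Values passed directly replace values that came from the file."""
    return {**from_file, **(passed_directly or {})}


with tempfile.TemporaryDirectory() as tmp:
    metadata_path = Path(tmp) / "description_metadata.json"
    metadata_path.write_text(json.dumps(metadata, indent=2))
    file_descriptions, file_templates = load_metadata(metadata_path)

merged_descriptions = merge_with_overrides(file_descriptions)
merged_templates = merge_with_overrides(
    file_templates, {"sum": "the grand total of {}"}
)
```

With Featuretools itself, the path to such a file would be supplied as ft.describe_feature(feature, metadata_file=path), optionally alongside feature_descriptions or primitive_templates for values that should override the file.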