Generating Feature Descriptions
As features become more complicated, their names can become harder to understand. Both the describe_feature function and the graph_feature function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the describe_feature function can be augmented by providing custom definitions and templates to improve the resulting descriptions.
By default, describe_feature uses the existing column and DataFrame names and the default primitive description templates to generate feature descriptions.
[2]:
feature_defs[9]
[2]:
<Feature: MONTH(birthday)>
[3]:
ft.describe_feature(feature_defs[9])
[3]:
'The month of the "birthday".'
[4]:
feature_defs[14]
[4]:
<Feature: MODE(sessions.MODE(transactions.product_id))>
[5]:
ft.describe_feature(feature_defs[14])
[5]:
'The most frequently occurring value of the most frequently occurring value of the "product_id" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'
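The nesting in this description mirrors the nesting of the feature itself: each primitive contributes a phrase wrapped around the description of its input. The following is a minimal sketch of that composition in plain Python, not Featuretools' implementation. The helper names describe_aggregation and finish, and the exact template string, are assumptions chosen to reproduce the output above.

```python
# Illustrative only: hypothetical helpers, not part of Featuretools.

def describe_aggregation(template, input_description, child, parent_index, parent):
    """Wrap an input description in an aggregation phrase."""
    return (
        template.format(input_description)
        + f' of all instances of "{child}" for each "{parent_index}" in "{parent}"'
    )


def finish(description):
    """Capitalize the first letter and add a trailing period."""
    return description[0].upper() + description[1:] + "."


mode_template = "the most frequently occurring value of {}"  # assumed template text

# MODE(transactions.product_id), computed for each session
inner = describe_aggregation(
    mode_template, 'the "product_id"', "transactions", "session_id", "sessions"
)
# MODE(sessions.MODE(transactions.product_id)), computed for each customer
outer = describe_aggregation(
    mode_template, inner, "sessions", "customer_id", "customers"
)

print(finish(outer))
```

Each additional level of stacking adds one more template phrase and one more "of all instances of ..." clause, which is why deep features produce long default descriptions.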
Improving Descriptions
While the default descriptions can be helpful, they can also be further improved by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions.
Feature Descriptions
Custom feature definitions are used in the description in place of the automatically generated description. This can be used to better explain what a ColumnSchema or feature is, or to provide descriptions that take advantage of a user’s existing knowledge about the data or domain.
[6]:
feature_descriptions = {'customers: join_date': 'the date the customer joined'}
ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)
[6]:
'The month of the "birthday".'
For example, the dictionary above supplies a more descriptive definition of what the column "join_date" represents in the dataset. A custom description only affects features built on the column it names, so the description of MONTH(birthday) shown above is unchanged. Descriptions can also be set directly on a column in a DataFrame by going through the Woodwork typing information to access the description attribute present on each ColumnSchema:
[7]:
join_date_column_schema = es['customers'].ww.columns['join_date']
join_date_column_schema.description = 'the date the customer joined'
es['customers'].ww.columns['join_date'].description
[7]:
'the date the customer joined'
[8]:
feature = ft.TransformFeature(es['customers'].ww['join_date'], ft.primitives.Hour)
feature
[8]:
<Feature: HOUR(join_date)>
[9]:
ft.describe_feature(feature)
[9]:
'The hour value of the date the customer joined.'
Descriptions must be set for a column in a DataFrame before the feature is created in order for descriptions to propagate. Note that if a description is both set directly on a column and passed to describe_feature with feature_descriptions, the description in the feature_descriptions parameter will take precedence.
Feature descriptions can also be provided for generated features.
[10]:
feature_descriptions = {
    'sessions: SUM(transactions.amount)': 'the total transaction amount for a session'}
feature_defs[14]
[10]:
<Feature: MODE(sessions.MODE(transactions.product_id))>
[11]:
ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)
[11]:
'The most frequently occurring value of the most frequently occurring value of the "product_id" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'
Here, we create and pass in a custom description of the intermediate feature SUM(transactions.amount). The description of a feature built on top of it, such as MEAN(sessions.SUM(transactions.amount)), uses the custom description in place of the automatically generated one. The feature described above, MODE(sessions.MODE(transactions.product_id)), is not built on SUM(transactions.amount), so its description is unchanged. Feature descriptions can be passed in as a dictionary that maps either the feature object itself or the unique feature name, in the form "[dataframe_name]: [feature_name]", to the custom description, as shown above.
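As a small illustration of the name-based key format, the sketch below builds dictionary keys in the "[dataframe_name]: [feature_name]" form. The helper feature_key is hypothetical and exists only for this example; Featuretools does not require it, and the keys can just as well be written out by hand.

```python
# Illustrative only: `feature_key` is a hypothetical helper, not a Featuretools function.

def feature_key(dataframe_name, feature_name):
    """Build a key in the "[dataframe_name]: [feature_name]" form."""
    return f"{dataframe_name}: {feature_name}"


feature_descriptions = {
    feature_key("customers", "join_date"): "the date the customer joined",
    feature_key("sessions", "SUM(transactions.amount)"): (
        "the total transaction amount for a session"
    ),
}

for key, description in feature_descriptions.items():
    print(f"{key!r} -> {description!r}")
```

Note the single space after the colon; the keys here match the ones used in the cells above.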
Primitive Templates
Primitive descriptions are generated using primitive templates. By default, these are defined using the description_template attribute on the primitive. Primitives without a template default to using the name attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take input feature descriptions as the positional arguments. These can be overridden by mapping primitive instances or primitive names to custom templates and passing them into describe_feature through the primitive_templates argument.
[12]:
primitive_templates = {'sum': 'the total of {}'}
feature_defs[6]
[12]:
<Feature: SUM(transactions.amount)>
[13]:
ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)
[13]:
'The total of the "amount" of all instances of "transactions" for each "customer_id" in "customers".'
In this example, we override the default template of 'the sum of {}' with our custom template 'the total of {}'. The description uses our custom template instead of the default.
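To make the template mechanics concrete, here is a minimal sketch of filling a template with an input description as a positional argument, using Python's built-in str.format. The function fill_template is a hypothetical stand-in for illustration; it is not how Featuretools is necessarily implemented internally.

```python
# Illustrative only: `fill_template` is a hypothetical helper, not a Featuretools function.

def fill_template(template, *input_descriptions):
    """Fill a template with input descriptions, then format it as a sentence."""
    body = template.format(*input_descriptions)
    return body[0].upper() + body[1:] + "."


input_description = (
    'the "amount" of all instances of "transactions" '
    'for each "customer_id" in "customers"'
)

default_text = fill_template("the sum of {}", input_description)
custom_text = fill_template("the total of {}", input_description)

print(default_text)
print(custom_text)
```

Only the phrase contributed by the primitive changes; the input description is passed through untouched.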
Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first primitive template is always used for the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number converted to the “nth” form is available through the nth_slice keyword.
[14]:
feature = feature_defs[5]
feature
[14]:
<Feature: N_MOST_COMMON(transactions.product_id)>
[15]:
primitive_templates = {
    'n_most_common': [
        'the 3 most common elements of {}',  # generic multi-output feature
        'the {nth_slice} most common element of {}']}  # template for each slice
ft.describe_feature(feature, primitive_templates=primitive_templates)
[15]:
'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
Notice how the multi-output feature uses the first template for its description. Each slice of this feature uses the second template:
[16]:
ft.describe_feature(feature[0], primitive_templates=primitive_templates)
[16]:
'The 1st most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[17]:
ft.describe_feature(feature[1], primitive_templates=primitive_templates)
[17]:
'The 2nd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[18]:
ft.describe_feature(feature[2], primitive_templates=primitive_templates)
[18]:
'The 3rd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
Alternatively, instead of supplying a single template for all slices, templates can be provided for each slice to further customize the output. Note that in this case, each slice must get its own template.
[19]:
primitive_templates = {
    'n_most_common': [
        'the 3 most common elements of {}',
        'the most common element of {}',
        'the second most common element of {}',
        'the third most common element of {}']}
ft.describe_feature(feature, primitive_templates=primitive_templates)
[19]:
'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[20]:
ft.describe_feature(feature[0], primitive_templates=primitive_templates)
[20]:
'The most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[21]:
ft.describe_feature(feature[1], primitive_templates=primitive_templates)
[21]:
'The second most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[22]:
ft.describe_feature(feature[2], primitive_templates=primitive_templates)
[22]:
'The third most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
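The two ways of supplying slice templates can be summarized with a short sketch of the selection rule described in this section. The helpers ordinal, pick_template, and describe are hypothetical and written only for illustration; they model the documented behavior, not Featuretools' internals, and the input description is shortened for readability.

```python
# Illustrative only: hypothetical helpers, not Featuretools internals.

def ordinal(n):
    """Convert 1 -> '1st', 2 -> '2nd', 3 -> '3rd', 4 -> '4th', and so on."""
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def pick_template(templates, slice_index=None):
    """Select the template for the whole feature (None) or for one slice."""
    if slice_index is None:
        return templates[0]  # first template: generic multi-output feature
    if len(templates) == 2:
        return templates[1]  # one shared template for every slice
    return templates[slice_index + 1]  # otherwise one template per slice


def describe(templates, input_description, slice_index=None):
    """Fill the selected template; nth_slice is available as a keyword."""
    template = pick_template(templates, slice_index)
    nth = ordinal(slice_index + 1) if slice_index is not None else ""
    text = template.format(input_description, nth_slice=nth)
    return text[0].upper() + text[1:] + "."


shared = [
    "the 3 most common elements of {}",
    "the {nth_slice} most common element of {}",
]
per_slice = [
    "the 3 most common elements of {}",
    "the most common element of {}",
    "the second most common element of {}",
    "the third most common element of {}",
]

print(describe(shared, 'the "product_id"'))
print(describe(shared, 'the "product_id"', slice_index=1))
print(describe(per_slice, 'the "product_id"', slice_index=1))
```

With a shared slice template, the nth_slice keyword carries the slice position; with per-slice templates, the wording for each position is written out explicitly.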
Custom feature descriptions and primitive templates can also be separately defined in a JSON file and passed to the describe_feature function using the metadata_file keyword argument. Descriptions passed in directly through the feature_descriptions and primitive_templates keyword arguments will take precedence over any descriptions provided in the JSON metadata file.
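The precedence rule can be sketched in plain Python. The example below writes a small JSON file, reads it back, and merges it with directly supplied values so that the direct values win. The file layout (top-level keys named feature_descriptions and primitive_templates) and the file name are assumptions made for this illustration; check the Featuretools documentation for the exact format expected by metadata_file.

```python
import json
import os
import tempfile

# Assumed file layout: two top-level keys named after the keyword arguments.
metadata = {
    "feature_descriptions": {
        "customers: join_date": "the date the customer joined (from file)"
    },
    "primitive_templates": {"sum": "the total of {}"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this example.
    metadata_path = os.path.join(tmp_dir, "description_metadata.json")
    with open(metadata_path, "w") as f:
        json.dump(metadata, f)
    with open(metadata_path) as f:
        loaded = json.load(f)

# Descriptions that would be passed directly as a keyword argument.
direct_descriptions = {"customers: join_date": "the date the customer joined"}

# In a dict merge, later entries win, so direct values override the file's values.
merged_descriptions = {
    **loaded.get("feature_descriptions", {}),
    **direct_descriptions,
}
# No direct templates are supplied here, so the file's templates are kept as-is.
merged_templates = dict(loaded.get("primitive_templates", {}))

print(merged_descriptions)
print(merged_templates)
```

Keeping shared descriptions in a file and overriding individual entries at the call site makes it possible to reuse one metadata file across notebooks while still adjusting wording for a specific report.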