{ "cells": [ { "cell_type": "markdown", "id": "1557274d", "metadata": {}, "source": [ "# Generating Feature Descriptions\n", "\n", "As features become more complicated, their names can become harder to understand. Both the [describe_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.describe_feature.html) function and the [graph_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.graph_feature.html) function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the ``describe_feature`` function can be augmented by providing custom definitions and templates to improve the resulting descriptions. " ] }, { "cell_type": "code", "execution_count": null, "id": "cdb8b3eb", "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True)\n", "\n", "feature_defs = ft.dfs(\n", "    entityset=es,\n", "    target_dataframe_name=\"customers\",\n", "    agg_primitives=[\"mean\", \"sum\", \"mode\", \"n_most_common\"],\n", "    trans_primitives=[\"month\", \"hour\"],\n", "    max_depth=2,\n", "    features_only=True,\n", ")" ] }, { "cell_type": "markdown", "id": "01f8209c", "metadata": {}, "source": [ "By default, ``describe_feature`` uses the existing column and DataFrame names and the default primitive description templates to generate feature descriptions. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "35b86722", "metadata": {}, "outputs": [], "source": [ "feature_defs[9]" ] }, { "cell_type": "code", "execution_count": null, "id": "e24bee8d", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature_defs[9])" ] }, { "cell_type": "code", "execution_count": null, "id": "5402e848", "metadata": {}, "outputs": [], "source": [ "feature_defs[14]" ] }, { "cell_type": "code", "execution_count": null, "id": "ac22c09c", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature_defs[14])" ] }, { "cell_type": "markdown", "id": "ff9b7b35", "metadata": {}, "source": [ "## Improving Descriptions\n", "\n", "While the default descriptions can be helpful, they can also be further improved by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions. \n", "\n", "#### Feature Descriptions\n", "Custom feature definitions will get used in the description in place of the automatically generated description. This can be used to better explain what a `ColumnSchema` or feature is, or to provide descriptions that take advantage of a user's existing knowledge about the data or domain. " ] }, { "cell_type": "code", "execution_count": null, "id": "33b2f8e5", "metadata": {}, "outputs": [], "source": [ "feature_descriptions = {\"customers: join_date\": \"the date the customer joined\"}\n", "\n", "ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)" ] }, { "cell_type": "markdown", "id": "218147f4", "metadata": {}, "source": [ "For example, the above replaces the column name, ``\"join_date\"``, with a more descriptive definition of what that column represents in the dataset. 
Descriptions can also be set directly on a column in a DataFrame by going through the Woodwork typing information to access the ``description`` attribute present on each `ColumnSchema`:" ] }, { "cell_type": "code", "execution_count": null, "id": "597e20a6", "metadata": {}, "outputs": [], "source": [ "join_date_column_schema = es[\"customers\"].ww.columns[\"join_date\"]\n", "join_date_column_schema.description = \"the date the customer joined\"\n", "\n", "es[\"customers\"].ww.columns[\"join_date\"].description" ] }, { "cell_type": "code", "execution_count": null, "id": "6c013615", "metadata": {}, "outputs": [], "source": [ "feature = ft.TransformFeature(es[\"customers\"].ww[\"join_date\"], ft.primitives.Hour)\n", "feature" ] }, { "cell_type": "code", "execution_count": null, "id": "03e828b4", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature)" ] }, { "cell_type": "raw", "id": "689cbd98", "metadata": {}, "source": [ ".. note::\n", "\n", "    When setting a description on a column in a DataFrame as described above, be careful to avoid setting the description via ``df.ww[col_name].ww.description``. The use of ``df.ww[col_name]`` creates an entirely new Series object that is not related to the EntitySet from which feature descriptions are built. Therefore, setting the description in any way other than going through the ``columns`` attribute will not set the column's description in a way that will be propagated to the feature description. " ] }, { "cell_type": "markdown", "id": "10e779f5", "metadata": {}, "source": [ "Descriptions must be set for a column in a DataFrame before the feature is created in order for descriptions to propagate. Note that if a description is both set directly on a column and passed to ``describe_feature`` with ``feature_descriptions``, the description in the ``feature_descriptions`` parameter will take precedence.\n", "\n", "Feature descriptions can also be provided for generated features." 
] }, { "cell_type": "code", "execution_count": null, "id": "5d1f8667", "metadata": {}, "outputs": [], "source": [ "feature_descriptions = {\n", "    \"sessions: SUM(transactions.amount)\": \"the total transaction amount for a session\"\n", "}\n", "\n", "feature_defs[14]" ] }, { "cell_type": "code", "execution_count": null, "id": "b90b8e4e", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)" ] }, { "cell_type": "markdown", "id": "83217b19", "metadata": {}, "source": [ "Here, we create and pass in a custom description of the intermediate feature ``SUM(transactions.amount)``. The description for ``MEAN(sessions.SUM(transactions.amount))``, which is built on top of ``SUM(transactions.amount)``, uses the custom description in place of the automatically generated one. Feature descriptions can be passed in as a dictionary that maps either the feature object itself or the unique feature name in the form ``\"[dataframe_name]: [feature_name]\"`` to the custom description, as shown above.\n", "\n", "#### Primitive Templates\n", "Primitive descriptions are generated using primitive templates. By default, these are defined using the ``description_template`` attribute on the primitive. Primitives without a template default to using the ``name`` attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take input feature descriptions as the positional arguments. These can be overridden by mapping primitive instances or primitive names to custom templates and passing them into ``describe_feature`` through the ``primitive_templates`` argument. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "50f1bfb8", "metadata": {}, "outputs": [], "source": [ "primitive_templates = {\"sum\": \"the total of {}\"}\n", "\n", "feature_defs[6]" ] }, { "cell_type": "code", "execution_count": null, "id": "c1fb53a3", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)" ] }, { "cell_type": "markdown", "id": "9b9cceca", "metadata": {}, "source": [ "In this example, we override the default template of ``'the sum of {}'`` with our custom template ``'the total of {}'``. The description uses our custom template instead of the default.\n", "\n", "Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first primitive template is always the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number converted to the \"nth\" form is available through the ``nth_slice`` keyword." ] }, { "cell_type": "code", "execution_count": null, "id": "15ed472c", "metadata": {}, "outputs": [], "source": [ "feature = feature_defs[5]\n", "feature" ] }, { "cell_type": "code", "execution_count": null, "id": "54a5a6fd", "metadata": {}, "outputs": [], "source": [ "primitive_templates = {\n", " \"n_most_common\": [\n", " \"the 3 most common elements of {}\", # generic multi-output feature\n", " \"the {nth_slice} most common element of {}\",\n", " ]\n", "} # template for each slice\n", "\n", "ft.describe_feature(feature, primitive_templates=primitive_templates)" ] }, { "cell_type": "markdown", "id": "49aae7d2", "metadata": {}, "source": [ "Notice how the multi-output feature uses the first template for its description. 
Each slice of this feature uses the second template, the slice template:" ] }, { "cell_type": "code", "execution_count": null, "id": "1bd3a3cf", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[0], primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "607299ff", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[1], primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "30f4235f", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[2], primitive_templates=primitive_templates)" ] }, { "cell_type": "markdown", "id": "17953d54", "metadata": {}, "source": [ "Alternatively, instead of supplying a single template for all slices, templates can be provided for each slice to further customize the output. Note that in this case, each slice must get its own template." ] }, { "cell_type": "code", "execution_count": null, "id": "bad05646", "metadata": {}, "outputs": [], "source": [ "primitive_templates = {\n", "    \"n_most_common\": [\n", "        \"the 3 most common elements of {}\",\n", "        \"the most common element of {}\",\n", "        \"the second most common element of {}\",\n", "        \"the third most common element of {}\",\n", "    ]\n", "}\n", "\n", "ft.describe_feature(feature, primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "fdad1868", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[0], primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "90a85bd0", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[1], primitive_templates=primitive_templates)" ] }, { "cell_type": "code", "execution_count": null, "id": "b63d47a7", "metadata": {}, "outputs": [], "source": [ "ft.describe_feature(feature[2], primitive_templates=primitive_templates)" ] }, { "cell_type": "markdown", "id": "1942ea49", "metadata": {}, 
"source": [ "Custom feature descriptions and primitive templates can also be seperately defined in a JSON file and passed to the ``describe_feature`` function using the ``metadata_file`` keyword argument. Descriptions passed in directly through the ``feature_descriptions`` and ``primitive_templates`` keyword arguments will take precedence over any descriptions provided in the JSON metadata file." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 5 }