{ "cells": [ { "cell_type": "markdown", "id": "a4329c7d", "metadata": {}, "source": [ "# Tuning Deep Feature Synthesis\n", "\n", "There are several parameters that can be tuned to change the output of DFS. We'll explore these parameters using the following `transactions` EntitySet." ] }, { "cell_type": "code", "execution_count": null, "id": "12607fd8", "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True)\n", "es" ] }, { "cell_type": "markdown", "id": "6ef15160", "metadata": {}, "source": [ "## Using \"Seed Features\"\n", "\n", "Seed features are manually defined and problem specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.\n", "\n", "By using seed features, we can include domain specific knowledge in feature engineering automation. For the seed feature below, the domain knowlege may be that, for a specific retailer, a transaction above $125 would be considered an expensive purchase." ] }, { "cell_type": "code", "execution_count": null, "id": "b35f388e", "metadata": {}, "outputs": [], "source": [ "expensive_purchase = ft.Feature(es[\"transactions\"].ww[\"amount\"]) > 125\n", "\n", "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"percent_true\"],\n", " seed_features=[expensive_purchase],\n", ")\n", "feature_matrix[[\"PERCENT_TRUE(transactions.amount > 125)\"]]" ] }, { "cell_type": "markdown", "id": "8703d4b3", "metadata": {}, "source": [ "We can now see that the ``PERCENT_TRUE`` primitive was automatically applied to the boolean `expensive_purchase` feature from the `transactions` table. The feature produced as a result can be understood as the percentage of transactions for a customer that are considered expensive.\n", "\n", "## Add \"interesting\" values to columns\n", "\n", "Sometimes we want to create features that are conditioned on a second value before calculations are performed. We call this extra filter a \"where clause\". Where clauses are used in Deep Feature Synthesis by including primitives in the `where_primitives` parameter to DFS.\n", "\n", "By default, where clauses are built using the ``interesting_values`` of a column.\n", "\n", "Interesting values can be automatically determined and added for each DataFrame in a pandas EntitySet by calling `es.add_interesting_values()`. \n", "\n", "Note that Dask and Spark EntitySets cannot have interesting values determined automatically for their DataFrames. For those EntitySets, or when interesting values are already known for columns, the `dataframe_name` and `values` parameters can be used to set interesting values for individual columns in a DataFrame in an EntitySet." ] }, { "cell_type": "code", "execution_count": null, "id": "b6e88923", "metadata": {}, "outputs": [], "source": [ "values_dict = {\"device\": [\"desktop\", \"mobile\", \"tablet\"]}\n", "es.add_interesting_values(dataframe_name=\"sessions\", values=values_dict)" ] }, { "cell_type": "markdown", "id": "beee9073", "metadata": {}, "source": [ "Interesting values are stored in the DataFrame's Woodwork typing information." 
] }, { "cell_type": "code", "execution_count": null, "id": "c70ff02e", "metadata": {}, "outputs": [], "source": [ "es[\"sessions\"].ww.columns[\"device\"].metadata" ] }, { "cell_type": "markdown", "id": "ddec8e5a", "metadata": {}, "source": [ "Now that interesting values are set for the `device` column in the `sessions` table, we can specify the aggregation primitives for which we want where clauses using the ``where_primitives`` parameter to DFS." ] }, { "cell_type": "code", "execution_count": null, "id": "6eaabad8", "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"count\", \"avg_time_between\"],\n", " where_primitives=[\"count\", \"avg_time_between\"],\n", " trans_primitives=[],\n", ")\n", "feature_matrix" ] }, { "cell_type": "markdown", "id": "681a19db", "metadata": {}, "source": [ "Now, we have several new potentially useful features. Here are two of them that are built off of the where clause \"where the device used was a tablet\":" ] }, { "cell_type": "code", "execution_count": null, "id": "31a2a94e", "metadata": {}, "outputs": [], "source": [ "feature_matrix[\n", " [\n", " \"COUNT(sessions WHERE device = tablet)\",\n", " \"AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "id": "7b43a4a5", "metadata": {}, "source": [ "The first geature, `COUNT(sessions WHERE device = tablet)`, can be understood as indicating *how many sessions a customer completed on a tablet*.\n", "\n", "The second feature, `AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)`, calculates *the time between those sessions*.\n", "\n", "We can see that customer who only had 0 or 1 sessions on a tablet had ``NaN`` values for average time between such sessions.\n", "\n", "\n", "## Encoding categorical features\n", "\n", "Machine learning algorithms typically expect all numeric data or data that has defined numeric representations, like boolean values corresponding to `0` and `1`. When Deep Feature Synthesis generates categorical features, we can encode them using Featureools." ] }, { "cell_type": "code", "execution_count": null, "id": "a2ccb27b", "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mode\"],\n", " trans_primitives=[\"time_since\"],\n", " max_depth=1,\n", ")\n", "\n", "feature_matrix" ] }, { "cell_type": "markdown", "id": "a50adb54", "metadata": {}, "source": [ "This feature matrix contains 2 columns that are categorical in nature, ``zip_code`` and ``MODE(sessions.device)``. We can use the feature matrix and feature definitions to encode these categorical values into boolean values. Featuretools offers functionality to apply one hot encoding to the output of DFS." ] }, { "cell_type": "code", "execution_count": null, "id": "088672ac", "metadata": {}, "outputs": [], "source": [ "feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)\n", "feature_matrix_enc" ] }, { "cell_type": "markdown", "id": "54076098", "metadata": {}, "source": [ "The returned feature matrix is now encoded in a way that is interpretable to machine learning algorithms. Notice how the columns that did not need encoding are still included. Additionally, we get a new set of feature definitions that contain the encoded values." 
] }, { "cell_type": "code", "execution_count": null, "id": "db8dd84b", "metadata": {}, "outputs": [], "source": [ "features_enc" ] }, { "cell_type": "markdown", "id": "b4bda3a2", "metadata": {}, "source": [ "These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read the [Deployment](deployment.ipynb) guide." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 5 }