{ "cells": [ { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _primitives:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature primitives\n", "Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive only constrains the input and output data types, they can be applied across datasets and can stack to create new calculations.\n", "\n", "## Why primitives?\n", "The space of potential functions that humans use to create a feature is expansive. By breaking common feature engineering calculations down into primitive components, we are able to capture the underlying structure of the features humans create today.\n", "\n", "A primitive only constrains the input and output data types. This means they can be used to transfer calculations known in one domain to another. Consider a feature which is often calculated by data scientists for transactional or event logs data: *average time between events*. This feature is incredibly valuable in predicting fraudulent behavior or future customer engagement.\n", "\n", "DFS achieves the same feature by stacking two primitives `\"time_since_previous\"` and `\"mean\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "\n", "es = ft.demo.load_mock_customer(return_entityset=True)\n", "\n", "feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mean\"],\n", " trans_primitives=[\"time_since_previous\"],\n", " features_only=True,\n", ")\n", "\n", "feature_defs" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " Browse the entire catalog of included `Featuretools Primitives here `_ \n", " \n", " \n", ".. note:: \n", "\n", " The primitive arguments to DFS (eg. ``agg_primitives`` and ``trans_primitives`` in the example above) accept ``snake_case``, ``camelCase``, or ``TitleCase`` strings of included Featuretools primitives (ie. ``time_since_previous``, ``timeSincePrevious``, and ``TimeSincePrevious`` are all acceptable inputs).\n", "\n", ".. note::\n", "\n", " When ``dfs`` is called with ``features_only=True``, only feature definitions are returned as output. By default this parameter is set to ``False``. This parameter is used quickly inspect the feature definitions before the spending time calculating the feature matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. This is used by Deep Feature Synthesis to get several different ways of summarizing the time since the previous event." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"customers\",\n", " agg_primitives=[\"mean\", \"max\", \"min\", \"std\", \"skew\"],\n", " trans_primitives=[\"time_since_previous\"],\n", ")\n", "\n", "feature_matrix[[\n", " \"MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " \"MAX(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " \"MIN(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " \"STD(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", " \"SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n", "]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation vs Transform Primitive\n", "\n", "In the example above, we use two types of primitives.\n", "\n", "**Aggregation primitives:** These primitives take related instances as an input and output a single value. They are applied across a parent-child relationship in an EntitySet. E.g: `\"count\"`, `\"sum\"`, `\"avg_time_between\"`." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. graphviz:: graphs/agg_feat.dot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Transform primitives:** These primitives take one or more columns from a dataframe as an input and output a new column for that dataframe. They are applied to a single dataframe. E.g: `\"hour\"`, `\"time_since_previous\"`, `\"absolute\"`." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. graphviz:: graphs/trans_feat.dot\n", "\n", "\n", "The above graphs were generated using the :func:`graph_feature ` function. These feature lineage graphs help to visually show how primitives were stacked to generate a feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a DataFrame that lists and describes each built-in primitive in Featuretools, call `ft.list_primitives()`. In addition, a list of all available primitives can be obtained by visiting [primitives.featurelabs.com](https://primitives.featurelabs.com/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ft.list_primitives().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining Custom Primitives\n", "\n", "The library of primitives in Featuretools is constantly expanding. Users can define their own primitive using the APIs below. To define a primitive, a user will\n", "\n", "\n", " * Specify the type of primitive `Aggregation` or `Transform`\n", " * Define the input and output data types\n", " * Write a function in python to do the calculation\n", " * Annotate with attributes to constrain how it is applied\n", "\n", "\n", "Once a primitive is defined, it can stack with existing primitives to generate complex patterns. This enables primitives known to be important for one domain to automatically be transfered to another." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from featuretools.primitives import AggregationPrimitive, TransformPrimitive\n", "from featuretools.tests.testing_utils import make_ecommerce_entityset\n", "from woodwork.column_schema import ColumnSchema\n", "from woodwork.logical_types import Datetime, NaturalLanguage\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "### Simple Custom Primitives" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Absolute(TransformPrimitive):\n", " name = 'absolute'\n", " input_types = [ColumnSchema(semantic_tags={'numeric'})]\n", " return_type = ColumnSchema(semantic_tags={'numeric'})\n", "\n", " def get_function(self):\n", " def absolute(column):\n", " return abs(column)\n", "\n", " return absolute" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "Above, we created a new transform primitive that can be used with Deep Feature Synthesis by deriving a new primitive class using `TransformPrimitive` as a base and overriding `get_function` to return a function that calculates the feature. Additionally, we set the input data types that the primitive applies to and the return data type. Input and return data types are defined using a Woodwork ColumnSchema. A full guide on Woodwork logical types and semantic tags can be found in the Woodwork [Understanding Types and Tags](https://woodwork.alteryx.com/en/stable/guides/understanding_types_and_tags.html) guide.\n", "\n", "Similarly, we can make a new aggregation primitive using `AggregationPrimitive`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Maximum(AggregationPrimitive):\n", " name = 'maximum'\n", " input_types = [ColumnSchema(semantic_tags={'numeric'})]\n", " return_type = ColumnSchema(semantic_tags={'numeric'})\n", "\n", " def get_function(self):\n", " def maximum(column):\n", " return max(column)\n", "\n", " return maximum" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "Because we defined an aggregation primitive, the function takes in a list of values but only returns one.\n", "\n", "Now that we've defined two primitives, we can use them with the dfs function as if they were built-in primitives." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, feature_defs = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[Maximum],\n", " trans_primitives=[Absolute],\n", " max_depth=2,\n", ")\n", "\n", "feature_matrix.head(5)[[\n", " \"customers.MAXIMUM(transactions.amount)\",\n", " \"MAXIMUM(transactions.ABSOLUTE(amount))\",\n", "]]" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "### Word Count Example\n", "\n", "Here we define a transform primitive, `WordCount`, which counts the number of words in each row of an input and returns a list of the counts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class WordCount(TransformPrimitive):\n", " '''\n", " Counts the number of words in each row of the column. Returns a list\n", " of the counts for each row.\n", " '''\n", " name = 'word_count'\n", " input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n", " return_type = ColumnSchema(semantic_tags={'numeric'})\n", "\n", " def get_function(self):\n", " def word_count(column):\n", " word_counts = []\n", " for value in column:\n", " words = value.split(None)\n", " word_counts.append(len(words))\n", " return word_counts\n", "\n", " return word_count" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es = make_ecommerce_entityset()\n", "\n", "feature_matrix, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[\"sum\", \"mean\", \"std\"],\n", " trans_primitives=[WordCount],\n", ")\n", "\n", "feature_matrix[[\n", " \"customers.WORD_COUNT(favorite_quote)\",\n", " \"STD(log.WORD_COUNT(comments))\",\n", " \"SUM(log.WORD_COUNT(comments))\",\n", " \"MEAN(log.WORD_COUNT(comments))\",\n", "]]" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "By adding some aggregation primitives as well, Deep Feature Synthesis was able to make four new features from one new primitive.\n", "\n", "### Multiple Input Types\n", "\n", "If a primitive requires multiple features as input, `input_types` has multiple elements, eg `[ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(semantic_tags={'numeric'})]` would mean the primitive requires two columns with the semantic tag `numeric` as input. Below is an example of a primitive that has multiple input features." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "class MeanSunday(AggregationPrimitive):\n", " '''\n", " Finds the mean of non-null values of a feature that occurred on Sundays\n", " '''\n", " name = 'mean_sunday'\n", " input_types = [ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(logical_type=Datetime)]\n", " return_type = ColumnSchema(semantic_tags={'numeric'})\n", "\n", " def get_function(self):\n", " def mean_sunday(numeric, datetime):\n", " days = pd.DatetimeIndex(datetime).weekday.values\n", " df = pd.DataFrame({'numeric': numeric, 'time': days})\n", " return df[df['time'] == 6]['numeric'].mean()\n", "\n", " return mean_sunday" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_matrix, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"sessions\",\n", " agg_primitives=[MeanSunday],\n", " trans_primitives=[],\n", " max_depth=1,\n", ")\n", "\n", "feature_matrix[[\n", " \"MEAN_SUNDAY(log.value, datetime)\",\n", " \"MEAN_SUNDAY(log.value_2, datetime)\",\n", "]]" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }