NOTICE
The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:
pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip
For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.
Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.
Deep Feature Synthesis requires structured datasets in order to perform feature engineering. To demonstrate the capabilities of DFS, we will use a mock customer transactions dataset.
Note
Before using DFS, it is recommended that you prepare your data as an EntitySet. See Representing Data with EntitySets to learn how.
In [1]: import featuretools as ft

In [2]: es = ft.demo.load_mock_customer(return_entityset=True)

In [3]: es
Out[3]:
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 4]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
Once data is prepared as an EntitySet, we are ready to automatically generate features for a target entity - e.g. customers.
Typically, without automated feature engineering, a data scientist would write code to aggregate data for a customer and apply different statistical functions, resulting in features that quantify the customer’s behavior. In this example, an expert might be interested in features such as the total number of sessions or the month the customer signed up.
These features can be generated by DFS when we specify the target_entity as customers and "count" and "month" as primitives.
In [4]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["count"],
   ...:                                       trans_primitives=["month"],
   ...:                                       max_depth=1)
   ...:

In [5]: feature_matrix
Out[5]:
             zip_code  COUNT(sessions)  MONTH(date_of_birth)  MONTH(join_date)
customer_id
5               60091                6                     7                 7
4               60091                8                     8                 4
1               60091                8                     7                 4
3               13244                6                    11                 8
2               13244                7                     8                 4
In the example above, "count" is an aggregation primitive because it computes a single value based on many sessions related to one customer. "month" is called a transform primitive because it takes one value for a customer and transforms it into another.
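To make the distinction concrete, the sketch below mimics the two kinds of primitive in plain Python, without Featuretools. The toy rows and the helper names count_sessions and month_of are hypothetical and are not part of the Featuretools API. They only illustrate that an aggregation collapses many child rows into one value per customer, while a transform maps one value to one value.

```python
from collections import Counter
from datetime import date

# Hypothetical toy data, not the mock customer dataset used above.
customers = {1: date(2011, 4, 17), 2: date(2012, 8, 13)}  # customer_id -> join_date
sessions = [(1, 2), (2, 1), (3, 2), (4, 1), (5, 2)]       # (session_id, customer_id)


def count_sessions(session_rows):
    """Aggregation: many session rows -> one count per customer."""
    return Counter(customer_id for _, customer_id in session_rows)


def month_of(day):
    """Transform: one date -> one month number."""
    return day.month


counts = count_sessions(sessions)
features = {
    customer_id: {
        "COUNT(sessions)": counts[customer_id],
        "MONTH(join_date)": month_of(join_date),
    }
    for customer_id, join_date in customers.items()
}
print(features)
```

The aggregation needs the related sessions table to produce its value, whereas the transform only needs a column that already lives on the customer row.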
Feature primitives are a fundamental component of Featuretools. To learn more, read Feature primitives.
The name Deep Feature Synthesis comes from the algorithm’s ability to stack primitives to generate more complex features. Each time we stack a primitive, we increase the “depth” of a feature. The max_depth parameter controls the maximum depth of the features returned by DFS. Let us try running DFS with max_depth=2:
In [6]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["mean", "sum", "mode"],
   ...:                                       trans_primitives=["month", "hour"],
   ...:                                       max_depth=2)
   ...:

In [7]: feature_matrix
Out[7]:
             zip_code  ...  MODE(transactions.sessions.device)
customer_id            ...
5               60091  ...                              mobile
4               60091  ...                              mobile
1               60091  ...                              mobile
3               13244  ...                             desktop
2               13244  ...                             desktop

[5 rows x 17 columns]
With a depth of 2, a number of features are generated using the supplied primitives. The algorithm to synthesize these definitions is described in this paper. In the returned feature matrix, let us understand one of the depth 2 features:
In [8]: feature_matrix[['MEAN(sessions.SUM(transactions.amount))']]
Out[8]:
             MEAN(sessions.SUM(transactions.amount))
customer_id
5                                        1058.276667
4                                        1090.960000
1                                        1128.202500
3                                        1039.436667
2                                        1028.611429
For each customer, this feature
1. calculates the sum of all transaction amounts per session to get the total amount per session, then
2. applies the mean to those totals across the customer’s sessions to identify the average amount spent per session.
We call this feature a “deep feature” with a depth of 2.
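The two stacked steps can be traced by hand on a small example. The following is a minimal plain-Python sketch of what MEAN(sessions.SUM(transactions.amount)) computes, using hypothetical toy rows rather than the mock dataset; it is not how Featuretools implements the calculation internally.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (customer_id, session_id, amount).
transactions = [
    (1, 10, 20.0), (1, 10, 30.0),  # session 10 totals 50.0
    (1, 11, 70.0),                 # session 11 totals 70.0
    (2, 12, 10.0), (2, 12, 15.0),  # session 12 totals 25.0
]

# Depth 1: SUM(transactions.amount) for each session.
session_totals = defaultdict(float)
session_owner = {}
for customer_id, session_id, amount in transactions:
    session_totals[session_id] += amount
    session_owner[session_id] = customer_id

# Depth 2: MEAN of those per-session sums for each customer.
totals_by_customer = defaultdict(list)
for session_id, total in session_totals.items():
    totals_by_customer[session_owner[session_id]].append(total)

mean_session_total = {c: mean(t) for c, t in totals_by_customer.items()}
print(mean_session_total)
```

Note that the result differs from the mean transaction amount: customer 1 has three transactions averaging 40.0, but two sessions whose totals average 60.0.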
Let’s look at another depth 2 feature that calculates, for every customer, the most common hour of the day when they start a session:
In [9]: feature_matrix[['MODE(sessions.HOUR(session_start))']]
Out[9]:
             MODE(sessions.HOUR(session_start))
customer_id
5                                             0
4                                             1
1                                             6
3                                             5
2                                             3
For each customer, this feature
1. calculates the hour of the day at which each of his or her sessions started, then
2. uses the statistical function mode to identify the most common hour in which he or she started a session.
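Here the stacking order is reversed relative to the previous feature: a transform is applied first and an aggregation second. A minimal plain-Python sketch with hypothetical timestamps follows; how ties between equally common hours are resolved here is a property of Python's statistics.mode and may differ from Featuretools.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mode

# Hypothetical toy rows: (customer_id, session_start).
sessions = [
    (1, datetime(2014, 1, 1, 6, 5)),
    (1, datetime(2014, 1, 1, 6, 40)),
    (1, datetime(2014, 1, 1, 9, 15)),
    (2, datetime(2014, 1, 1, 3, 0)),
]

# Transform step: HOUR(session_start) for each session.
hours_by_customer = defaultdict(list)
for customer_id, session_start in sessions:
    hours_by_customer[customer_id].append(session_start.hour)

# Aggregation step: MODE of those hours for each customer.
most_common_hour = {c: mode(h) for c, h in hours_by_customer.items()}
print(most_common_hour)
```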
Stacking results in features that are more expressive than individual primitives themselves. This enables the automatic creation of complex patterns for machine learning.
You can graphically visualize the lineage of a feature by calling featuretools.graph_feature() on it. You can also generate an English description of the feature with featuretools.describe_feature(). See Generating Feature Descriptions for more details.
DFS is powerful because we can create a feature matrix for any entity in our dataset. If we switch our target entity to “sessions”, we can synthesize features for each session instead of each customer. Now, we can use these features to predict the outcome of a session.
In [10]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="sessions",
   ....:                                       agg_primitives=["mean", "sum", "mode"],
   ....:                                       trans_primitives=["month", "hour"],
   ....:                                       max_depth=2)
   ....:

In [11]: feature_matrix.head(5)
Out[11]:
            customer_id  ...  customers.MONTH(join_date)
session_id               ...
1                     2  ...                           4
2                     5  ...                           7
3                     4  ...                           4
4                     1  ...                           4
5                     4  ...                           4

[5 rows x 19 columns]
As we can see, DFS will also build deep features based on a parent entity, in this case the customer of a particular session. For example, the feature below calculates the mean transaction amount of the customer of the session.
In [12]: feature_matrix[['customers.MEAN(transactions.amount)']].head(5)
Out[12]:
            customers.MEAN(transactions.amount)
session_id
1                                     77.422366
2                                     80.375443
3                                     80.070459
4                                     71.631905
5                                     80.070459
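In the output above, sessions 3 and 5 share the same value because both belong to customer 4. The sketch below illustrates that pattern in plain Python with hypothetical toy data: the aggregation is computed once per parent customer and then carried down to each of that customer's sessions. It illustrates the idea only and is not Featuretools' implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data.
session_customer = {1: "a", 2: "b", 3: "a"}  # session_id -> customer_id
transactions = [(1, 10.0), (1, 30.0), (2, 50.0), (3, 20.0)]  # (session_id, amount)

# Aggregate on the parent: MEAN(transactions.amount) per customer,
# pooling transactions from all of that customer's sessions.
amounts_by_customer = defaultdict(list)
for session_id, amount in transactions:
    amounts_by_customer[session_customer[session_id]].append(amount)
customer_mean = {c: mean(a) for c, a in amounts_by_customer.items()}

# Carry the parent's value down to every child session.
session_feature = {s: customer_mean[c] for s, c in session_customer.items()}
print(session_feature)
```

Sessions 1 and 3 receive the same value because they share customer "a", just as sessions 3 and 5 do in the feature matrix above.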
To learn about the parameters you can change in DFS, read Tuning Deep Feature Synthesis.