NOTICE

The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:

pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip

For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.

What is Featuretools?¶

Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.

5 Minute Quick Start¶

Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.

[1]:

import featuretools as ft

Load Mock Data¶

[2]:

data = ft.demo.load_mock_customer()

Prepare data¶

In this toy dataset, there are 3 DataFrames.

customers: unique customers who had sessions
sessions: unique sessions and associated attributes
transactions: the events (transactions) that occurred within each session

[3]:

customers_df = data["customers"]
customers_df

[3]:

	customer_id	zip_code	join_date	date_of_birth
0	1	60091	2011-04-17 10:48:33	1994-07-18
1	2	13244	2012-04-15 23:31:04	1986-08-18
2	3	13244	2011-08-13 15:42:34	2003-11-21
3	4	60091	2011-04-08 20:08:14	2006-08-15
4	5	60091	2010-07-17 05:27:50	1984-07-28

[4]:

sessions_df = data["sessions"]
sessions_df.sample(5)

[4]:

	session_id	customer_id	device	session_start
13	14	1	tablet	2014-01-01 03:28:00
6	7	3	tablet	2014-01-01 01:39:40
1	2	5	mobile	2014-01-01 00:17:20
28	29	1	mobile	2014-01-01 07:10:05
24	25	3	desktop	2014-01-01 05:59:40

[5]:

transactions_df = data["transactions"]
transactions_df.sample(5)

[5]:

	transaction_id	session_id	transaction_time	product_id	amount
74	232	5	2014-01-01 01:20:10	1	139.20
231	27	17	2014-01-01 04:10:15	2	90.79
434	36	31	2014-01-01 07:50:10	3	62.35
420	56	30	2014-01-01 07:35:00	3	72.70
54	444	4	2014-01-01 00:58:30	4	43.59

First, we specify a dictionary with all the DataFrames in our dataset. Each DataFrame is passed as a tuple together with the name of its index column and, if it has one, the name of its time index column.

[6]:

dataframes = {
   "customers" : (customers_df, "customer_id"),
   "sessions" : (sessions_df, "session_id", "session_start"),
   "transactions" : (transactions_df, "transaction_id", "transaction_time")
}

Second, we specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the “one” DataFrame the “parent DataFrame” and the “many” DataFrame the “child DataFrame”. A relationship between a parent and child is defined like this:

(parent_dataframe, parent_column, child_dataframe, child_column)

In this dataset we have two relationships:

[7]:

relationships = [("sessions", "session_id", "transactions", "session_id"),
                 ("customers", "customer_id", "sessions", "customer_id")]
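As a rough illustration of what such a tuple asserts, the sketch below checks the parent/child condition in plain Python on a few rows copied from the tables above. The helper `check_relationship` is hypothetical and not part of Featuretools; it only assumes that the parent column must be unique and that every child value must appear in it.

```python
# Hypothetical helper, not a Featuretools API: checks the one-to-many
# condition implied by (parent, parent_column, child, child_column).

def check_relationship(parent_rows, parent_column, child_rows, child_column):
    parent_keys = [row[parent_column] for row in parent_rows]
    # The "one" side: each parent key may appear only once.
    if len(parent_keys) != len(set(parent_keys)):
        return False
    # The "many" side: every child row must point at an existing parent.
    known = set(parent_keys)
    return all(row[child_column] in known for row in child_rows)

# A few rows copied from the customers and sessions tables above.
customers = [{"customer_id": i} for i in (1, 2, 3, 4, 5)]
sessions = [
    {"session_id": 14, "customer_id": 1},
    {"session_id": 7, "customer_id": 3},
    {"session_id": 2, "customer_id": 5},
    {"session_id": 29, "customer_id": 1},
    {"session_id": 25, "customer_id": 3},
]

relationship_ok = check_relationship(
    customers, "customer_id", sessions, "customer_id"
)
print(relationship_ok)
```

Several sessions may share one customer_id, but a session that referenced an unknown customer would make the check fail.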

Note

To manage setting up DataFrames and relationships, we recommend using the EntitySet class which offers convenient APIs for managing data like this. See Representing Data with EntitySets for more information.

Run Deep Feature Synthesis¶

A minimal input to DFS is a dictionary of DataFrames, a list of relationships, and the name of the target DataFrame whose features we want to calculate. The output of DFS is a feature matrix and the corresponding list of feature definitions.

Let’s first create a feature matrix for each customer in the data:

[8]:

feature_matrix_customers, features_defs = ft.dfs(dataframes=dataframes,
                                                 relationships=relationships,
                                                 target_dataframe_name="customers")
feature_matrix_customers

[8]:

	COUNT(sessions)	MODE(sessions.device)	NUM_UNIQUE(sessions.device)	COUNT(transactions)	MAX(transactions.amount)	MEAN(transactions.amount)	MIN(transactions.amount)	MODE(transactions.product_id)	NUM_UNIQUE(transactions.product_id)	SKEW(transactions.amount)	...	STD(sessions.SKEW(transactions.amount))	STD(sessions.SUM(transactions.amount))	SUM(sessions.MAX(transactions.amount))	SUM(sessions.MEAN(transactions.amount))	SUM(sessions.MIN(transactions.amount))	SUM(sessions.NUM_UNIQUE(transactions.product_id))	SUM(sessions.SKEW(transactions.amount))	SUM(sessions.STD(transactions.amount))	MODE(transactions.sessions.device)	NUM_UNIQUE(transactions.sessions.device)
customer_id
1	8	mobile	3	126	139.43	71.631905	5.81	4	5	0.019698	...	0.589386	279.510713	1057.97	582.193117	78.59	40.0	-0.476122	312.745952	mobile	3
2	7	desktop	3	93	146.81	77.422366	8.73	4	5	0.098259	...	0.509798	251.609234	931.63	548.905851	154.60	35.0	-0.277640	258.700528	desktop	3
3	6	desktop	3	93	149.15	67.060430	5.89	1	5	0.418230	...	0.429374	219.021420	847.63	405.237462	66.21	29.0	2.286086	257.299895	desktop	3
4	8	mobile	3	109	149.95	80.070459	5.73	2	5	-0.036348	...	0.387884	235.992478	1157.99	649.657515	131.51	37.0	0.002764	356.125829	mobile	3
5	6	mobile	3	79	149.02	80.375443	7.55	5	5	-0.025941	...	0.415426	402.775486	839.76	472.231119	86.49	30.0	0.014384	259.873954	mobile	3

5 rows × 74 columns

We now have dozens of new features to describe a customer’s behavior.
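Names such as SUM(sessions.MAX(transactions.amount)) describe stacked primitives: an aggregation over transactions within each session, followed by an aggregation over sessions within each customer. Below is a minimal plain-Python sketch of that composition. The rows are made up and the `aggregate` helper is hypothetical; neither comes from Featuretools or the mock dataset.

```python
from collections import defaultdict

def aggregate(rows, key, value, func):
    """Group rows by `key` and apply `func` to each group's `value` list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}

# Made-up data: two customers, three sessions, five transactions.
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 5.0},
    {"session_id": 3, "amount": 7.0},
    {"session_id": 3, "amount": 2.0},
]

# Depth 1: MAX(transactions.amount) for each session.
max_amount = aggregate(transactions, "session_id", "amount", max)

# Depth 2: SUM(sessions.MAX(transactions.amount)) for each customer.
session_rows = [
    {"customer_id": s["customer_id"], "max_amount": max_amount[s["session_id"]]}
    for s in sessions
]
sum_of_max = aggregate(session_rows, "customer_id", "max_amount", sum)
print(sum_of_max)
```

The second call reuses the output of the first as its input column, which is the sense in which the feature is "deep".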

Change target DataFrame¶

One of the reasons DFS is so powerful is that it can create a feature matrix for any DataFrame in our dataset. For example, we can build features for sessions instead:

[10]:

feature_matrix_sessions, features_defs = ft.dfs(dataframes=dataframes,
                                                relationships=relationships,
                                                target_dataframe_name="sessions")
feature_matrix_sessions.head(5)

[10]:

	customer_id	device	COUNT(transactions)	MAX(transactions.amount)	MEAN(transactions.amount)	MIN(transactions.amount)	MODE(transactions.product_id)	NUM_UNIQUE(transactions.product_id)	SKEW(transactions.amount)	STD(transactions.amount)	...	customers.STD(transactions.amount)	customers.SUM(transactions.amount)	customers.DAY(date_of_birth)	customers.DAY(join_date)	customers.MONTH(date_of_birth)	customers.MONTH(join_date)	customers.WEEKDAY(date_of_birth)	customers.WEEKDAY(join_date)	customers.YEAR(date_of_birth)	customers.YEAR(join_date)
session_id
1	2	desktop	16	141.66	76.813125	20.91	3	5	0.295458	41.600976	...	37.705178	7200.28	18	15	8	4	0	6	1986	2012
2	5	mobile	10	135.25	74.696000	9.32	5	5	-0.160550	45.893591	...	44.095630	6349.66	28	17	7	7	5	5	1984	2010
3	4	mobile	15	147.73	88.600000	8.70	1	5	-0.324012	46.240016	...	45.068765	8727.68	15	8	8	4	1	4	2006	2011
4	1	mobile	25	129.00	64.557200	6.29	5	5	0.234349	40.187205	...	40.442059	9025.62	18	17	7	4	0	6	1994	2011
5	4	mobile	11	139.20	70.638182	7.43	5	5	0.336381	48.918663	...	45.068765	8727.68	15	8	8	4	1	4	2006	2011

5 rows × 43 columns
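Columns prefixed with customers. carry values from each session's parent customer down to the session. The sketch below reproduces four of them for sessions 1 and 2 in plain Python, using the join_date values from the customers table above. The dictionary layout and the `parent_date_features` helper are illustrative assumptions, not how Featuretools stores or computes data.

```python
from datetime import datetime

# join_date for customers 2 and 5, copied from the customers table above.
join_date = {
    2: datetime(2012, 4, 15, 23, 31, 4),
    5: datetime(2010, 7, 17, 5, 27, 50),
}
# session_id -> customer_id, copied from the sessions feature matrix above.
session_customer = {1: 2, 2: 5}

def parent_date_features(session_id):
    """Look up the parent customer and derive date parts from join_date."""
    d = join_date[session_customer[session_id]]
    return {
        "customers.DAY(join_date)": d.day,
        "customers.MONTH(join_date)": d.month,
        "customers.WEEKDAY(join_date)": d.weekday(),  # Monday is 0
        "customers.YEAR(join_date)": d.year,
    }

session_features = {sid: parent_date_features(sid) for sid in session_customer}
for sid, feats in session_features.items():
    print(sid, feats)
```

The resulting values agree with the corresponding cells in the first two rows of the feature matrix above.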

Understanding Feature Output¶

In general, Featuretools references generated features through the feature name. In order to make features easier to understand, Featuretools offers two additional tools, featuretools.graph_feature() and featuretools.describe_feature(), to help explain what a feature is and the steps Featuretools took to generate it. Let’s look at this example feature:

[11]:

feature = features_defs[18]
feature

[11]:

<Feature: MODE(transactions.YEAR(transaction_time))>

Feature lineage graphs¶

Feature lineage graphs visually walk through feature generation. Starting from the base data, they show step by step the primitives applied and intermediate features generated to create the final feature.

[12]:

ft.graph_feature(feature)

[12]:

[Figure: feature lineage graph for MODE(transactions.WEEKDAY(transaction_time)). Starting from the transaction_time column of transactions, Step 1 applies the WEEKDAY transform; the result is grouped by session_id, and Step 2 applies the MODE aggregation to produce the feature on the target DataFrame, sessions.]

Feature descriptions¶

Featuretools can also automatically generate English sentence descriptions of features. Feature descriptions help to explain what a feature is, and can be further improved by including manually defined custom definitions. See Generating Feature Descriptions for more details on how to customize automatically generated feature descriptions.

[13]:

ft.describe_feature(feature)

[13]:

'The most frequently occurring value of the year of the "transaction_time" of all instances of "transactions" for each "session_id" in "sessions".'
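The description maps onto two steps: a transform (the year of each transaction_time) followed by an aggregation (the most frequent value within each session). Here is a plain-Python sketch using the five sampled transactions shown earlier; it only illustrates the computation and is not Featuretools' implementation.

```python
from collections import Counter, defaultdict
from datetime import datetime

# (session_id, transaction_time) pairs copied from the sampled transactions above.
transactions = [
    (5, "2014-01-01 01:20:10"),
    (17, "2014-01-01 04:10:15"),
    (31, "2014-01-01 07:50:10"),
    (30, "2014-01-01 07:35:00"),
    (4, "2014-01-01 00:58:30"),
]

# Step 1 (transform): YEAR(transaction_time) for every transaction.
years_by_session = defaultdict(list)
for session_id, ts in transactions:
    year = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").year
    years_by_session[session_id].append(year)

# Step 2 (aggregation): MODE of those years within each session.
mode_year = {
    session_id: Counter(years).most_common(1)[0][0]
    for session_id, years in years_by_session.items()
}
print(mode_year)
```

Because every sampled transaction falls in 2014, each session's mode is 2014; the feature would only vary on data that spans several years.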

What’s next?¶

Learn about Representing Data with EntitySets
Apply automated feature engineering with Deep Feature Synthesis
Explore runnable demos based on real world use cases
Can’t find what you’re looking for? Ask for help
