What is Featuretools?#
Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
5 Minute Quick Start#
Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.
[1]:
import featuretools as ft
Load Mock Data#
[2]:
data = ft.demo.load_mock_customer()
Prepare data#
In this toy dataset, there are 3 DataFrames:
customers: unique customers who had sessions
sessions: unique sessions and associated attributes
transactions: list of events in each session
[3]:
customers_df = data["customers"]
customers_df
[3]:
| | customer_id | zip_code | join_date | birthday |
---|---|---|---|---|
0 | 1 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
1 | 2 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
2 | 3 | 13244 | 2011-08-13 15:42:34 | 2003-11-21 |
3 | 4 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
4 | 5 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
[4]:
sessions_df = data["sessions"]
sessions_df.sample(5)
[4]:
| | session_id | customer_id | device | session_start |
---|---|---|---|---|
13 | 14 | 1 | tablet | 2014-01-01 03:28:00 |
6 | 7 | 3 | tablet | 2014-01-01 01:39:40 |
1 | 2 | 5 | mobile | 2014-01-01 00:17:20 |
28 | 29 | 1 | mobile | 2014-01-01 07:10:05 |
24 | 25 | 3 | desktop | 2014-01-01 05:59:40 |
[5]:
transactions_df = data["transactions"]
transactions_df.sample(5)
[5]:
| | transaction_id | session_id | transaction_time | product_id | amount |
---|---|---|---|---|---|
74 | 232 | 5 | 2014-01-01 01:20:10 | 1 | 139.20 |
231 | 27 | 17 | 2014-01-01 04:10:15 | 2 | 90.79 |
434 | 36 | 31 | 2014-01-01 07:50:10 | 3 | 62.35 |
420 | 56 | 30 | 2014-01-01 07:35:00 | 3 | 72.70 |
54 | 444 | 4 | 2014-01-01 00:58:30 | 4 | 43.59 |
First, we specify a dictionary with all the DataFrames in our dataset. The DataFrames are passed in with their index column and time index column if one exists for the DataFrame.
[6]:
dataframes = {
"customers": (customers_df, "customer_id"),
"sessions": (sessions_df, "session_id", "session_start"),
"transactions": (transactions_df, "transaction_id", "transaction_time"),
}
Second, we specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the “one” DataFrame the “parent DataFrame” and the “many” DataFrame the “child DataFrame”. A relationship between a parent and child is defined like this:
(parent_dataframe, parent_column, child_dataframe, child_column)
In this dataset, we have two relationships:
[7]:
relationships = [
("sessions", "session_id", "transactions", "session_id"),
("customers", "customer_id", "sessions", "customer_id"),
]
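As a rough illustration of what each relationship tuple asserts, the sketch below checks that the parent column holds unique keys and that every child key points at an existing parent row. It uses plain Python lists of dictionaries with invented values in place of the real DataFrames, and `orphaned_keys` is a hypothetical helper, not part of Featuretools.

```python
# Toy rows standing in for the DataFrames; all values are invented for illustration.
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "sessions": [
        {"session_id": 10, "customer_id": 1},
        {"session_id": 11, "customer_id": 2},
    ],
    "transactions": [
        {"transaction_id": 100, "session_id": 10},
        {"transaction_id": 101, "session_id": 10},
        {"transaction_id": 102, "session_id": 11},
    ],
}

# Same tuple layout as above:
# (parent_dataframe, parent_column, child_dataframe, child_column)
relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]


def orphaned_keys(tables, relationship):
    """Return child key values with no matching parent row (hypothetical helper)."""
    parent, parent_col, child, child_col = relationship
    parent_keys = [row[parent_col] for row in tables[parent]]
    # The "one" side of a one-to-many relationship must hold each key exactly once.
    assert len(parent_keys) == len(set(parent_keys)), f"{parent}.{parent_col} is not unique"
    child_keys = {row[child_col] for row in tables[child]}
    return sorted(child_keys - set(parent_keys))


for rel in relationships:
    print(rel, "orphans:", orphaned_keys(tables, rel))
```

An empty list for every relationship means each child row can be traced back to exactly one parent row, which is the property the parent and child terminology relies on.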
Note
To manage setting up DataFrames and relationships, we recommend using the EntitySet class, which offers convenient APIs for managing data like this. See Representing Data with EntitySets for more information.
Run Deep Feature Synthesis#
A minimal input to DFS is a dictionary of DataFrames, a list of relationships, and the name of the target DataFrame whose features we want to calculate. The output of DFS is a feature matrix and the corresponding list of feature definitions.
Let’s first create a feature matrix for each customer in the data:
[8]:
feature_matrix_customers, features_defs = ft.dfs(
dataframes=dataframes,
relationships=relationships,
target_dataframe_name="customers",
)
feature_matrix_customers
[8]:
customer_id | COUNT(sessions) | MODE(sessions.device) | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | MODE(transactions.product_id) | NUM_UNIQUE(transactions.product_id) | SKEW(transactions.amount) | ... | STD(sessions.SKEW(transactions.amount)) | STD(sessions.SUM(transactions.amount)) | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.NUM_UNIQUE(transactions.product_id)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | MODE(transactions.sessions.device) | NUM_UNIQUE(transactions.sessions.device) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 8 | mobile | 3 | 126 | 139.43 | 71.631905 | 5.81 | 4 | 5 | 0.019698 | ... | 0.589386 | 279.510713 | 1057.97 | 582.193117 | 78.59 | 40.0 | -0.476122 | 312.745952 | mobile | 3 |
2 | 7 | desktop | 3 | 93 | 146.81 | 77.422366 | 8.73 | 4 | 5 | 0.098259 | ... | 0.509798 | 251.609234 | 931.63 | 548.905851 | 154.60 | 35.0 | -0.277640 | 258.700528 | desktop | 3 |
3 | 6 | desktop | 3 | 93 | 149.15 | 67.060430 | 5.89 | 1 | 5 | 0.418230 | ... | 0.429374 | 219.021420 | 847.63 | 405.237462 | 66.21 | 29.0 | 2.286086 | 257.299895 | desktop | 3 |
4 | 8 | mobile | 3 | 109 | 149.95 | 80.070459 | 5.73 | 2 | 5 | -0.036348 | ... | 0.387884 | 235.992478 | 1157.99 | 649.657515 | 131.51 | 37.0 | 0.002764 | 356.125829 | mobile | 3 |
5 | 6 | mobile | 3 | 79 | 149.02 | 80.375443 | 7.55 | 5 | 5 | -0.025941 | ... | 0.415426 | 402.775486 | 839.76 | 472.231119 | 86.49 | 30.0 | 0.014384 | 259.873954 | mobile | 3 |
5 rows × 74 columns
We now have dozens of new features to describe a customer’s behavior.
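To make the feature names in this matrix more concrete, here is a minimal sketch of how three of them could be computed by hand. It is not how Featuretools implements DFS; it uses plain Python, invented toy rows rather than the mock dataset, and a hypothetical `collect` helper that stands in for the parent-child join.

```python
from collections import defaultdict
from statistics import mean

# Invented toy rows, not the mock dataset.
sessions = [
    {"session_id": 1, "customer_id": "a"},
    {"session_id": 2, "customer_id": "a"},
    {"session_id": 3, "customer_id": "b"},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 20.0},
    {"session_id": 3, "amount": 5.0},
    {"session_id": 3, "amount": 15.0},
]


def collect(rows, key, value):
    """Group `value` by `key` (hypothetical stand-in for the parent-child join)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return dict(groups)


amounts_by_session = collect(transactions, "session_id", "amount")
sessions_by_customer = collect(sessions, "customer_id", "session_id")

# COUNT(sessions): aggregate the direct children of each customer.
count_sessions = {c: len(s_ids) for c, s_ids in sessions_by_customer.items()}

# SUM(sessions.MAX(transactions.amount)): aggregate per session first,
# then aggregate those per-session values per customer.
session_max = {s: max(amounts) for s, amounts in amounts_by_session.items()}
sum_of_session_max = {
    c: sum(session_max[s] for s in s_ids) for c, s_ids in sessions_by_customer.items()
}

# MEAN(transactions.amount): aggregate all of a customer's transactions at once.
mean_amount = {
    c: mean(a for s in s_ids for a in amounts_by_session[s])
    for c, s_ids in sessions_by_customer.items()
}

print(count_sessions)
print(sum_of_session_max)
print(mean_amount)
```

Reading a feature name from the inside out gives the order of the steps: the innermost primitive runs on the DataFrame furthest from the target, and each enclosing primitive aggregates one level closer to it.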
Change target DataFrame#
One of the reasons DFS is so powerful is that it can create a feature matrix for any DataFrame in our dataset. For example, we can build features for sessions by changing the target DataFrame:
[10]:
feature_matrix_sessions, features_defs = ft.dfs(
dataframes=dataframes, relationships=relationships, target_dataframe_name="sessions"
)
feature_matrix_sessions.head(5)
[10]:
session_id | customer_id | device | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | MODE(transactions.product_id) | NUM_UNIQUE(transactions.product_id) | SKEW(transactions.amount) | STD(transactions.amount) | ... | customers.STD(transactions.amount) | customers.SUM(transactions.amount) | customers.DAY(birthday) | customers.DAY(join_date) | customers.MONTH(birthday) | customers.MONTH(join_date) | customers.WEEKDAY(birthday) | customers.WEEKDAY(join_date) | customers.YEAR(birthday) | customers.YEAR(join_date) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | desktop | 16 | 141.66 | 76.813125 | 20.91 | 3 | 5 | 0.295458 | 41.600976 | ... | 37.705178 | 7200.28 | 18 | 15 | 8 | 4 | 0 | 6 | 1986 | 2012 |
2 | 5 | mobile | 10 | 135.25 | 74.696000 | 9.32 | 5 | 5 | -0.160550 | 45.893591 | ... | 44.095630 | 6349.66 | 28 | 17 | 7 | 7 | 5 | 5 | 1984 | 2010 |
3 | 4 | mobile | 15 | 147.73 | 88.600000 | 8.70 | 1 | 5 | -0.324012 | 46.240016 | ... | 45.068765 | 8727.68 | 15 | 8 | 8 | 4 | 1 | 4 | 2006 | 2011 |
4 | 1 | mobile | 25 | 129.00 | 64.557200 | 6.29 | 5 | 5 | 0.234349 | 40.187205 | ... | 40.442059 | 9025.62 | 18 | 17 | 7 | 4 | 0 | 6 | 1994 | 2011 |
5 | 4 | mobile | 11 | 139.20 | 70.638182 | 7.43 | 5 | 5 | 0.336381 | 48.918663 | ... | 45.068765 | 8727.68 | 15 | 8 | 8 | 4 | 1 | 4 | 2006 | 2011 |
5 rows × 43 columns
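The `customers.`-prefixed columns are values computed on the parent customer and carried down to each of that customer's sessions. As a hand-rolled sketch (plain Python rather than Featuretools), the code below recomputes the datetime columns for session 1, which belongs to customer 2, from the `join_date` and `birthday` shown in the customers table. Counting weekdays from Monday = 0 is an assumption; it agrees with the values displayed above.

```python
from datetime import datetime

# Customer 2, as shown in the customers table.
customer = {
    "customer_id": 2,
    "join_date": datetime(2012, 4, 15, 23, 31, 4),
    "birthday": datetime(1986, 8, 18),
}

# Datetime transforms; WEEKDAY is assumed to count from Monday = 0.
transforms = {
    "DAY": lambda ts: ts.day,
    "MONTH": lambda ts: ts.month,
    "WEEKDAY": lambda ts: ts.weekday(),
    "YEAR": lambda ts: ts.year,
}

# Apply every transform to both datetime columns of the parent row.
parent_features = {
    f"customers.{name}({column})": fn(customer[column])
    for name, fn in transforms.items()
    for column in ("birthday", "join_date")
}

for name, value in parent_features.items():
    print(f"{name} = {value}")
```

Every session of customer 2 receives these same values, which is why rows that share a `customer_id` in the sessions feature matrix repeat their `customers.` columns.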
Understanding Feature Output#
In general, Featuretools references generated features through the feature name. In order to make features easier to understand, Featuretools offers two additional tools, featuretools.graph_feature() and featuretools.describe_feature(), to help explain what a feature is and the steps Featuretools took to generate it. Let’s look at this example feature:
[11]:
feature = features_defs[18]
feature
[11]:
<Feature: MODE(transactions.YEAR(transaction_time))>
Feature lineage graphs#
Feature lineage graphs visually walk through feature generation. Starting from the base data, they show step by step the primitives applied and intermediate features generated to create the final feature.
[12]:
ft.graph_feature(feature)
[12]:
Feature descriptions#
Featuretools can also automatically generate English sentence descriptions of features. Feature descriptions help to explain what a feature is, and can be further improved by including manually defined custom definitions. See Generating Feature Descriptions for more details on how to customize automatically generated feature descriptions.
[13]:
ft.describe_feature(feature)
[13]:
'The most frequently occurring value of the year of the "transaction_time" of all instances of "transactions" for each "session_id" in "sessions".'
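The description corresponds to two steps: a transform applied to each transaction, followed by an aggregation over each session's transactions. The sketch below reproduces those two steps by hand in plain Python on invented toy rows; it is an illustration of the feature's meaning, not the code Featuretools runs.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mode

# Invented toy rows, not the mock dataset.
transactions = [
    {"session_id": 1, "transaction_time": datetime(2013, 12, 31, 23, 59)},
    {"session_id": 1, "transaction_time": datetime(2014, 1, 1, 0, 1)},
    {"session_id": 1, "transaction_time": datetime(2014, 1, 1, 0, 2)},
    {"session_id": 2, "transaction_time": datetime(2014, 1, 1, 0, 5)},
]

# Step 1 (transform): YEAR(transaction_time) for every transaction,
# grouped by the session each transaction belongs to.
years_by_session = defaultdict(list)
for row in transactions:
    years_by_session[row["session_id"]].append(row["transaction_time"].year)

# Step 2 (aggregation): MODE of those years within each session.
mode_year = {session: mode(years) for session, years in years_by_session.items()}

print(mode_year)
```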
What’s next?#
Learn about Representing Data with EntitySets
Apply automated feature engineering with Deep Feature Synthesis
Explore runnable demos based on real world use cases
Can’t find what you’re looking for? Ask for help