NOTICE
The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:
pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip
For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.
Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.
In [1]: import featuretools as ft
In [2]: data = ft.demo.load_mock_customer()
In this toy dataset, there are 3 tables. Each table is called an entity in Featuretools.
customers: unique customers who had sessions
sessions: unique sessions and associated attributes
transactions: list of events in each session
In [3]: customers_df = data["customers"]

In [4]: customers_df
Out[4]:
   customer_id zip_code           join_date date_of_birth
0            1    60091 2011-04-17 10:48:33    1994-07-18
1            2    13244 2012-04-15 23:31:04    1986-08-18
2            3    13244 2011-08-13 15:42:34    2003-11-21
3            4    60091 2011-04-08 20:08:14    2006-08-15
4            5    60091 2010-07-17 05:27:50    1984-07-28

In [5]: sessions_df = data["sessions"]

In [6]: sessions_df.sample(5)
Out[6]:
    session_id  customer_id   device       session_start
13          14            1   tablet 2014-01-01 03:28:00
6            7            3   tablet 2014-01-01 01:39:40
1            2            5   mobile 2014-01-01 00:17:20
28          29            1   mobile 2014-01-01 07:10:05
24          25            3  desktop 2014-01-01 05:59:40

In [7]: transactions_df = data["transactions"]

In [8]: transactions_df.sample(5)
Out[8]:
     transaction_id  session_id    transaction_time product_id  amount
74              232           5 2014-01-01 01:20:10          1  139.20
231              27          17 2014-01-01 04:10:15          2   90.79
434              36          31 2014-01-01 07:50:10          3   62.35
420              56          30 2014-01-01 07:35:00          3   72.70
54              444           4 2014-01-01 00:58:30          4   43.59
First, we specify a dictionary with all the entities in our dataset.
In [9]: entities = {
   ...:     "customers" : (customers_df, "customer_id"),
   ...:     "sessions" : (sessions_df, "session_id", "session_start"),
   ...:     "transactions" : (transactions_df, "transaction_id", "transaction_time")
   ...: }
   ...:
Second, we specify how the entities are related. When two entities have a one-to-many relationship, we call the “one” entity the “parent entity” and the “many” entity the “child entity”. A relationship between a parent and child is defined like this:
(parent_entity, parent_variable, child_entity, child_variable)
In this dataset we have two relationships:
In [10]: relationships = [("sessions", "session_id", "transactions", "session_id"),
   ....:                  ("customers", "customer_id", "sessions", "customer_id")]
   ....:
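Before handing relationships like these to DFS, it can help to confirm that each tuple really describes a one-to-many link. The following is a minimal sketch in plain Python, not part of Featuretools: `check_relationship` is a hypothetical helper, and the rows are made-up stand-ins for the real DataFrames.

```python
# Toy rows standing in for the real tables (hypothetical values).
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "sessions": [
        {"session_id": 1, "customer_id": 2},
        {"session_id": 2, "customer_id": 1},
        {"session_id": 3, "customer_id": 2},
    ],
    "transactions": [
        {"transaction_id": 10, "session_id": 1},
        {"transaction_id": 11, "session_id": 1},
        {"transaction_id": 12, "session_id": 3},
    ],
}

relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]


def check_relationship(tables, relationship):
    """Return a list of problems for one (parent, parent_var, child, child_var) tuple."""
    parent, parent_var, child, child_var = relationship
    problems = []
    parent_keys = [row[parent_var] for row in tables[parent]]
    # The "one" side must identify each parent row uniquely.
    if len(parent_keys) != len(set(parent_keys)):
        problems.append(f"{parent}.{parent_var} is not unique")
    # Every child row should point at an existing parent row.
    known = set(parent_keys)
    orphans = [row[child_var] for row in tables[child] if row[child_var] not in known]
    if orphans:
        problems.append(f"{child}.{child_var} has values missing from {parent}: {orphans}")
    return problems


report = {rel: check_relationship(tables, rel) for rel in relationships}
for rel, problems in report.items():
    print(rel, "OK" if not problems else problems)
```

An empty problem list for a tuple means the parent key is unique and every child row has a matching parent, which is what the parent/child ordering in the tuple assumes.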
Note
To manage setting up entities and relationships, we recommend using the EntitySet class which offers convenient APIs for managing data like this. See Representing Data with EntitySets for more information.
A minimal input to DFS is a set of entities, a list of relationships, and the “target_entity” to calculate features for. The output of DFS is a feature matrix and the corresponding list of feature definitions.
Let’s first create a feature matrix for each customer in the data:
In [11]: feature_matrix_customers, features_defs = ft.dfs(entities=entities,
   ....:                                                  relationships=relationships,
   ....:                                                  target_entity="customers")
   ....:

In [12]: feature_matrix_customers
Out[12]:
zip_code COUNT(sessions) MODE(sessions.device) NUM_UNIQUE(sessions.device) COUNT(transactions) MAX(transactions.amount) MEAN(transactions.amount) MIN(transactions.amount) MODE(transactions.product_id) NUM_UNIQUE(transactions.product_id) SKEW(transactions.amount) STD(transactions.amount) SUM(transactions.amount) DAY(date_of_birth) DAY(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date) YEAR(date_of_birth) YEAR(join_date) MAX(sessions.COUNT(transactions)) MAX(sessions.MEAN(transactions.amount)) MAX(sessions.MIN(transactions.amount)) MAX(sessions.NUM_UNIQUE(transactions.product_id)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.STD(transactions.amount)) MAX(sessions.SUM(transactions.amount)) MEAN(sessions.COUNT(transactions)) MEAN(sessions.MAX(transactions.amount)) MEAN(sessions.MEAN(transactions.amount)) MEAN(sessions.MIN(transactions.amount)) MEAN(sessions.NUM_UNIQUE(transactions.product_id)) MEAN(sessions.SKEW(transactions.amount)) MEAN(sessions.STD(transactions.amount)) MEAN(sessions.SUM(transactions.amount)) MIN(sessions.COUNT(transactions)) MIN(sessions.MAX(transactions.amount)) MIN(sessions.MEAN(transactions.amount)) MIN(sessions.NUM_UNIQUE(transactions.product_id)) MIN(sessions.SKEW(transactions.amount)) MIN(sessions.STD(transactions.amount)) MIN(sessions.SUM(transactions.amount)) MODE(sessions.DAY(session_start)) MODE(sessions.MODE(transactions.product_id)) MODE(sessions.MONTH(session_start)) MODE(sessions.WEEKDAY(session_start)) MODE(sessions.YEAR(session_start)) NUM_UNIQUE(sessions.DAY(session_start)) NUM_UNIQUE(sessions.MODE(transactions.product_id)) NUM_UNIQUE(sessions.MONTH(session_start)) NUM_UNIQUE(sessions.WEEKDAY(session_start)) NUM_UNIQUE(sessions.YEAR(session_start)) SKEW(sessions.COUNT(transactions)) SKEW(sessions.MAX(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) SKEW(sessions.MIN(transactions.amount)) SKEW(sessions.NUM_UNIQUE(transactions.product_id)) SKEW(sessions.STD(transactions.amount)) SKEW(sessions.SUM(transactions.amount)) STD(sessions.COUNT(transactions)) STD(sessions.MAX(transactions.amount)) STD(sessions.MEAN(transactions.amount)) STD(sessions.MIN(transactions.amount)) STD(sessions.NUM_UNIQUE(transactions.product_id)) STD(sessions.SKEW(transactions.amount)) STD(sessions.SUM(transactions.amount)) SUM(sessions.MAX(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) SUM(sessions.MIN(transactions.amount)) SUM(sessions.NUM_UNIQUE(transactions.product_id)) SUM(sessions.SKEW(transactions.amount)) SUM(sessions.STD(transactions.amount)) MODE(transactions.sessions.customer_id) MODE(transactions.sessions.device) NUM_UNIQUE(transactions.sessions.customer_id) NUM_UNIQUE(transactions.sessions.device)
customer_id
1 60091 8 mobile 3 126 139.43 71.631905 5.81 4 5 0.019698 40.442059 9025.62 18 17 7 4 0 6 1994 2011 25 88.755625 26.36 5 0.640252 46.905665 1613.93 15.750000 132.246250 72.774140 9.823750 5.000000 -0.059515 39.093244 1128.202500 12 118.90 50.623125 5 -1.038434 30.450261 809.97 1 4 1 2 2014 1 4 1 1 1 1.946018 -0.780493 -0.424949 2.440005 0.000000 -0.312355 0.778170 4.062019 7.322191 13.759314 6.954507 0.000000 0.589386 279.510713 1057.97 582.193117 78.59 40 -0.476122 312.745952 1 mobile 1 3
2 13244 7 desktop 3 93 146.81 77.422366 8.73 4 5 0.098259 37.705178 7200.28 18 15 8 4 0 6 1986 2012 18 96.581000 56.46 5 0.755711 47.935920 1320.64 13.285714 133.090000 78.415122 22.085714 5.000000 -0.039663 36.957218 1028.611429 8 100.04 61.910000 5 -0.763603 27.839228 634.84 1 3 1 2 2014 1 4 1 1 1 -0.303276 -1.539467 0.235296 2.154929 0.000000 0.013087 -0.440929 3.450328 17.221593 11.477071 15.874374 0.000000 0.509798 251.609234 931.63 548.905851 154.60 35 -0.277640 258.700528 2 desktop 1 3
3 13244 6 desktop 3 93 149.15 67.060430 5.89 1 5 0.418230 43.683296 6236.62 21 13 11 8 4 5 2003 2011 18 82.109444 20.06 5 0.854976 50.110120 1477.97 15.500000 141.271667 67.539577 11.035000 4.833333 0.381014 42.883316 1039.436667 11 126.74 55.579412 4 -0.289466 35.704680 889.21 1 1 1 2 2014 1 4 1 1 1 -1.507217 -0.941078 0.678544 1.000771 -2.449490 -0.245703 2.246479 2.428992 10.724241 11.174282 5.424407 0.408248 0.429374 219.021420 847.63 405.237462 66.21 29 2.286086 257.299895 3 desktop 1 3
4 60091 8 mobile 3 109 149.95 80.070459 5.73 2 5 -0.036348 45.068765 8727.68 15 8 8 4 1 4 2006 2011 18 110.450000 54.83 5 0.382868 54.293903 1351.46 13.625000 144.748750 81.207189 16.438750 4.625000 0.000346 44.515729 1090.960000 10 139.20 70.638182 4 -0.711744 29.026424 771.68 1 1 1 2 2014 1 5 1 1 1 0.282488 0.027256 1.980948 2.103510 -0.644061 -1.065663 -0.391805 3.335416 3.514421 13.027258 16.960575 0.517549 0.387884 235.992478 1157.99 649.657515 131.51 37 0.002764 356.125829 4 mobile 1 3
5 60091 6 mobile 3 79 149.02 80.375443 7.55 5 5 -0.025941 44.095630 6349.66 28 17 7 7 5 5 1984 2010 18 94.481667 20.65 5 0.602209 51.149250 1700.67 13.166667 139.960000 78.705187 14.415000 5.000000 0.002397 43.312326 1058.276667 8 128.51 66.666667 5 -0.539060 36.734681 543.18 1 3 1 2 2014 1 5 1 1 1 -0.317685 -0.333796 0.335175 -0.470410 0.000000 0.204548 0.472342 3.600926 7.928001 11.007471 4.961414 0.000000 0.415426 402.775486 839.76 472.231119 86.49 30 0.014384 259.873954 5 mobile 1 3
We now have dozens of new features to describe a customer’s behavior.
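Feature names such as MEAN(sessions.COUNT(transactions)) describe aggregations stacked across the two relationships. To make that concrete, here is a minimal sketch in plain Python of how such a value could be computed by hand. It is not how Featuretools is implemented, and the rows are hypothetical rather than taken from the demo dataset.

```python
from collections import defaultdict
from statistics import mean

# (session_id, customer_id) pairs; hypothetical values.
sessions = [(1, 1), (2, 1), (3, 2)]

# (transaction_id, session_id, amount) triples; hypothetical values.
transactions = [
    (1, 1, 10.0),
    (2, 1, 30.0),
    (3, 2, 20.0),
    (4, 3, 5.0),
    (5, 3, 15.0),
    (6, 3, 40.0),
]

# Depth 1: COUNT(transactions) for each session.
count_per_session = defaultdict(int)
for _transaction_id, session_id, _amount in transactions:
    count_per_session[session_id] += 1

# Depth 2: MEAN(sessions.COUNT(transactions)) for each customer,
# i.e. the average of the per-session counts over that customer's sessions.
counts_by_customer = defaultdict(list)
for session_id, customer_id in sessions:
    counts_by_customer[customer_id].append(count_per_session[session_id])

mean_count = {cid: mean(counts) for cid, counts in counts_by_customer.items()}
print(mean_count)
```

The same relationship is visible in the matrix above: customer 1 has 126 transactions across 8 sessions, and MEAN(sessions.COUNT(transactions)) for that customer is 15.75, which is 126 / 8.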
One of the reasons DFS is so powerful is that it can create a feature matrix for any entity in our data. For example, we can build features for sessions:
In [13]: feature_matrix_sessions, features_defs = ft.dfs(entities=entities,
   ....:                                                 relationships=relationships,
   ....:                                                 target_entity="sessions")
   ....:

In [14]: feature_matrix_sessions.head(5)
Out[14]:
customer_id device COUNT(transactions) MAX(transactions.amount) MEAN(transactions.amount) MIN(transactions.amount) MODE(transactions.product_id) NUM_UNIQUE(transactions.product_id) SKEW(transactions.amount) STD(transactions.amount) SUM(transactions.amount) DAY(session_start) MONTH(session_start) WEEKDAY(session_start) YEAR(session_start) customers.zip_code MODE(transactions.DAY(transaction_time)) MODE(transactions.MONTH(transaction_time)) MODE(transactions.WEEKDAY(transaction_time)) MODE(transactions.YEAR(transaction_time)) NUM_UNIQUE(transactions.DAY(transaction_time)) NUM_UNIQUE(transactions.MONTH(transaction_time)) NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) NUM_UNIQUE(transactions.YEAR(transaction_time)) customers.COUNT(sessions) customers.MODE(sessions.device) customers.NUM_UNIQUE(sessions.device) customers.COUNT(transactions) customers.MAX(transactions.amount) customers.MEAN(transactions.amount) customers.MIN(transactions.amount) customers.MODE(transactions.product_id) customers.NUM_UNIQUE(transactions.product_id) customers.SKEW(transactions.amount) customers.STD(transactions.amount) customers.SUM(transactions.amount) customers.DAY(date_of_birth) customers.DAY(join_date) customers.MONTH(date_of_birth) customers.MONTH(join_date) customers.WEEKDAY(date_of_birth) customers.WEEKDAY(join_date) customers.YEAR(date_of_birth) customers.YEAR(join_date)
session_id
1 2 desktop 16 141.66 76.813125 20.91 3 5 0.295458 41.600976 1229.01 1 1 2 2014 13244 1 1 2 2014 1 1 1 1 7 desktop 3 93 146.81 77.422366 8.73 4 5 0.098259 37.705178 7200.28 18 15 8 4 0 6 1986 2012
2 5 mobile 10 135.25 74.696000 9.32 5 5 -0.160550 45.893591 746.96 1 1 2 2014 60091 1 1 2 2014 1 1 1 1 6 mobile 3 79 149.02 80.375443 7.55 5 5 -0.025941 44.095630 6349.66 28 17 7 7 5 5 1984 2010
3 4 mobile 15 147.73 88.600000 8.70 1 5 -0.324012 46.240016 1329.00 1 1 2 2014 60091 1 1 2 2014 1 1 1 1 8 mobile 3 109 149.95 80.070459 5.73 2 5 -0.036348 45.068765 8727.68 15 8 8 4 1 4 2006 2011
4 1 mobile 25 129.00 64.557200 6.29 5 5 0.234349 40.187205 1613.93 1 1 2 2014 60091 1 1 2 2014 1 1 1 1 8 mobile 3 126 139.43 71.631905 5.81 4 5 0.019698 40.442059 9025.62 18 17 7 4 0 6 1994 2011
5 4 mobile 11 139.20 70.638182 7.43 5 5 0.336381 48.918663 777.02 1 1 2 2014 60091 1 1 2 2014 1 1 1 1 8 mobile 3 109 149.95 80.070459 5.73 2 5 -0.036348 45.068765 8727.68 15 8 8 4 1 4 2006 2011
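Columns prefixed with customers. in the session matrix are features of the parent entity carried down onto each child row. The sketch below illustrates that idea in plain Python using a few values from the output above. It is an illustration only, not Featuretools' implementation, and `add_parent_features` is a hypothetical helper.

```python
# Customer-level values copied from the feature matrices shown above.
customer_features = {
    2: {"zip_code": "13244", "COUNT(sessions)": 7},
    4: {"zip_code": "60091", "COUNT(sessions)": 8},
    5: {"zip_code": "60091", "COUNT(sessions)": 6},
}

# The first three sessions and the customer each one belongs to.
sessions = [
    {"session_id": 1, "customer_id": 2},
    {"session_id": 2, "customer_id": 5},
    {"session_id": 3, "customer_id": 4},
]


def add_parent_features(child_rows, parent_features, key, parent_name):
    """Copy each parent's features onto its child rows as '<parent_name>.<feature>'."""
    enriched = []
    for row in child_rows:
        parent = parent_features[row[key]]
        new_row = dict(row)  # leave the input rows untouched
        for name, value in parent.items():
            new_row[f"{parent_name}.{name}"] = value
        enriched.append(new_row)
    return enriched


session_matrix = add_parent_features(sessions, customer_features, "customer_id", "customers")
for row in session_matrix:
    print(row)
```

Sessions that share a customer receive identical parent values, which is why sessions 3 and 5 (both customer 4) show the same customers.* columns in the output above.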
In general, Featuretools references generated features through the feature name. To make features easier to understand, Featuretools offers two additional tools, featuretools.graph_feature() and featuretools.describe_feature(), to help explain what a feature is and the steps Featuretools took to generate it. Let’s look at this example feature:
In [15]: feature = features_defs[18]

In [16]: feature
Out[16]: <Feature: MODE(transactions.WEEKDAY(transaction_time))>
Feature lineage graphs visually walk through feature generation. Starting from the base data, they show step by step the primitives applied and intermediate features generated to create the final feature.
In [17]: ft.graph_feature(feature)
Out[17]: <graphviz.dot.Digraph at 0x7fce49c11ad0>
Featuretools can also automatically generate English sentence descriptions of features. Feature descriptions help to explain what a feature is, and can be further improved by including manually defined custom definitions. See Generating Feature Descriptions for more details on how to customize automatically generated feature descriptions.
In [18]: ft.describe_feature(feature)
Out[18]: 'The most frequently occurring value of the day of the week of the "transaction_time" of all instances of "transactions" for each "session_id" in "sessions".'
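The description can be checked by computing the feature by hand for a single session. The sketch below does this with the standard library only. It assumes weekdays are numbered with Monday as 0 (the convention of Python's `datetime.weekday()`), which is consistent with WEEKDAY(session_start) being 2 for 2014-01-01 in the output above. The timestamps are hypothetical, apart from the first, which appears in the transactions sample.

```python
from datetime import datetime
from statistics import mode

# Transaction times for one session. The first value appears in the sample
# above; the others are hypothetical.
transaction_times = [
    "2014-01-01 01:20:10",
    "2014-01-01 01:21:15",
    "2014-01-01 01:22:20",
    "2014-01-02 00:00:05",
]

# Transform step: WEEKDAY(transaction_time), with Monday = 0.
weekdays = [
    datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").weekday()
    for stamp in transaction_times
]

# Aggregation step: MODE(...) over all transactions in the session.
most_common_weekday = mode(weekdays)
print(weekdays, most_common_weekday)
```

The two steps mirror the feature name read from the inside out: a transform primitive is applied to each transaction, then an aggregation primitive summarizes the results per session.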
Learn about Representing Data with EntitySets
Apply automated feature engineering with Deep Feature Synthesis
Explore runnable demos based on real world use cases
Can’t find what you’re looking for? Ask for Help