NOTICE
The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:
pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip
For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.
An EntitySet is a collection of entities and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take entities and relationships as separate arguments, it is recommended to create an EntitySet, so you can more easily manipulate your data as needed.
Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers so that the result looks like something you might see in a log file:
In [1]: import featuretools as ft

In [2]: data = ft.demo.load_mock_customer()

In [3]: transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])

In [4]: transactions_df.sample(10)
Out[4]:
     transaction_id  session_id     transaction_time  product_id  amount  customer_id   device        session_start  zip_code            join_date  date_of_birth
264             380          21  2014-01-01 05:14:10           5   57.09            4  desktop  2014-01-01 05:02:15     60091  2011-04-08 20:08:14     2006-08-15
19              244          10  2014-01-01 02:34:55           2  116.95            2   tablet  2014-01-01 02:31:40     13244  2012-04-15 23:31:04     1986-08-18
314             299           6  2014-01-01 01:32:05           4   64.99            1   tablet  2014-01-01 01:23:25     60091  2011-04-17 10:48:33     1994-07-18
290              78           4  2014-01-01 00:54:10           1   37.50            1   mobile  2014-01-01 00:44:25     60091  2011-04-17 10:48:33     1994-07-18
379             457          27  2014-01-01 06:37:35           1   19.16            1   mobile  2014-01-01 06:34:20     60091  2011-04-17 10:48:33     1994-07-18
335             477           9  2014-01-01 02:30:35           3   41.70            1  desktop  2014-01-01 02:15:25     60091  2011-04-17 10:48:33     1994-07-18
293             103           4  2014-01-01 00:57:25           5   20.79            1   mobile  2014-01-01 00:44:25     60091  2011-04-17 10:48:33     1994-07-18
271             390          22  2014-01-01 05:21:45           2   54.83            4  desktop  2014-01-01 05:21:45     60091  2011-04-08 20:08:14     2006-08-15
404             476          29  2014-01-01 07:24:10           4  121.59            1   mobile  2014-01-01 07:10:05     60091  2011-04-17 10:48:33     1994-07-18
179              90           3  2014-01-01 00:35:45           1   75.73            4   mobile  2014-01-01 00:28:10     60091  2011-04-08 20:08:14     2006-08-15
And the second dataframe is a list of products involved in those transactions.
In [5]: products_df = data["products"]

In [6]: products_df
Out[6]:
  product_id brand
0          1     B
1          2     B
2          3     B
3          4     B
4          5     A
First, we initialize an EntitySet. If you’d like to give it a name, you can optionally provide an id to the constructor.
In [7]: es = ft.EntitySet(id="customer_data")
To get started, we load the transactions dataframe as an entity.
In [8]: es = es.entity_from_dataframe(entity_id="transactions",
   ...:                               dataframe=transactions_df,
   ...:                               index="transaction_id",
   ...:                               time_index="transaction_time",
   ...:                               variable_types={"product_id": ft.variable_types.Categorical,
   ...:                                               "zip_code": ft.variable_types.ZIPCode})
   ...:

In [9]: es
Out[9]:
Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 11]
  Relationships:
    No relationships
Note
You can also display your entity set structure graphically by calling EntitySet.plot().
This method loads each column in the dataframe as a variable. We can see the variables in an entity using the code below.
In [10]: es["transactions"].variables
Out[10]:
[<Variable: transaction_id (dtype = index)>,
 <Variable: session_id (dtype = numeric)>,
 <Variable: transaction_time (dtype: datetime_time_index, format: None)>,
 <Variable: amount (dtype = numeric)>,
 <Variable: customer_id (dtype = numeric)>,
 <Variable: device (dtype = categorical)>,
 <Variable: session_start (dtype: datetime, format: None)>,
 <Variable: join_date (dtype: datetime, format: None)>,
 <Variable: date_of_birth (dtype: datetime, format: None)>,
 <Variable: product_id (dtype = categorical)>,
 <Variable: zip_code (dtype = zip_code)>]
In the call to entity_from_dataframe, we specified three important parameters:
The index parameter specifies the column that uniquely identifies rows in the dataframe.
The time_index parameter specifies the column that tells Featuretools when each row of data was created.
The variable_types parameter indicates that “product_id” should be interpreted as a Categorical variable, even though it is just an integer in the underlying data.
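Because the index column has to identify each row uniquely, it can help to check for duplicates before loading a dataframe. The snippet below is a minimal, library-free sketch of that check. It uses plain dictionaries in place of a dataframe, and the helper name duplicate_index_values and the sample rows are hypothetical, not part of Featuretools.

```python
from collections import Counter


def duplicate_index_values(rows, index):
    """Return the values of `index` that appear in more than one row."""
    counts = Counter(row[index] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical rows standing in for a dataframe of transactions.
rows = [
    {"transaction_id": 1, "product_id": 5},
    {"transaction_id": 2, "product_id": 2},
    {"transaction_id": 2, "product_id": 3},  # repeats transaction_id 2
]

# A non-empty result means the column is not a valid index as-is.
print(duplicate_index_values(rows, "transaction_id"))
```

An empty list means every value occurs exactly once, so the column is a reasonable candidate for the index parameter.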
Now, we can do the same thing with our products dataframe.
In [11]: es = es.entity_from_dataframe(entity_id="products",
   ....:                               dataframe=products_df,
   ....:                               index="product_id")
   ....:

In [12]: es
Out[12]:
Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    No relationships
With two entities in our entity set, we can add a relationship between them.
We want to relate these two entities by the columns called “product_id” in each entity. Each product has multiple transactions associated with it, so the products entity is called the parent entity, while the transactions entity is known as the child entity. When specifying relationships, we list the variable in the parent entity first. Note that each ft.Relationship must denote a one-to-many relationship rather than a relationship which is one-to-one or many-to-many.
In [13]: new_relationship = ft.Relationship(es["products"]["product_id"],
   ....:                                    es["transactions"]["product_id"])
   ....:

In [14]: es = es.add_relationship(new_relationship)

In [15]: es
Out[15]:
Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id
Now, we see the relationship has been added to our entity set.
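The one-to-many requirement can be sanity-checked on the raw key columns before a relationship is added. The following is a rough sketch in plain Python rather than a Featuretools API: the function check_one_to_many and the sample keys are hypothetical. It reports whether the parent keys are unique and which child keys have no matching parent.

```python
def check_one_to_many(parent_keys, child_keys):
    """Summarize whether two key columns can form a one-to-many relationship."""
    parent_keys = list(parent_keys)
    unique_parents = set(parent_keys)
    return {
        # Each parent row must be identified by exactly one key value.
        "parent_is_unique": len(unique_parents) == len(parent_keys),
        # Child keys that point at no parent row at all.
        "orphaned_child_keys": sorted(set(child_keys) - unique_parents),
    }


# Hypothetical key columns: five products, several transactions per product.
product_ids = [1, 2, 3, 4, 5]
transaction_product_ids = [5, 2, 3, 4, 3, 1, 1]

print(check_one_to_many(product_ids, transaction_product_ids))
```

If the parent keys repeat, the data describes a many-to-many relationship and would need to be restructured first.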
When working with raw data, it is common to have sufficient information to justify the creation of new entities. In order to create a new entity and relationship for sessions, we “normalize” the transactions entity.
In [16]: es = es.normalize_entity(base_entity_id="transactions",
   ....:                          new_entity_id="sessions",
   ....:                          index="session_id",
   ....:                          make_time_index="session_start",
   ....:                          additional_variables=["device", "customer_id", "zip_code", "session_start", "join_date"])
   ....:

In [17]: es
Out[17]:
Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 6]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
Looking at the output above, we see this method performed two operations:
It created a new entity called “sessions” based on the “session_id” and “session_start” variables in “transactions”.
It added a relationship connecting “transactions” and “sessions”.
If we look at the variables in transactions and the new sessions entity, we see two more operations that were performed automatically.
In [18]: es["transactions"].variables
Out[18]:
[<Variable: transaction_id (dtype = index)>,
 <Variable: session_id (dtype = id)>,
 <Variable: transaction_time (dtype: datetime_time_index, format: None)>,
 <Variable: amount (dtype = numeric)>,
 <Variable: date_of_birth (dtype: datetime, format: None)>,
 <Variable: product_id (dtype = id)>]

In [19]: es["sessions"].variables
Out[19]:
[<Variable: session_id (dtype = index)>,
 <Variable: device (dtype = categorical)>,
 <Variable: customer_id (dtype = numeric)>,
 <Variable: zip_code (dtype = zip_code)>,
 <Variable: session_start (dtype: datetime_time_index, format: None)>,
 <Variable: join_date (dtype: datetime, format: None)>]
It removed “device”, “customer_id”, “zip_code” and “join_date” from “transactions” and created new variables for them in the sessions entity. This reduces redundant information, as those properties of a session don’t change between transactions.
It moved “session_start” into the new sessions entity and marked it as the time index variable, indicating the beginning of a session. If the base entity has a time index and make_time_index is not set, normalize_entity will create a time index for the new entity. In this case it would create a new time index called “first_transactions_time” using the time of the first transaction of each session. If we don’t want this time index to be created, we can set make_time_index=False.
If we look at the dataframes, we can see what normalize_entity did to the actual data.
In [20]: es["sessions"].df.head(5)
Out[20]:
   session_id   device  customer_id  zip_code        session_start            join_date
1           1  desktop            2     13244  2014-01-01 00:00:00  2012-04-15 23:31:04
2           2   mobile            5     60091  2014-01-01 00:17:20  2010-07-17 05:27:50
3           3   mobile            4     60091  2014-01-01 00:28:10  2011-04-08 20:08:14
4           4   mobile            1     60091  2014-01-01 00:44:25  2011-04-17 10:48:33
5           5   mobile            4     60091  2014-01-01 01:11:30  2011-04-08 20:08:14

In [21]: es["transactions"].df.head(5)
Out[21]:
     transaction_id  session_id     transaction_time  amount  date_of_birth  product_id
298             298           1  2014-01-01 00:00:00  127.64     1986-08-18           5
2                 2           1  2014-01-01 00:01:05  109.48     1986-08-18           2
308             308           1  2014-01-01 00:02:10   95.06     1986-08-18           3
116             116           1  2014-01-01 00:03:15   78.92     1986-08-18           4
371             371           1  2014-01-01 00:04:20   31.54     1986-08-18           3
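The effect of normalization on the data can be approximated by hand. The sketch below is plain Python and is not how Featuretools implements normalize_entity: the function normalize_rows, the field name first_time, and the sample rows are all hypothetical. It splits a list of child rows into one parent row per key, moves the listed columns to the parent, and records the earliest child timestamp for each parent.

```python
from datetime import datetime


def normalize_rows(rows, index, additional_variables, base_time_index=None):
    """Split rows into (parent_rows, child_rows) keyed by `index`."""
    moved = set(additional_variables)
    parents = {}
    for row in rows:
        key = row[index]
        if key not in parents:
            # Keep one copy of the per-parent values; they repeat on every child row.
            parents[key] = {index: key, **{name: row[name] for name in additional_variables}}
        if base_time_index is not None:
            # Track the earliest child timestamp seen for this parent.
            seen = parents[key].get("first_time")
            current = row[base_time_index]
            if seen is None or current < seen:
                parents[key]["first_time"] = current
    # The child table keeps the key column but drops the moved columns.
    children = [{k: v for k, v in row.items() if k not in moved} for row in rows]
    return list(parents.values()), children


# Hypothetical transaction rows covering two sessions.
transactions = [
    {"transaction_id": 1, "session_id": 1, "device": "desktop",
     "transaction_time": datetime(2014, 1, 1, 0, 1, 5)},
    {"transaction_id": 2, "session_id": 1, "device": "desktop",
     "transaction_time": datetime(2014, 1, 1, 0, 0, 0)},
    {"transaction_id": 3, "session_id": 2, "device": "mobile",
     "transaction_time": datetime(2014, 1, 1, 0, 17, 20)},
]

sessions, slim_transactions = normalize_rows(
    transactions,
    index="session_id",
    additional_variables=["device"],
    base_time_index="transaction_time",
)
print(sessions)
print(slim_transactions)
```

The earliest-timestamp field plays the role that the automatically created time index plays when make_time_index is not set.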
To finish preparing this dataset, create a “customers” entity using the same method call.
In [22]: es = es.normalize_entity(base_entity_id="sessions",
   ....:                          new_entity_id="customers",
   ....:                          index="customer_id",
   ....:                          make_time_index="join_date",
   ....:                          additional_variables=["zip_code", "join_date"])
   ....:

In [23]: es
Out[23]:
Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 3]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let’s build a feature matrix for each product in our dataset.
In [24]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="products")
   ....:

In [25]: feature_matrix
Out[25]:
brand COUNT(transactions) MAX(transactions.amount) MEAN(transactions.amount) MIN(transactions.amount) MODE(transactions.session_id) NUM_UNIQUE(transactions.session_id) SKEW(transactions.amount) STD(transactions.amount) SUM(transactions.amount) MODE(transactions.DAY(date_of_birth)) MODE(transactions.DAY(transaction_time)) MODE(transactions.MONTH(date_of_birth)) MODE(transactions.MONTH(transaction_time)) MODE(transactions.WEEKDAY(date_of_birth)) MODE(transactions.WEEKDAY(transaction_time)) MODE(transactions.YEAR(date_of_birth)) MODE(transactions.YEAR(transaction_time)) MODE(transactions.sessions.customer_id) MODE(transactions.sessions.device) NUM_UNIQUE(transactions.DAY(date_of_birth)) NUM_UNIQUE(transactions.DAY(transaction_time)) NUM_UNIQUE(transactions.MONTH(date_of_birth)) NUM_UNIQUE(transactions.MONTH(transaction_time)) NUM_UNIQUE(transactions.WEEKDAY(date_of_birth)) NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) NUM_UNIQUE(transactions.YEAR(date_of_birth)) NUM_UNIQUE(transactions.YEAR(transaction_time)) NUM_UNIQUE(transactions.sessions.customer_id) NUM_UNIQUE(transactions.sessions.device)
product_id
1 B 102 149.56 73.429314 6.84 3 34 0.125525 42.479989 7489.79 18 1 7 1 0 2 1994 2014 1 desktop 4 1 3 1 4 1 5 1 5 3
2 B 92 149.95 76.319891 5.73 28 34 0.151934 46.336308 7021.43 18 1 8 1 0 2 2006 2014 4 desktop 4 1 3 1 4 1 5 1 5 3
3 B 96 148.31 73.001250 5.89 1 35 0.223938 38.871405 7008.12 18 1 8 1 0 2 2006 2014 4 desktop 4 1 3 1 4 1 5 1 5 3
4 B 106 146.46 76.311038 5.81 29 34 -0.132077 42.492501 8088.97 18 1 7 1 0 2 1994 2014 1 desktop 4 1 3 1 4 1 5 1 5 3
5 A 104 149.02 76.264904 5.91 4 34 0.098248 42.131902 7931.55 18 1 7 1 0 2 1994 2014 1 mobile 4 1 3 1 4 1 5 1 5 3
As we can see, the features from DFS use the relational structure of our entity set. Therefore it is important to think carefully about the entities that we create.
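To make the link between relationships and features concrete, the sketch below computes by hand two aggregations of the kind that appear in the feature matrix, COUNT(transactions) and MEAN(transactions.amount), by grouping child rows under their parent key. It is plain Python with hypothetical data and a hypothetical helper, aggregate_by_parent, and does not reflect how DFS is implemented.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_parent(child_rows, key, value):
    """Group child rows by the parent key and aggregate one numeric column."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[value])
    return {
        parent_id: {"COUNT": len(values), "MEAN": mean(values)}
        for parent_id, values in groups.items()
    }


# Hypothetical transactions, each pointing at a product through product_id.
transactions = [
    {"transaction_id": 1, "product_id": 1, "amount": 10.0},
    {"transaction_id": 2, "product_id": 1, "amount": 20.0},
    {"transaction_id": 3, "product_id": 2, "amount": 5.0},
]

print(aggregate_by_parent(transactions, key="product_id", value="amount"))
```

Without the relationship between products and transactions there would be no key to group on, which is why the choice of entities and relationships determines which features can be built.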
EntitySets can also be created using Dask dataframes or Koalas dataframes. For more information refer to Using Dask EntitySets (BETA) and Using Koalas EntitySets (BETA).