NOTICE
The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:
pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip
For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.
Featuretools version 1.0 incorporates many significant changes that impact the way EntitySets are created, how primitives are defined, and in some cases the resulting feature matrix that is created. This document will provide an overview of the significant changes, helping existing Featuretools users transition to version 1.0.
The lack of a unified type system across libraries makes sharing information between libraries more difficult. This problem led to the development of Woodwork. Updating Featuretools to use Woodwork for managing column typing information enables easy sharing of feature matrix column types with other libraries without costly conversions between custom type systems. As an example, EvalML, which has also adopted Woodwork, can now use Woodwork typing information on a feature matrix directly to create machine learning models, without first inferring or redefining column types.
Other benefits of using Woodwork for managing typing in Featuretools include:
Simplified code - custom type management code has been removed
Seamless integration of new types and improvements to type integration as Woodwork improves
Easy and flexible storage of additional information about columns. For example, we can now store whether a feature was engineered by Featuretools or present in the original data.
The legacy Featuretools custom typing system has been replaced with Woodwork for managing column types
Both the Entity and Variable classes have been removed from Featuretools
Several key Featuretools methods have been moved or updated
| Featuretools < 1.0 | Featuretools 1.0 | Description |
| --- | --- | --- |
| Entity | Woodwork DataFrame | stores typing information for all columns |
| Variable | ColumnSchema | stores typing information for a single column |
| Variable subclass | LogicalType and semantic_tags | elements used to define a column type |
The table below outlines the most significant method changes. In some cases the method arguments have also changed; those changes are described in more detail throughout this document.
| Older Versions | Featuretools 1.0 |
| --- | --- |
| EntitySet.entity_from_dataframe | EntitySet.add_dataframe |
| EntitySet.normalize_entity | EntitySet.normalize_dataframe |
| EntitySet.update_data | EntitySet.replace_dataframe |
| Entity.variable_types | es['dataframe_name'].ww |
| es['entity_id']['variable_name'] | es['dataframe_name'].ww.columns['column_name'] |
| Entity.convert_variable_type | es['dataframe_name'].ww.set_types |
| Entity.add_interesting_values | es.add_interesting_values(dataframe_name='df_name', …) |
| Entity.set_secondary_time_index | es.set_secondary_time_index(dataframe_name='df_name', …) |
| Feature(es['entity_id']['variable_name']) | Feature(es['dataframe_name'].ww['column_name']) |
| dfs(target_entity='entity_id', …) | dfs(target_dataframe_name='dataframe_name', …) |
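When migrating a larger codebase, it can help to locate legacy calls before rewriting them. The sketch below is one possible approach, not part of Featuretools: it scans source text for a subset of the old names from the table above and reports the suggested replacement. The names LEGACY_TO_NEW and find_legacy_calls are hypothetical, and the plain-text matching may flag unrelated identifiers that happen to share a name.

```python
import re

# Legacy name -> Featuretools 1.0 replacement (a subset of the table above).
# This mapping and the helper below are illustrative, not a Featuretools API.
LEGACY_TO_NEW = {
    "entity_from_dataframe": "EntitySet.add_dataframe",
    "normalize_entity": "EntitySet.normalize_dataframe",
    "update_data": "EntitySet.replace_dataframe",
    "variable_types": "es['dataframe_name'].ww",
    "convert_variable_type": "es['dataframe_name'].ww.set_types",
    "target_entity": "target_dataframe_name",
}

# One pattern that matches any legacy name as a whole identifier.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, LEGACY_TO_NEW)) + r")\b")


def find_legacy_calls(source):
    """Return (line_number, legacy_name, suggestion) for each match in source."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            name = match.group(1)
            findings.append((line_number, name, LEGACY_TO_NEW[name]))
    return findings


sample = (
    "es.entity_from_dataframe(dataframe=df, entity_id='orders', index='id')\n"
    "fm = ft.dfs(entityset=es, target_entity='orders')\n"
)
findings = find_legacy_calls(sample)
for line_number, name, suggestion in findings:
    print(f"line {line_number}: {name} -> {suggestion}")
```

Each finding still needs a manual review, because several replacements also change argument names.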
For more information on how Woodwork manages typing information, refer to the Woodwork Understanding Types and Tags guide.
Removing these classes required moving several methods from the Entity to the EntitySet object. This change also impacts the way relationships, features and primitives are defined, requiring different parameters than were previously required. Also, because the Woodwork typing system is not identical to the old Featuretools typing system, in some cases the feature matrix that is returned can be slightly different as a result of columns being identified as different types.
All of these changes, and more, will be reviewed in detail throughout this document, providing examples of both the old and new API where possible.
In previous versions of Featuretools an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are now created by adding multiple dataframes and defining relationships between columns in the dataframes. While conceptually similar, there are some minor differences in the process.
When adding dataframes to an EntitySet, users can pass in a Woodwork dataframe or a regular dataframe without Woodwork typing information. As before, Featuretools supports creating EntitySets from pandas, Dask and Koalas dataframes. If users supply a dataframe that has Woodwork typing information initialized, Featuretools will simply use this typing information directly. If users supply a dataframe without Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified.
Below are some examples to illustrate this process. First we will create two small dataframes to use for the example.
import featuretools as ft
import pandas as pd
import woodwork as ww
orders_df = pd.DataFrame({
    'order_id': [0, 1, 2],
    'order_date': ['2021-01-02', '2021-01-03', '2021-01-04']
})
items_df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'order_id': [0, 1, 1, 2, 2],
    'item_price': [29.95, 4.99, 10.25, 20.50, 15.99],
    'on_sale': [False, True, False, True, False]
})
With older versions of Featuretools, users would first create an EntitySet object and then add dataframes to it by calling entity_from_dataframe, as shown below.
es = ft.EntitySet('old_es')
es.entity_from_dataframe(dataframe=orders_df, entity_id='orders', index='order_id', time_index='order_date')
es.entity_from_dataframe(dataframe=items_df, entity_id='items', index='id')

Entityset: old_es
  Entities:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 3]
  Relationships:
    No relationships
With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same, but some of the details have changed. First, create an EntitySet as before. To add the dataframe call EntitySet.add_dataframe in place of the previous EntitySet.entity_from_dataframe call. Note that the name of the dataframe is specified in the dataframe_name argument, which was previously called entity_id.
es = ft.EntitySet('new_es')
es.add_dataframe(dataframe=orders_df, dataframe_name='orders', index='order_id', time_index='order_date')

Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
  Relationships:
    No relationships
You can also define the name, index, and time index by first initializing Woodwork on the dataframe and then passing the Woodwork initialized dataframe directly to the add_dataframe call. For this example we will initialize Woodwork on items_df, setting the dataframe name as items and specifying that the index should be the id column.
items_df.ww.init(name='items', index='id')
items_df.ww
With Woodwork initialized, we no longer need to specify values for the dataframe_name or index arguments when calling add_dataframe as Featuretools will simply use the values that were already specified when Woodwork was initialized.
es.add_dataframe(dataframe=items_df)

Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 4]
  Relationships:
    No relationships
Previously, column variable type information could be accessed for an entire Entity through Entity.variable_types or for an individual column by selecting the individual column first through es['entity_id']['col_id'].
es['items'].variable_types
{'id': featuretools.variable_types.variable.Index,
 'order_id': featuretools.variable_types.variable.Numeric,
 'item_price': featuretools.variable_types.variable.Numeric}
es['items']['item_price']
<Variable: item_price (dtype = numeric)>
With the updated version of Featuretools, the logical types and semantic tags for all of the columns in a single dataframe can be viewed through the .ww namespace on the dataframe. First, select the dataframe from the EntitySet with es['dataframe_name'] and then access the typing information by chaining a .ww call on the end as shown below.
es['items'].ww
The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, returning a Woodwork.ColumnSchema object that stores the typing information:
es['items'].ww.columns['item_price']
<ColumnSchema (Logical Type = Double) (Semantic Tags = ['numeric'])>
Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine what variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the Entity.variable_types dictionary.
Starting in Featuretools 1.0, column type inference is being handled by Woodwork. Any columns that do not have a logical type assigned by the user when adding a dataframe to an EntitySet will have their logical types inferred by Woodwork. As before, type inference can be skipped for any columns in a dataframe by passing the appropriate logical types in a dictionary when calling EntitySet.add_dataframe.
As an example, we can create a new dataframe and add it to an EntitySet, specifying the logical type for the user’s full name as the Woodwork PersonFullName logical type.
users_df = pd.DataFrame({
    'id': [0, 1, 2],
    'name': ['John Doe', 'Rita Book', 'Teri Dactyl']
})
es.add_dataframe(dataframe=users_df, dataframe_name='users', index='id', logical_types={'name': 'PersonFullName'})
es['users'].ww
Looking at the typing information above, we can see that the logical type for the name column was set to PersonFullName as we specified.
Situations will occur where type inference assigns an incorrect logical type to a column. In these situations, the logical type can be updated using the Woodwork set_types method. Let's say we want the order_id column of the items dataframe to have a Categorical logical type instead of the Integer type that was inferred. Previously, this would have been accomplished through the Entity.convert_variable_type method.
from featuretools.variable_types import Categorical
es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)
Now, we can perform this same update using Woodwork:
es['items'].ww.set_types(logical_types={'order_id': 'Categorical'})
es['items'].ww
For additional information on Woodwork typing and how it is used in Featuretools, refer to Woodwork Typing in Featuretools.
Interesting values can be added to all dataframes in an EntitySet, a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet.
To add interesting values for all of the dataframes in an EntitySet, simply call EntitySet.add_interesting_values, optionally specifying the maximum number of values to add for each column. This remains unchanged from older versions of Featuretools to the 1.0 release.
Adding values for a single dataframe or for a single column has changed. Previously, to add interesting values for an Entity, users would call Entity.add_interesting_values():
es['items'].add_interesting_values()
Now, in order to specify interesting values for a single dataframe, you call add_interesting_values on the EntitySet, and pass the name of the dataframe for which you want interesting values added:
es.add_interesting_values(dataframe_name='items')
Previously, to manually add interesting values for a column, you would simply assign them to the attribute of the variable:
es['items']['order_id'].interesting_values = [1, 2]
Now, this is done through EntitySet.add_interesting_values, passing in the name of the dataframe and a dictionary mapping column names to the interesting values to assign for that column. For example, to assign the interesting values of [1, 2] to the order_id column of the items dataframe, use the following approach:
es.add_interesting_values(dataframe_name='items', values={'order_id': [1, 2]})
Interesting values for multiple columns in the same dataframe can be assigned by adding more entries to the dictionary passed to the values parameter.
Accessing interesting values has changed as well. Previously, interesting values could be viewed from the variable:
es['items']['order_id'].interesting_values
Interesting values are now stored in the Woodwork metadata for the columns in a dataframe:
es['items'].ww.columns['order_id'].metadata['interesting_values']
In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling Entity.set_secondary_time_index.
es_flight = ft.demo.load_flight(nrows=100)
arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',
                    'national_airspace_delay', 'security_delay',
                    'late_aircraft_delay', 'canceled', 'diverted',
                    'taxi_in', 'taxi_out', 'air_time', 'dep_time']
es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})
Since the Entity class has been removed in Featuretools 1.0, this now needs to be done through the EntitySet instead:
es_flight = ft.demo.load_flight(nrows=100)
arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',
                    'national_airspace_delay', 'security_delay',
                    'late_aircraft_delay', 'canceled', 'diverted',
                    'taxi_in', 'taxi_out', 'air_time', 'dep_time']
es_flight.set_secondary_time_index(dataframe_name='trip_logs',
                                   secondary_time_index={'arr_time': arr_time_columns})
Downloading data ...
Previously, the secondary time index could be accessed directly from the Entity with es_flight['trip_logs'].secondary_time_index. Starting in Featuretools 1.0 the secondary time index and the associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below.
es_flight['trip_logs'].ww.metadata['secondary_time_index']
{'arr_time': ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay', 'national_airspace_delay', 'security_delay', 'late_aircraft_delay', 'canceled', 'diverted', 'taxi_in', 'taxi_out', 'air_time', 'dep_time', 'arr_time']}
EntitySet.normalize_entity has been renamed to EntitySet.normalize_dataframe in Featuretools 1.0. The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. When calling this method, the new parameter names need to be used.
| Old Parameter Name | New Parameter Name |
| --- | --- |
| base_entity_id | base_dataframe_name |
| new_entity_id | new_dataframe_name |
| additional_variables | additional_columns |
| copy_variables | copy_columns |
| new_entity_time_index | new_dataframe_time_index |
| new_entity_secondary_time_index | new_dataframe_secondary_time_index |
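If existing code builds the normalize_entity arguments as a dictionary, the renames above can be applied mechanically. The following is a minimal sketch under that assumption. The helper rename_normalize_kwargs and the example values (such as the products dataframe and the product_id column) are hypothetical and only illustrate the parameter mapping from the table.

```python
# Old -> new keyword names for normalize_dataframe, copied from the table above.
NORMALIZE_RENAMES = {
    "base_entity_id": "base_dataframe_name",
    "new_entity_id": "new_dataframe_name",
    "additional_variables": "additional_columns",
    "copy_variables": "copy_columns",
    "new_entity_time_index": "new_dataframe_time_index",
    "new_entity_secondary_time_index": "new_dataframe_secondary_time_index",
}


def rename_normalize_kwargs(legacy_kwargs):
    """Return a copy of legacy_kwargs with old parameter names replaced."""
    renamed = {}
    for key, value in legacy_kwargs.items():
        new_key = NORMALIZE_RENAMES.get(key, key)
        if new_key in renamed:
            # Guard against passing both the old and the new spelling.
            raise ValueError(f"both old and new names given for {new_key!r}")
        renamed[new_key] = value
    return renamed


# Hypothetical legacy arguments, written as they would have been passed
# to EntitySet.normalize_entity.
legacy_call = {
    "base_entity_id": "items",
    "new_entity_id": "products",
    "index": "product_id",
    "additional_variables": ["product_name"],
}
new_call = rename_normalize_kwargs(legacy_call)
print(new_call)
```

The resulting dictionary could then be unpacked into the new method, for example es.normalize_dataframe(**new_call). Parameters that were not renamed pass through unchanged.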
In earlier versions of Featuretools, relationships were defined by creating a Relationship object, which took two Variables as inputs. To define a relationship between the orders Entity and the items Entity, we would first create a Relationship and then add it to the EntitySet:
relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])
es.add_relationship(relationship)
With Featuretools 1.0, the process is similar, but there are two different ways to add the relationship to the EntitySet. One way is to pass the dataframe and column names to EntitySet.add_relationship, and another is to pass a previously created Relationship object to the relationship keyword argument. Both approaches are demonstrated below.
# Undo the change from above so the child column logical type matches the parent and no warning is raised
es['items'].ww.set_types(logical_types={'order_id': 'Integer'})
es.add_relationship(parent_dataframe_name='orders',
                    parent_column_name='order_id',
                    child_dataframe_name='items',
                    child_column_name='order_id')

Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 4]
    users [Rows: 3, Columns: 2]
  Relationships:
    items.order_id -> orders.order_id
# Reset the relationships so we can add the relationship again
es.relationships = []
Alternatively, we can first create a Relationship and pass that to EntitySet.add_relationship. When defining a Relationship we need to pass in the EntitySet to which it belongs along with the names for the parent dataframe and parent column and the name of the child dataframe and child column.
relationship = ft.Relationship(entityset=es,
                               parent_dataframe_name='orders',
                               parent_column_name='order_id',
                               child_dataframe_name='items',
                               child_column_name='order_id')
es.add_relationship(relationship=relationship)
Previously, to update (replace) the data associated with an Entity, users could call Entity.update_data and pass in the new dataframe. As an example, let's update the data in our users Entity:
new_users_df = pd.DataFrame({
    'id': [3, 4],
    'name': ['Anne Teak', 'Art Decco']
})
es['users'].update_data(df=new_users_df)
To accomplish this task with Featuretools 1.0, we will use the EntitySet.replace_dataframe method instead:
new_users_df = pd.DataFrame({
    'id': [0, 1],
    'name': ['Anne Teak', 'Art Decco']
})
es.replace_dataframe(dataframe_name='users', df=new_users_df)
es['users']
The syntax for defining features has changed slightly in Featuretools 1.0. Previously, identity features could be defined simply by passing in the variable that should be used to build the feature.
feature = ft.Feature(es['items']['item_price'])
Starting with Featuretools 1.0, a similar syntax can be used, but because es['items'] will now return a Woodwork dataframe instead of an Entity, we need to update the syntax slightly to access the Woodwork column. To update, simply add .ww between the dataframe name selector and the column selector as shown below.
feature = ft.Feature(es['items'].ww['item_price'])
In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate Variable class. Starting in version 1.0, the input and return types are defined by Woodwork ColumnSchema objects.
To illustrate this change, let’s look closer at the Age transform primitive. This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person’s age. In previous versions of Featuretools, the input type was defined by specifying the DateOfBirth variable type and the return type was specified by the Numeric variable type:
input_types = [DateOfBirth]
return_type = Numeric
Woodwork does not have a specific DateOfBirth logical type, but rather identifies a column as a date of birth column by specifying the logical type as Datetime with a semantic tag of date_of_birth. There is also no Numeric logical type in Woodwork, but rather Woodwork identifies all columns that can be used for numeric operations with the semantic tag of numeric. Furthermore, we know the Age primitive will return a floating point number, which would correspond to a Woodwork logical type of Double. With these items in mind, we can redefine the Age input types and return types with ColumnSchema objects as follows:
input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]
return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})
Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged.
Types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. While there is not a direct mapping from the old variable types to the new Woodwork types defined by ColumnSchema objects, the approximate mapping is shown below.
| Featuretools Variable | Woodwork Column Schema |
| --- | --- |
| Boolean | ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable) |
| Categorical | ColumnSchema(logical_type=Categorical) |
| CountryCode | ColumnSchema(logical_type=CountryCode) |
| Datetime | ColumnSchema(logical_type=Datetime) |
| DateOfBirth | ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'}) |
| DatetimeTimeIndex | ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'}) |
| Discrete | ColumnSchema(semantic_tags={'category'}) |
| EmailAddress | ColumnSchema(logical_type=EmailAddress) |
| FilePath | ColumnSchema(logical_type=Filepath) |
| FullName | ColumnSchema(logical_type=PersonFullName) |
| Id | ColumnSchema(semantic_tags={'foreign_key'}) |
| Index | ColumnSchema(semantic_tags={'index'}) |
| IPAddress | ColumnSchema(logical_type=IPAddress) |
| LatLong | ColumnSchema(logical_type=LatLong) |
| NaturalLanguage | ColumnSchema(logical_type=NaturalLanguage) |
| Numeric | ColumnSchema(semantic_tags={'numeric'}) |
| NumericTimeIndex | ColumnSchema(semantic_tags={'numeric', 'time_index'}) |
| Ordinal | ColumnSchema(logical_type=Ordinal) |
| PhoneNumber | ColumnSchema(logical_type=PhoneNumber) |
| SubRegionCode | ColumnSchema(logical_type=SubRegionCode) |
| Timedelta | ColumnSchema(logical_type=Timedelta) |
| TimeIndex | ColumnSchema(semantic_tags={'time_index'}) |
| URL | ColumnSchema(logical_type=URL) |
| Unknown | ColumnSchema(logical_type=Unknown) |
| ZIPCode | ColumnSchema(logical_type=PostalCode) |
The argument names for both featuretools.dfs and featuretools.calculate_feature_matrix have changed slightly in Featuretools 1.0. In prior versions, users could generate a list of features using the default primitives and options like this:
features = ft.dfs(entityset=es, target_entity='items', features_only=True)
In Featuretools 1.0, the target_entity argument has been renamed to target_dataframe_name, but otherwise this basic call remains the same.
features = ft.dfs(entityset=es, target_dataframe_name='items', features_only=True)
features

[<Feature: order_id>,
 <Feature: item_price>,
 <Feature: on_sale>,
 <Feature: orders.COUNT(items)>,
 <Feature: orders.MAX(items.item_price)>,
 <Feature: orders.MEAN(items.item_price)>,
 <Feature: orders.MIN(items.item_price)>,
 <Feature: orders.SKEW(items.item_price)>,
 <Feature: orders.STD(items.item_price)>,
 <Feature: orders.SUM(items.item_price)>,
 <Feature: orders.DAY(order_date)>,
 <Feature: orders.MONTH(order_date)>,
 <Feature: orders.WEEKDAY(order_date)>,
 <Feature: orders.YEAR(order_date)>]
In addition, the dfs argument ignore_entities was renamed to ignore_dataframes and ignore_variables was renamed to ignore_columns. Similarly, if specifying primitive options, all references to entities should be replaced with dataframes and references to variables should be replaced with columns. For example, the primitive option of include_groupby_entities is now include_groupby_dataframes and include_variables is now include_columns.
The basic call to featuretools.calculate_feature_matrix remains unchanged if passing in an EntitySet along with a list of features to calculate. However, users calling calculate_feature_matrix by passing in entities and relationships should note that the entities argument has been renamed to dataframes, and the values in that dictionary should now include Woodwork logical types instead of Featuretools Variable classes.
feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)
feature_matrix
In addition to the changes in argument names, there are a couple other changes to the returned feature matrix that users should be aware of. First, because of slight differences in the way Woodwork defines column types compared to how the prior Featuretools implementation did, there can be some differences in the features that are generated between old and new versions. The most notable impact is in the way foreign key columns are handled. Previously, Featuretools treated all foreign key (previously Id) columns as categorical columns, and would generate appropriate features from these columns. Starting in version 1.0, foreign key columns are not constrained to be categorical, and if they are another type such as Integer, features will not be generated from these columns. Manually converting foreign key columns to Categorical as shown above will result in features much closer to those achieved with previous versions.
Also, because Woodwork’s type inference process differs from the previous Featuretools type inference process, an EntitySet may have column types identified differently. This difference in column types could impact the features that are generated. If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types if there are columns that have been inferred as unexpected types.
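One way to carry out this check is to write down the expected logical type for each column and compare it with what was inferred. The sketch below assumes both sides are available as plain dictionaries mapping column names to logical type names. The helper find_type_mismatches and the hand-written type names are hypothetical; in practice the inferred types would be read from the Woodwork typing information on each dataframe.

```python
def find_type_mismatches(expected, actual):
    """Map each column whose type differs to an (expected, actual) pair.

    Columns missing from actual are reported with None as the actual type.
    """
    mismatches = {}
    for column, expected_type in expected.items():
        actual_type = actual.get(column)
        if actual_type != expected_type:
            mismatches[column] = (expected_type, actual_type)
    return mismatches


# Type names written out by hand for illustration only.
expected_types = {"id": "Integer", "order_id": "Categorical", "item_price": "Double"}
actual_types = {"id": "Integer", "order_id": "Integer", "item_price": "Double"}

mismatches = find_type_mismatches(expected_types, actual_types)
print(mismatches)
```

Any column reported here could then be corrected with the set_types call shown earlier in this document.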
Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace as follows.
feature_matrix.ww
Featuretools now labels features by whether they were originally in the dataframes, or whether they were created by Featuretools. This information is stored in the Woodwork origin attribute for the column. Columns that were in the original data will be labeled with base and features that were created by Featuretools will be labeled with engineered.
As a demonstration of how to access this information, let’s compare two features in the feature matrix: item_price and orders.MEAN(items.item_price). item_price was present in the original data, and orders.MEAN(items.item_price) was created by Featuretools.
feature_matrix.ww['item_price'].ww.origin
'base'
feature_matrix.ww['orders.MEAN(items.item_price)'].ww.origin
'engineered'
In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 of which existing users should be aware.
Column ordering of a dataframe in an EntitySet might be different than it was before. Previously, Featuretools would reorder the columns such that the index column would always be the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column in the dataframe. Now the index column will remain in the position it was in when the dataframe was added to the EntitySet.
For LatLong columns, older versions of Featuretools would replace single nan values in the columns with a tuple (nan, nan). This is no longer the case, and single nan values will now remain in the LatLong column. Based on the behavior in Woodwork, any values of (nan, nan) in a LatLong column will be replaced with a single nan value.
Since Featuretools no longer defines Variable objects with relationships between them, the featuretools.variable_types.graph_variable_types function has been removed.
The featuretools.variable_types.list_variable_types utility function has been deprecated and replaced with two corresponding Woodwork functions: woodwork.list_logical_types and woodwork.list_semantic_tags. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information on the logical types and semantic tags that can be applied to dataframe columns.
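To make the LatLong change described above more concrete, the following sketch mimics the rule in plain Python: a (nan, nan) tuple is collapsed to a single nan, while a single nan and ordinary coordinate pairs are left untouched. This is an illustration of the described behavior only, not Woodwork's implementation, and the helper normalize_latlong is a hypothetical name.

```python
import math


def _is_nan(value):
    """True only for float NaN; other objects are never treated as NaN."""
    return isinstance(value, float) and math.isnan(value)


def normalize_latlong(value):
    """Collapse a (nan, nan) pair into a single nan; leave other values alone."""
    if isinstance(value, tuple) and len(value) == 2 and all(_is_nan(v) for v in value):
        return float("nan")
    return value


# A valid pair, a (nan, nan) pair, and a single nan.
column = [(40.7, -74.0), (float("nan"), float("nan")), float("nan")]
normalized = [normalize_latlong(v) for v in column]
print(normalized)
```

Code that previously tested LatLong values against a (nan, nan) tuple may therefore need to test for a single nan instead.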