Transitioning to Featuretools Version 1.0¶
Featuretools version 1.0 incorporates many significant changes that impact the way EntitySets are created, how primitives are defined, and in some cases the resulting feature matrix that is created. This document will provide an overview of the significant changes, helping existing Featuretools users transition to version 1.0.
Background and Introduction¶
Why make these changes?¶
The lack of a unified type system across libraries makes sharing information between libraries more difficult. This problem led to the development of Woodwork. Updating Featuretools to use Woodwork for managing column typing information enables easy sharing of feature matrix column types with other libraries without costly conversions between custom type systems. As an example, EvalML, which has also adopted Woodwork, can now use Woodwork typing information on a feature matrix directly to create machine learning models, without first inferring or redefining column types.
Other benefits of using Woodwork for managing typing in Featuretools include:
Simplified code - custom type management code has been removed
Seamless integration of new types and improvements to type integration as Woodwork improves
Easy and flexible storage of additional information about columns. For example, we can now store whether a feature was engineered by Featuretools or present in the original data.
What has changed?¶
The legacy Featuretools custom typing system has been replaced with Woodwork for managing column types
Both the Entity and Variable classes have been removed from Featuretools
Several key Featuretools methods have been moved or updated
Comparison between legacy typing system and Woodwork typing systems¶
Featuretools < 1.0 | Featuretools 1.0 | Description |
---|---|---|
Entity | Woodwork DataFrame | stores typing information for all columns |
Variable | ColumnSchema | stores typing information for a single column |
Variable subclass | LogicalType and semantic_tags | elements used to define a column type |
Summary of significant method changes¶
The table below outlines the most significant method changes. In some cases the method arguments have also changed; those changes are described in more detail throughout this document.
Older Versions | Featuretools 1.0 |
---|---|
EntitySet.entity_from_dataframe | EntitySet.add_dataframe |
EntitySet.normalize_entity | EntitySet.normalize_dataframe |
Entity.update_data | EntitySet.replace_dataframe |
Entity.variable_types | es['dataframe_name'].ww |
es['entity_id']['variable_name'] | es['dataframe_name'].ww.columns['column_name'] |
Entity.convert_variable_type | es['dataframe_name'].ww.set_types |
Entity.add_interesting_values | es.add_interesting_values(dataframe_name='df_name', …) |
Entity.set_secondary_time_index | es.set_secondary_time_index(dataframe_name='df_name', …) |
Feature(es['entity_id']['variable_name']) | Feature(es['dataframe_name'].ww['column_name']) |
dfs(target_entity='entity_id', …) | dfs(target_dataframe_name='dataframe_name', …) |
For more information on how Woodwork manages typing information, refer to the Woodwork Understanding Types and Tags guide.
What do these changes mean for users?¶
Removing these classes required moving several methods from the Entity to the EntitySet object. This change also affects the way relationships, features, and primitives are defined, which now take different parameters than before. Also, because the Woodwork typing system is not identical to the old Featuretools typing system, the returned feature matrix can in some cases differ slightly because columns are identified as different types.
All of these changes, and more, will be reviewed in detail throughout this document, providing examples of both the old and new API where possible.
Removal of Entity Class and Updates to EntitySet¶
In previous versions of Featuretools an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are now created by adding multiple dataframes and defining relationships between columns in the dataframes. While conceptually similar, there are some minor differences in the process.
Adding dataframes to an EntitySet¶
When adding dataframes to an EntitySet, users can pass in a Woodwork dataframe or a regular dataframe without Woodwork typing information. As before, Featuretools supports creating EntitySets from pandas, Dask and Koalas dataframes. If users supply a dataframe that has Woodwork typing information initialized, Featuretools will simply use this typing information directly. If users supply a dataframe without Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified.
Below are some examples to illustrate this process. First we will create two small dataframes to use for the example.
[1]:
import featuretools as ft
import pandas as pd
import woodwork as ww
[2]:
orders_df = pd.DataFrame({
'order_id': [0, 1, 2],
'order_date': ['2021-01-02', '2021-01-03', '2021-01-04']
})
items_df = pd.DataFrame({
'id': [0, 1, 2, 3, 4],
'order_id': [0, 1, 1, 2, 2],
'item_price': [29.95, 4.99, 10.25, 20.50, 15.99],
'on_sale': [False, True, False, True, False]
})
With older versions of Featuretools, users would first create an EntitySet object and then add dataframes to the EntitySet by calling entity_from_dataframe as shown below.
es = ft.EntitySet('old_es')
es.entity_from_dataframe(dataframe=orders_df,
entity_id='orders',
index='order_id',
time_index='order_date')
es.entity_from_dataframe(dataframe=items_df,
entity_id='items',
index='id')
Entityset: old_es
Entities:
orders [Rows: 3, Columns: 2]
items [Rows: 5, Columns: 3]
Relationships:
No relationships
With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same, but some of the details have changed. First, create an EntitySet as before. To add the dataframe, call EntitySet.add_dataframe in place of the previous EntitySet.entity_from_dataframe call. Note that the name of the dataframe is specified in the dataframe_name argument, which was previously called entity_id.
[3]:
es = ft.EntitySet('new_es')
es.add_dataframe(dataframe=orders_df,
dataframe_name='orders',
index='order_id',
time_index='order_date')
[3]:
Entityset: new_es
DataFrames:
orders [Rows: 3, Columns: 2]
Relationships:
No relationships
You can also define the name, index, and time index by first initializing Woodwork on the dataframe and then passing the Woodwork-initialized dataframe directly to the add_dataframe call. For this example we will initialize Woodwork on items_df, setting the dataframe name to items and specifying that the index should be the id column.
[4]:
items_df.ww.init(name='items', index='id')
items_df.ww
[4]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
id | int64 | Integer | ['index'] |
order_id | int64 | Integer | ['numeric'] |
item_price | float64 | Double | ['numeric'] |
on_sale | bool | Boolean | [] |
With Woodwork initialized, we no longer need to specify values for the dataframe_name or index arguments when calling add_dataframe, as Featuretools will simply use the values that were already specified when Woodwork was initialized.
[5]:
es.add_dataframe(dataframe=items_df)
[5]:
Entityset: new_es
DataFrames:
orders [Rows: 3, Columns: 2]
items [Rows: 5, Columns: 4]
Relationships:
No relationships
Accessing column typing information¶
Previously, variable type information could be accessed for an entire Entity through Entity.variable_types or for an individual column by selecting the column first through es['entity_id']['col_id'].
es['items'].variable_types
{'id': featuretools.variable_types.variable.Index,
'order_id': featuretools.variable_types.variable.Numeric,
'item_price': featuretools.variable_types.variable.Numeric}
es['items']['item_price']
<Variable: item_price (dtype = numeric)>
With the updated version of Featuretools, the logical types and semantic tags for all of the columns in a single dataframe can be viewed through the .ww namespace on the dataframe. First, select the dataframe from the EntitySet with es['dataframe_name'] and then access the typing information by chaining a .ww call on the end as shown below.
[6]:
es['items'].ww
[6]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
id | int64 | Integer | ['index'] |
order_id | int64 | Integer | ['numeric'] |
item_price | float64 | Double | ['numeric'] |
on_sale | bool | Boolean | [] |
The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, which returns a Woodwork ColumnSchema object that stores the typing information:
[7]:
es['items'].ww.columns['item_price']
[7]:
<ColumnSchema (Logical Type = Double) (Semantic Tags = ['numeric'])>
Type inference and updating column types¶
Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine what variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the Entity.variable_types dictionary.
Starting in Featuretools 1.0, column type inference is handled by Woodwork. Any columns that do not have a logical type assigned by the user when adding a dataframe to an EntitySet will have their logical types inferred by Woodwork. As before, type inference can be skipped for any columns in a dataframe by passing the appropriate logical types in a dictionary when calling EntitySet.add_dataframe.
As an example, we can create a new dataframe and add it to an EntitySet, specifying the logical type for the user’s full name as the Woodwork PersonFullName logical type.
[8]:
users_df = pd.DataFrame({
'id': [0, 1, 2],
'name': ['John Doe', 'Rita Book', 'Teri Dactyl']
})
[9]:
es.add_dataframe(dataframe=users_df,
dataframe_name='users',
index='id',
logical_types={'name': 'PersonFullName'})
es['users'].ww
[9]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
id | int64 | Integer | ['index'] |
name | string | PersonFullName | [] |
Looking at the typing information above, we can see that the logical type for the name column was set to PersonFullName as we specified.
Situations will occur where type inference assigns an incorrect logical type to a column. In these situations, the logical type can be updated using the Woodwork set_types method. Let’s say we want the order_id column of the items dataframe to have a Categorical logical type instead of the Integer type that was inferred. Previously, this would have been accomplished through the Entity.convert_variable_type method.
from featuretools.variable_types import Categorical
es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)
Now, we can perform this same update using Woodwork:
[10]:
es['items'].ww.set_types(logical_types={'order_id': 'Categorical'})
es['items'].ww
[10]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
id | int64 | Integer | ['index'] |
order_id | category | Categorical | ['category'] |
item_price | float64 | Double | ['numeric'] |
on_sale | bool | Boolean | [] |
For additional information on Woodwork typing and how it is used in Featuretools, refer to Woodwork Typing in Featuretools.
Adding interesting values¶
Interesting values can be added to all dataframes in an EntitySet, a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet.
To add interesting values for all of the dataframes in an EntitySet, simply call EntitySet.add_interesting_values, optionally specifying the maximum number of values to add for each column. This remains unchanged from older versions of Featuretools to the 1.0 release.
Adding values for a single dataframe or for a single column has changed. Previously, to add interesting values for an Entity, users would call Entity.add_interesting_values():
es['items'].add_interesting_values()
Now, to specify interesting values for a single dataframe, you call add_interesting_values on the EntitySet and pass the name of the dataframe for which you want interesting values added:
[11]:
es.add_interesting_values(dataframe_name='items')
Previously, to manually add interesting values for a column, you would simply assign them to the attribute of the variable:
es['items']['order_id'].interesting_values = [1, 2]
Now, this is done through EntitySet.add_interesting_values, passing in the name of the dataframe and a dictionary mapping column names to the interesting values to assign for that column. For example, to assign the interesting values of [1, 2] to the order_id column of the items dataframe, use the following approach:
[12]:
es.add_interesting_values(dataframe_name='items',
values={'order_id': [1, 2]})
Interesting values for multiple columns in the same dataframe can be assigned by adding more entries to the dictionary passed to the values parameter.
Accessing interesting values has changed as well. Previously interesting values could be viewed from the variable:
es['items']['order_id'].interesting_values
Interesting values are now stored in the Woodwork metadata for the columns in a dataframe:
[13]:
es['items'].ww.columns['order_id'].metadata['interesting_values']
[13]:
[1, 2]
Setting a secondary time index¶
In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling Entity.set_secondary_time_index.
es_flight = ft.demo.load_flight(nrows=100)
arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',
'national_airspace_delay', 'security_delay',
'late_aircraft_delay', 'canceled', 'diverted',
'taxi_in', 'taxi_out', 'air_time', 'dep_time']
es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})
Since the Entity class has been removed in Featuretools 1.0, this now needs to be done through the EntitySet instead:
[14]:
es_flight = ft.demo.load_flight(nrows=100)
arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',
'national_airspace_delay', 'security_delay',
'late_aircraft_delay', 'canceled', 'diverted',
'taxi_in', 'taxi_out', 'air_time', 'dep_time']
es_flight.set_secondary_time_index(dataframe_name='trip_logs',
secondary_time_index={'arr_time': arr_time_columns})
Downloading data ...
Previously, the secondary time index could be accessed directly from the Entity with es_flight['trip_logs'].secondary_time_index. Starting in Featuretools 1.0, the secondary time index and the associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below.
[15]:
es_flight['trip_logs'].ww.metadata['secondary_time_index']
[15]:
{'arr_time': ['arr_delay',
'dep_delay',
'carrier_delay',
'weather_delay',
'national_airspace_delay',
'security_delay',
'late_aircraft_delay',
'canceled',
'diverted',
'taxi_in',
'taxi_out',
'air_time',
'dep_time',
'arr_time']}
Normalizing Entities/DataFrames¶
EntitySet.normalize_entity has been renamed to EntitySet.normalize_dataframe in Featuretools 1.0. The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. When calling this method, the new parameter names need to be used.
Old Parameter Name | New Parameter Name |
---|---|
base_entity_id | base_dataframe_name |
new_entity_id | new_dataframe_name |
additional_variables | additional_columns |
copy_variables | copy_columns |
new_entity_time_index | new_dataframe_time_index |
new_entity_secondary_time_index | new_dataframe_secondary_time_index |
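When many existing calls need to be migrated, the renames in the table above can be applied mechanically. The sketch below uses a hypothetical helper, translate_normalize_kwargs, which is not part of Featuretools. It assumes the old arguments were passed by keyword, and it leaves any key that is not in the table (index in this example, assumed here to be unchanged) as it is.

```python
# Hypothetical migration helper; not part of the Featuretools API.
# The mapping is copied from the parameter table above.
NORMALIZE_RENAMES = {
    "base_entity_id": "base_dataframe_name",
    "new_entity_id": "new_dataframe_name",
    "additional_variables": "additional_columns",
    "copy_variables": "copy_columns",
    "new_entity_time_index": "new_dataframe_time_index",
    "new_entity_secondary_time_index": "new_dataframe_secondary_time_index",
}


def translate_normalize_kwargs(old_kwargs):
    """Return a copy of old_kwargs with pre-1.0 parameter names replaced."""
    new_kwargs = {}
    for key, value in old_kwargs.items():
        new_key = NORMALIZE_RENAMES.get(key, key)  # unknown keys pass through
        if new_key in new_kwargs:
            raise ValueError(f"duplicate argument after renaming: {new_key}")
        new_kwargs[new_key] = value
    return new_kwargs


# Keyword arguments as they might have been written for normalize_entity
old_call = {
    "base_entity_id": "items",
    "new_entity_id": "sale_status",
    "index": "on_sale",
    "additional_variables": ["item_price"],
}
new_call = translate_normalize_kwargs(old_call)
print(new_call)
```

The translated dictionary could then be unpacked into the new method, for example es.normalize_dataframe(**new_call).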
Defining and adding relationships¶
In earlier versions of Featuretools, relationships were defined by creating a Relationship object, which took two Variables as inputs. To define a relationship between the orders Entity and the items Entity, we would first create a Relationship and then add it to the EntitySet:
relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])
es.add_relationship(relationship)
With Featuretools 1.0, the process is similar, but there are two different ways to add the relationship to the EntitySet. One way is to pass the dataframe and column names to EntitySet.add_relationship, and another is to pass a previously created Relationship object to the relationship keyword argument. Both approaches are demonstrated below.
[16]:
# Undo change from above and change child column logical type to match parent and prevent warning
# NOTE: This cell is hidden in the docs build
es['items'].ww.set_types(logical_types={'order_id': 'Integer'})
[17]:
es.add_relationship(parent_dataframe_name='orders',
parent_column_name='order_id',
child_dataframe_name='items',
child_column_name='order_id')
[17]:
Entityset: new_es
DataFrames:
orders [Rows: 3, Columns: 2]
items [Rows: 5, Columns: 4]
users [Rows: 3, Columns: 2]
Relationships:
items.order_id -> orders.order_id
[18]:
# Reset the relationship so we can add it again
# NOTE: This cell is hidden in the docs build
es.relationships = []
Alternatively, we can first create a Relationship and pass that to EntitySet.add_relationship. When defining a Relationship, we need to pass in the EntitySet to which it belongs along with the names of the parent dataframe and parent column and the names of the child dataframe and child column.
[19]:
relationship = ft.Relationship(entityset=es,
parent_dataframe_name='orders',
parent_column_name='order_id',
child_dataframe_name='items',
child_column_name='order_id')
es.add_relationship(relationship=relationship)
[19]:
Entityset: new_es
DataFrames:
orders [Rows: 3, Columns: 2]
items [Rows: 5, Columns: 4]
users [Rows: 3, Columns: 2]
Relationships:
items.order_id -> orders.order_id
Updating data for a dataframe in an EntitySet¶
Previously, to update (replace) the data associated with an Entity, users could call Entity.update_data and pass in the new dataframe. As an example, let’s update the data in our users Entity:
new_users_df = pd.DataFrame({
'id': [3, 4],
'name': ['Anne Teak', 'Art Decco']
})
es['users'].update_data(df=new_users_df)
To accomplish this task with Featuretools 1.0, we will use the EntitySet.replace_dataframe method instead:
[20]:
new_users_df = pd.DataFrame({
'id': [0, 1],
'name': ['Anne Teak', 'Art Decco']
})
es.replace_dataframe(dataframe_name='users', df=new_users_df)
es['users']
[20]:
id | name | |
---|---|---|
0 | 0 | Anne Teak |
1 | 1 | Art Decco |
Defining features¶
The syntax for defining features has changed slightly in Featuretools 1.0. Previously, identity features could be defined simply by passing in the variable that should be used to build the feature.
feature = ft.Feature(es['items']['item_price'])
Starting with Featuretools 1.0, a similar syntax can be used, but because es['items'] will now return a Woodwork dataframe instead of an Entity, we need to update the syntax slightly to access the Woodwork column. To update, simply add .ww between the dataframe name selector and the column selector as shown below.
[21]:
feature = ft.Feature(es['items'].ww['item_price'])
Defining primitives¶
In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate Variable class. Starting in version 1.0, the input and return types are defined by Woodwork ColumnSchema objects.
To illustrate this change, let’s look closer at the Age transform primitive. This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person’s age. In previous versions of Featuretools, the input type was defined by specifying the DateOfBirth variable type and the return type was specified by the Numeric variable type:
input_types = [DateOfBirth]
return_type = Numeric
Woodwork does not have a specific DateOfBirth logical type, but rather identifies a column as a date of birth column by specifying the logical type as Datetime with a semantic tag of date_of_birth. There is also no Numeric logical type in Woodwork; instead, Woodwork identifies all columns that can be used for numeric operations with the semantic tag of numeric. Furthermore, we know the Age primitive will return a floating point number, which corresponds to a Woodwork logical type of Double. With these items in mind, we can redefine the Age input types and return types with ColumnSchema objects as follows:
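from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double
input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]
return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})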
Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged.
Mapping from old Featuretools variable types to Woodwork ColumnSchemas¶
Types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. While there is not a direct mapping from the old variable types to the new Woodwork types defined by ColumnSchema objects, the approximate mapping is shown below.
Featuretools Variable | Woodwork Column Schema |
---|---|
Boolean | ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable) |
Categorical | ColumnSchema(logical_type=Categorical) |
CountryCode | ColumnSchema(logical_type=CountryCode) |
Datetime | ColumnSchema(logical_type=Datetime) |
DateOfBirth | ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'}) |
DatetimeTimeIndex | ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'}) |
Discrete | ColumnSchema(semantic_tags={'category'}) |
EmailAddress | ColumnSchema(logical_type=EmailAddress) |
FilePath | ColumnSchema(logical_type=Filepath) |
FullName | ColumnSchema(logical_type=PersonFullName) |
Id | ColumnSchema(semantic_tags={'foreign_key'}) |
Index | ColumnSchema(semantic_tags={'index'}) |
IPAddress | ColumnSchema(logical_type=IPAddress) |
LatLong | ColumnSchema(logical_type=LatLong) |
NaturalLanguage | ColumnSchema(logical_type=NaturalLanguage) |
Numeric | ColumnSchema(semantic_tags={'numeric'}) |
NumericTimeIndex | ColumnSchema(semantic_tags={'numeric', 'time_index'}) |
Ordinal | ColumnSchema(logical_type=Ordinal) |
PhoneNumber | ColumnSchema(logical_type=PhoneNumber) |
SubRegionCode | ColumnSchema(logical_type=SubRegionCode) |
Timedelta | ColumnSchema(logical_type=Timedelta) |
TimeIndex | ColumnSchema(semantic_tags={'time_index'}) |
URL | ColumnSchema(logical_type=URL) |
Unknown | ColumnSchema(logical_type=Unknown) |
ZIPCode | ColumnSchema(logical_type=PostalCode) |
Changes to Deep Feature Synthesis and Calculate Feature Matrix¶
The argument names for both featuretools.dfs and featuretools.calculate_feature_matrix have changed slightly in Featuretools 1.0. In prior versions, users could generate a list of features using the default primitives and options like this:
features = ft.dfs(entityset=es,
target_entity='items',
features_only=True)
In Featuretools 1.0, the target_entity argument has been renamed to target_dataframe_name, but otherwise this basic call remains the same.
[22]:
features = ft.dfs(entityset=es,
target_dataframe_name='items',
features_only=True)
features
[22]:
[<Feature: order_id>,
<Feature: item_price>,
<Feature: on_sale>,
<Feature: orders.COUNT(items)>,
<Feature: orders.MAX(items.item_price)>,
<Feature: orders.MEAN(items.item_price)>,
<Feature: orders.MIN(items.item_price)>,
<Feature: orders.SKEW(items.item_price)>,
<Feature: orders.STD(items.item_price)>,
<Feature: orders.SUM(items.item_price)>,
<Feature: orders.DAY(order_date)>,
<Feature: orders.MONTH(order_date)>,
<Feature: orders.WEEKDAY(order_date)>,
<Feature: orders.YEAR(order_date)>]
In addition, the dfs argument ignore_entities was renamed to ignore_dataframes and ignore_variables was renamed to ignore_columns. Similarly, if specifying primitive options, all references to entities should be replaced with dataframes and references to variables should be replaced with columns. For example, the primitive option of include_groupby_entities is now include_groupby_dataframes and include_variables is now include_columns.
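Because these renames follow a regular pattern, an existing primitive options dictionary can be converted with a simple key substitution. The sketch below uses two hypothetical helpers that are not part of Featuretools. It assumes the options are given as a dictionary mapping each primitive name to a single dictionary of options, and that no option name contains "entities" or "variables" for any other reason.

```python
# Hypothetical migration helpers; not part of the Featuretools API.
def rename_option_keys(options):
    """Replace 'entities' with 'dataframes' and 'variables' with 'columns' in option names."""
    renamed = {}
    for key, value in options.items():
        new_key = key.replace("entities", "dataframes").replace("variables", "columns")
        renamed[new_key] = value
    return renamed


def translate_primitive_options(primitive_options):
    """Apply rename_option_keys to the option dictionary of every primitive."""
    return {
        primitive: rename_option_keys(options)
        for primitive, options in primitive_options.items()
    }


# Options as they might have been written for an older version
old_options = {
    "mean": {"include_variables": {"items": ["item_price"]}},
    "cum_sum": {"include_groupby_entities": ["items"]},
}
new_options = translate_primitive_options(old_options)
print(new_options)
```

The same substitution covers the ignore_entities and ignore_variables arguments of dfs, but not target_entity, which has to be renamed to target_dataframe_name separately.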
The basic call to featuretools.calculate_feature_matrix remains unchanged if passing in an EntitySet along with a list of features to calculate. However, users calling calculate_feature_matrix by passing in a list of entities and relationships should note that the entities argument has been renamed to dataframes and the values in the dictionary should now include Woodwork logical types instead of Featuretools Variable classes.
[23]:
feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)
feature_matrix
[23]:
order_id | item_price | on_sale | orders.COUNT(items) | orders.MAX(items.item_price) | orders.MEAN(items.item_price) | orders.MIN(items.item_price) | orders.SKEW(items.item_price) | orders.STD(items.item_price) | orders.SUM(items.item_price) | orders.DAY(order_date) | orders.MONTH(order_date) | orders.WEEKDAY(order_date) | orders.YEAR(order_date) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||
0 | 0 | 29.95 | False | 1 | 29.95 | 29.950 | 29.95 | NaN | NaN | 29.95 | 2 | 1 | 5 | 2021 |
1 | 1 | 4.99 | True | 2 | 10.25 | 7.620 | 4.99 | NaN | 3.719382 | 15.24 | 3 | 1 | 6 | 2021 |
2 | 1 | 10.25 | False | 2 | 10.25 | 7.620 | 4.99 | NaN | 3.719382 | 15.24 | 3 | 1 | 6 | 2021 |
3 | 2 | 20.50 | True | 2 | 20.50 | 18.245 | 15.99 | NaN | 3.189052 | 36.49 | 4 | 1 | 0 | 2021 |
4 | 2 | 15.99 | False | 2 | 20.50 | 18.245 | 15.99 | NaN | 3.189052 | 36.49 | 4 | 1 | 0 | 2021 |
In addition to the changes in argument names, there are a couple of other changes to the returned feature matrix that users should be aware of. First, because of slight differences in the way Woodwork defines column types compared to the prior Featuretools implementation, there can be some differences in the features that are generated between old and new versions. The most notable impact is in the way foreign key columns are handled. Previously, Featuretools treated all foreign key (previously Id) columns as categorical columns, and would generate appropriate features from these columns. Starting in version 1.0, foreign key columns are not constrained to be categorical, and if they are another type such as Integer, features will not be generated from these columns. Manually converting foreign key columns to Categorical as shown above will result in features much closer to those achieved with previous versions.
Also, because Woodwork’s type inference process differs from the previous Featuretools type inference process, an EntitySet may have column types identified differently. This difference in column types could impact the features that are generated. If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types if there are columns that have been inferred as unexpected types.
Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace as follows.
[24]:
feature_matrix.ww
[24]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
order_id | int64 | Integer | ['numeric', 'foreign_key'] |
item_price | float64 | Double | ['numeric'] |
on_sale | bool | Boolean | [] |
orders.COUNT(items) | Int64 | IntegerNullable | ['numeric'] |
orders.MAX(items.item_price) | float64 | Double | ['numeric'] |
orders.MEAN(items.item_price) | float64 | Double | ['numeric'] |
orders.MIN(items.item_price) | float64 | Double | ['numeric'] |
orders.SKEW(items.item_price) | float64 | Double | ['numeric'] |
orders.STD(items.item_price) | float64 | Double | ['numeric'] |
orders.SUM(items.item_price) | float64 | Double | ['numeric'] |
orders.DAY(order_date) | category | Ordinal | ['category'] |
orders.MONTH(order_date) | category | Ordinal | ['category'] |
orders.WEEKDAY(order_date) | category | Ordinal | ['category'] |
orders.YEAR(order_date) | category | Ordinal | ['category'] |
Featuretools now labels features by whether they were originally in the dataframes or were created by Featuretools. This information is stored in the Woodwork origin attribute for the column. Columns that were in the original data will be labeled with base and features that were created by Featuretools will be labeled with engineered.
As a demonstration of how to access this information, let’s compare two features in the feature matrix: item_price and orders.MEAN(items.item_price). item_price was present in the original data, and orders.MEAN(items.item_price) was created by Featuretools.
[25]:
feature_matrix.ww['item_price'].ww.origin
[25]:
'base'
[26]:
feature_matrix.ww['orders.MEAN(items.item_price)'].ww.origin
[26]:
'engineered'
Other changes¶
In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 of which existing users should be aware.
Column ordering of a dataframe in an EntitySet might be different than it was before. Previously, Featuretools would reorder the columns such that the index column would always be the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column in the dataframe. Now the index column will remain in the position it was in when the dataframe was added to the EntitySet.
For LatLong columns, older versions of Featuretools would replace single nan values in the columns with a tuple (nan, nan). This is no longer the case, and single nan values will now remain in the LatLong column. Based on the behavior in Woodwork, any values of (nan, nan) in a LatLong column will be replaced with a single nan value.
Since Featuretools no longer defines Variable objects with relationships between them, the featuretools.variable_types.graph_variable_types function has been removed.
The featuretools.variable_types.list_variable_types utility function has been deprecated and replaced with two corresponding Woodwork functions: woodwork.list_logical_types and woodwork.list_semantic_tags. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information on the logical types and semantic tags that can be applied to dataframe columns.