NOTICE

The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:

pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip

For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.

Transitioning to Featuretools Version 1.0¶

Featuretools version 1.0 incorporates many significant changes that impact the way EntitySets are created, how primitives are defined, and in some cases the resulting feature matrix that is created. This document will provide an overview of the significant changes, helping existing Featuretools users transition to version 1.0.

Background and Introduction¶

Why make these changes?¶

The lack of a unified type system across libraries makes sharing information between libraries more difficult. This problem led to the development of Woodwork. Updating Featuretools to use Woodwork for managing column typing information enables easy sharing of feature matrix column types with other libraries without costly conversions between custom type systems. As an example, EvalML, which has also adopted Woodwork, can now use Woodwork typing information on a feature matrix directly to create machine learning models, without first inferring or redefining column types.

Other benefits of using Woodwork for managing typing in Featuretools include:

Simplified code - custom type management code has been removed
Seamless integration of new types and improvements to type integration as Woodwork improves
Easy and flexible storage of additional information about columns. For example, we can now store whether a feature was engineered by Featuretools or present in the original data.

What has changed?¶

The legacy Featuretools custom typing system has been replaced with Woodwork for managing column types
Both the Entity and Variable classes have been removed from Featuretools
Several key Featuretools methods have been moved or updated

Comparison between legacy typing system and Woodwork typing systems¶

Featuretools < 1.0	Featuretools 1.0	Description
Entity	Woodwork DataFrame	stores typing information for all columns
Variable	ColumnSchema	stores typing information for a single column
Variable subclass	LogicalType and semantic_tags	elements used to define a column type

Summary of significant method changes¶

The table below outlines the most significant changes that have occurred. In some cases, the method arguments have also changed, and those changes are outlined in more detail throughout this document.

Older Versions	Featuretools 1.0
EntitySet.entity_from_dataframe	EntitySet.add_dataframe
EntitySet.normalize_entity	EntitySet.normalize_dataframe
Entity.update_data	EntitySet.replace_dataframe
Entity.variable_types	es['dataframe_name'].ww
es['entity_id']['variable_name']	es['dataframe_name'].ww.columns['column_name']
Entity.convert_variable_type	es['dataframe_name'].ww.set_types
Entity.add_interesting_values	es.add_interesting_values(dataframe_name='df_name', …)
Entity.set_secondary_time_index	es.set_secondary_time_index(dataframe_name='df_name', …)
Feature(es['entity_id']['variable_name'])	Feature(es['dataframe_name'].ww['column_name'])
dfs(target_entity='entity_id', …)	dfs(target_dataframe_name='dataframe_name', …)

For more information on how Woodwork manages typing information, refer to the Woodwork Understanding Types and Tags guide.
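As a rough aid when auditing existing code against the method changes table above, the sketch below scans a source string for a few of the legacy names and reports a suggested 1.0 replacement. The `LEGACY_TO_NEW` dictionary and the `find_legacy_names` function are hypothetical helpers written for this illustration; they are not part of Featuretools, and they cover only a subset of the renames.

```python
import re

# Hypothetical lookup built from a subset of the rename table above.
LEGACY_TO_NEW = {
    'entity_from_dataframe': 'EntitySet.add_dataframe',
    'normalize_entity': 'EntitySet.normalize_dataframe',
    'update_data': 'EntitySet.replace_dataframe',
    'variable_types': '.ww (Woodwork namespace on the dataframe)',
    'convert_variable_type': '.ww.set_types',
    'target_entity': 'target_dataframe_name',
}


def find_legacy_names(source):
    """Return (line_number, legacy_name, replacement) for each legacy name found."""
    pattern = re.compile(
        r'\b(' + '|'.join(re.escape(name) for name in LEGACY_TO_NEW) + r')\b'
    )
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in pattern.finditer(line):
            name = match.group(1)
            findings.append((line_number, name, LEGACY_TO_NEW[name]))
    return findings


# A made-up snippet of pre-1.0 code to scan.
legacy_script = """
es.entity_from_dataframe(dataframe=df, entity_id='orders', index='order_id')
fm, features = ft.dfs(entityset=es, target_entity='orders')
"""

findings = find_legacy_names(legacy_script)
for finding in findings:
    print(finding)
```

A plain text scan like this can produce false positives (for example, a matching name inside a comment), so it is best treated as a checklist generator rather than an automatic rewriter.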

What do these changes mean for users?¶

Removing these classes required moving several methods from the Entity to the EntitySet object. This change also impacts the way relationships, features and primitives are defined, requiring different parameters than were previously required. Also, because the Woodwork typing system is not identical to the old Featuretools typing system, in some cases the feature matrix that is returned can be slightly different as a result of columns being identified as different types.

All of these changes, and more, will be reviewed in detail throughout this document, providing examples of both the old and new API where possible.

Removal of `Entity` Class and Updates to `EntitySet`¶

In previous versions of Featuretools an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are now created by adding multiple dataframes and defining relationships between columns in the dataframes. While conceptually similar, there are some minor differences in the process.

Adding dataframes to an EntitySet¶

When adding dataframes to an EntitySet, users can pass in a Woodwork dataframe or a regular dataframe without Woodwork typing information. As before, Featuretools supports creating EntitySets from pandas, Dask and Koalas dataframes. If users supply a dataframe that has Woodwork typing information initialized, Featuretools will simply use this typing information directly. If users supply a dataframe without Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified.

Below are some examples to illustrate this process. First we will create two small dataframes to use for the example.

[1]:

import featuretools as ft
import pandas as pd
import woodwork as ww

[2]:

orders_df = pd.DataFrame({
    'order_id': [0, 1, 2],
    'order_date': ['2021-01-02', '2021-01-03', '2021-01-04']
})
items_df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'order_id': [0, 1, 1, 2, 2],
    'item_price': [29.95, 4.99, 10.25, 20.50, 15.99],
    'on_sale': [False, True, False, True, False]
})

With older versions of Featuretools, users would first create an EntitySet object, and then add dataframes to the EntitySet, by calling entity_from_dataframe as shown below.

es = ft.EntitySet('old_es')

es.entity_from_dataframe(dataframe=orders_df,
                         entity_id='orders',
                         index='order_id',
                         time_index='order_date')
es.entity_from_dataframe(dataframe=items_df,
                         entity_id='items',
                         index='id')

Entityset: old_es
  Entities:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 3]
  Relationships:
    No relationships

With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same, but some of the details have changed. First, create an EntitySet as before. To add the dataframe, call EntitySet.add_dataframe in place of the previous EntitySet.entity_from_dataframe call. Note that the name of the dataframe is specified in the dataframe_name argument, which was previously called entity_id.

[3]:

es = ft.EntitySet('new_es')

es.add_dataframe(dataframe=orders_df,
                 dataframe_name='orders',
                 index='order_id',
                 time_index='order_date')

[3]:

Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
  Relationships:
    No relationships

You can also define the name, index, and time index by first initializing Woodwork on the dataframe and then passing the Woodwork initialized dataframe directly to the add_dataframe call. For this example we will initialize Woodwork on items_df, setting the dataframe name as items and specifying that the index should be the id column.

[4]:

items_df.ww.init(name='items', index='id')
items_df.ww

[4]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
id	int64	Integer	['index']
order_id	int64	Integer	['numeric']
item_price	float64	Double	['numeric']
on_sale	bool	Boolean	[]

With Woodwork initialized, we no longer need to specify values for the dataframe_name or index arguments when calling add_dataframe as Featuretools will simply use the values that were already specified when Woodwork was initialized.

[5]:

es.add_dataframe(dataframe=items_df)

[5]:

Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 4]
  Relationships:
    No relationships

Accessing column typing information¶

Previously, column variable type information could be accessed for an entire Entity through Entity.variable_types or for an individual column by selecting the individual column first through es['entity_id']['col_id'].

es['items'].variable_types

{'id': featuretools.variable_types.variable.Index,
 'order_id': featuretools.variable_types.variable.Numeric,
 'item_price': featuretools.variable_types.variable.Numeric}

es['items']['item_price']

<Variable: item_price (dtype = numeric)>

With the updated version of Featuretools, the logical types and semantic tags for all of the columns in a single dataframe can be viewed through the .ww namespace on the dataframe. First, select the dataframe from the EntitySet with es['dataframe_name'] and then access the typing information by chaining a .ww call on the end as shown below.

[6]:

es['items'].ww

[6]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
id	int64	Integer	['index']
order_id	int64	Integer	['numeric']
item_price	float64	Double	['numeric']
on_sale	bool	Boolean	[]

The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, which returns a Woodwork ColumnSchema object that stores the typing information:

[7]:

es['items'].ww.columns['item_price']

[7]:

<ColumnSchema (Logical Type = Double) (Semantic Tags = ['numeric'])>

Type inference and updating column types¶

Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine what variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the Entity.variable_types dictionary.

Starting in Featuretools 1.0, column type inference is handled by Woodwork. Any columns that do not have a logical type assigned by the user when adding a dataframe to an EntitySet will have their logical types inferred by Woodwork. As before, type inference can be skipped for any columns in a dataframe by passing the appropriate logical types in a dictionary when calling EntitySet.add_dataframe.

As an example, we can create a new dataframe and add it to an EntitySet, specifying the logical type for the user’s full name as the Woodwork PersonFullName logical type.

[8]:

users_df = pd.DataFrame({
    'id': [0, 1, 2],
    'name': ['John Doe', 'Rita Book', 'Teri Dactyl']
})

[9]:

es.add_dataframe(dataframe=users_df,
                 dataframe_name='users',
                 index='id',
                 logical_types={'name': 'PersonFullName'})

es['users'].ww

[9]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
id	int64	Integer	['index']
name	string	PersonFullName	[]

Looking at the typing information above, we can see that the logical type for the name column was set to PersonFullName as we specified.

Situations will occur where type inference identifies a column as having the incorrect logical type. In these situations, the logical type can be updated using the Woodwork set_types method. Let’s say we want the order_id column of the items dataframe to have a Categorical logical type instead of the Integer type that was inferred. Previously, this would have been accomplished through the Entity.convert_variable_type method.

from featuretools.variable_types import Categorical

es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)

Now, we can perform this same update using Woodwork:

[10]:

es['items'].ww.set_types(logical_types={'order_id': 'Categorical'})
es['items'].ww

[10]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
id	int64	Integer	['index']
order_id	category	Categorical	['category']
item_price	float64	Double	['numeric']
on_sale	bool	Boolean	[]

For additional information on Woodwork typing and how it is used in Featuretools, refer to Woodwork Typing in Featuretools.

Adding interesting values¶

Interesting values can be added to all dataframes in an EntitySet, a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet.

To add interesting values for all of the dataframes in an EntitySet, simply call EntitySet.add_interesting_values, optionally specifying the maximum number of values to add for each column. This remains unchanged from older versions of Featuretools to the 1.0 release.

Adding values for a single dataframe or for a single column has changed. Previously, to add interesting values for an Entity, users would call Entity.add_interesting_values():

es['items'].add_interesting_values()

Now, in order to specify interesting values for a single dataframe, you call add_interesting_values on the EntitySet, and pass the name of the dataframe for which you want interesting values added:

[11]:

es.add_interesting_values(dataframe_name='items')

Previously, to manually add interesting values for a column, you would simply assign them to the attribute of the variable:

es['items']['order_id'].interesting_values = [1, 2]

Now, this is done through EntitySet.add_interesting_values, passing in the name of the dataframe and a dictionary mapping column names to the interesting values to assign for that column. For example, to assign the interesting values of [1, 2] to the order_id column of the items dataframe, use the following approach:

[12]:

es.add_interesting_values(dataframe_name='items',
                          values={'order_id': [1, 2]})

Interesting values for multiple columns in the same dataframe can be assigned by adding more entries to the dictionary passed to the values parameter.

Accessing interesting values has changed as well. Previously interesting values could be viewed from the variable:

es['items']['order_id'].interesting_values

Interesting values are now stored in the Woodwork metadata for the columns in a dataframe:

[13]:

es['items'].ww.columns['order_id'].metadata['interesting_values']

[13]:

[1, 2]

Setting a secondary time index¶

In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling Entity.set_secondary_time_index.

es_flight = ft.demo.load_flight(nrows=100)

arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',
                    'national_airspace_delay', 'security_delay',
                    'late_aircraft_delay', 'canceled', 'diverted',
                    'taxi_in', 'taxi_out', 'air_time', 'dep_time']
es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})

Since the Entity class has been removed in Featuretools 1.0, this now needs to be done through the EntitySet instead:

[14]:

es_flight = ft.demo.load_flight(nrows=100)

arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',
                    'national_airspace_delay', 'security_delay',
                    'late_aircraft_delay', 'canceled', 'diverted',
                    'taxi_in', 'taxi_out', 'air_time', 'dep_time']
es_flight.set_secondary_time_index(dataframe_name='trip_logs',
                                   secondary_time_index={'arr_time': arr_time_columns})

Downloading data ...

Previously, the secondary time index could be accessed directly from the Entity with es_flight['trip_logs'].secondary_time_index. Starting in Featuretools 1.0 the secondary time index and the associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below.

[15]:

es_flight['trip_logs'].ww.metadata['secondary_time_index']

[15]:

{'arr_time': ['arr_delay',
  'dep_delay',
  'carrier_delay',
  'weather_delay',
  'national_airspace_delay',
  'security_delay',
  'late_aircraft_delay',
  'canceled',
  'diverted',
  'taxi_in',
  'taxi_out',
  'air_time',
  'dep_time',
  'arr_time']}

Normalizing Entities/DataFrames¶

EntitySet.normalize_entity has been renamed to EntitySet.normalize_dataframe in Featuretools 1.0. The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. When calling this method, the new parameter names need to be used.

Old Parameter Name	New Parameter Name
base_entity_id	base_dataframe_name
new_entity_id	new_dataframe_name
additional_variables	additional_columns
copy_variables	copy_columns
new_entity_time_index	new_dataframe_time_index
new_entity_secondary_time_index	new_dataframe_secondary_time_index
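
When many existing normalize_entity calls need updating, the renames in the table above can be applied mechanically. The sketch below is one possible way to do this with a plain dictionary; `NORMALIZE_KWARG_RENAMES`, `translate_normalize_kwargs`, and the `sale_status` dataframe name are hypothetical names chosen for this illustration and are not part of Featuretools.

```python
# Old -> new keyword names, copied from the table above.
NORMALIZE_KWARG_RENAMES = {
    'base_entity_id': 'base_dataframe_name',
    'new_entity_id': 'new_dataframe_name',
    'additional_variables': 'additional_columns',
    'copy_variables': 'copy_columns',
    'new_entity_time_index': 'new_dataframe_time_index',
    'new_entity_secondary_time_index': 'new_dataframe_secondary_time_index',
}


def translate_normalize_kwargs(old_kwargs):
    """Return a copy of old_kwargs with legacy keyword names replaced.

    Keywords that were not renamed (such as index) pass through unchanged.
    """
    return {
        NORMALIZE_KWARG_RENAMES.get(key, key): value
        for key, value in old_kwargs.items()
    }


# Keyword arguments as they might have been passed to normalize_entity.
old_call = {
    'base_entity_id': 'items',
    'new_entity_id': 'sale_status',
    'index': 'on_sale',
}

new_call = translate_normalize_kwargs(old_call)
print(new_call)
```

The translated dictionary could then be unpacked into the new method, for example `es.normalize_dataframe(**new_call)`.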

Defining and adding relationships¶

In earlier versions of Featuretools, relationships were defined by creating a Relationship object, which took two Variables as inputs. To define a relationship between the orders Entity and the items Entity, we would first create a Relationship and then add it to the EntitySet:

relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])
es.add_relationship(relationship)

With Featuretools 1.0, the process is similar, but there are two different ways to add the relationship to the EntitySet. One way is to pass the dataframe and column names to EntitySet.add_relationship, and another is to pass a previously created Relationship object to the relationship keyword argument. Both approaches are demonstrated below.

[16]:

# Undo change from above and change child column logical type to match parent and prevent warning
# NOTE: This cell is hidden in the docs build
es['items'].ww.set_types(logical_types={'order_id': 'Integer'})

[17]:

es.add_relationship(parent_dataframe_name='orders',
                    parent_column_name='order_id',
                    child_dataframe_name='items',
                    child_column_name='order_id')

[17]:

Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 4]
    users [Rows: 3, Columns: 2]
  Relationships:
    items.order_id -> orders.order_id

[18]:

# Reset the relationship so we can add it again
# NOTE: This cell is hidden in the docs build
es.relationships = []

Alternatively, we can first create a Relationship and pass that to EntitySet.add_relationship. When defining a Relationship we need to pass in the EntitySet to which it belongs along with the names for the parent dataframe and parent column and the name of the child dataframe and child column.

[19]:

relationship = ft.Relationship(entityset=es,
                               parent_dataframe_name='orders',
                               parent_column_name='order_id',
                               child_dataframe_name='items',
                               child_column_name='order_id')
es.add_relationship(relationship=relationship)

[19]:

Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 4]
    users [Rows: 3, Columns: 2]
  Relationships:
    items.order_id -> orders.order_id

Updating data for a dataframe in an EntitySet¶

Previously, to update (replace) the data associated with an Entity, users could call Entity.update_data and pass in the new dataframe. As an example, let’s update the data in our users Entity:

new_users_df = pd.DataFrame({
    'id': [3, 4],
    'name': ['Anne Teak', 'Art Decco']
})

es['users'].update_data(df=new_users_df)

To accomplish this task with Featuretools 1.0, we will use the EntitySet.replace_dataframe method instead:

[20]:

new_users_df = pd.DataFrame({
    'id': [0, 1],
    'name': ['Anne Teak', 'Art Decco']
})

es.replace_dataframe(dataframe_name='users', df=new_users_df)
es['users']

[20]:

	id	name
0	0	Anne Teak
1	1	Art Decco

Defining features¶

The syntax for defining features has changed slightly in Featuretools 1.0. Previously, identity features could be defined simply by passing in the variable that should be used to build the feature.

feature = ft.Feature(es['items']['item_price'])

Starting with Featuretools 1.0, a similar syntax can be used, but because es['items'] will now return a Woodwork dataframe instead of an Entity, we need to update the syntax slightly to access the Woodwork column. To update, simply add .ww between the dataframe name selector and the column selector as shown below.

[21]:

feature = ft.Feature(es['items'].ww['item_price'])

Defining primitives¶

In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate Variable class. Starting in version 1.0, the input and return types are defined by Woodwork ColumnSchema objects.

To illustrate this change, let’s look closer at the Age transform primitive. This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person’s age. In previous versions of Featuretools, the input type was defined by specifying the DateOfBirth variable type and the return type was specified by the Numeric variable type:

input_types = [DateOfBirth]
return_type = Numeric

Woodwork does not have a specific DateOfBirth logical type, but rather identifies a column as a date of birth column by specifying the logical type as Datetime with a semantic tag of date_of_birth. There is also no Numeric logical type in Woodwork, but rather Woodwork identifies all columns that can be used for numeric operations with the semantic tag of numeric. Furthermore, we know the Age primitive will return a floating point number, which would correspond to a Woodwork logical type of Double. With these items in mind, we can redefine the Age input types and return types with ColumnSchema objects as follows:

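from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]
return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})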

Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged.

Mapping from old Featuretools variable types to Woodwork ColumnSchemas¶

Types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. While there is not a direct mapping from the old variable types to the new Woodwork types defined by ColumnSchema objects, the approximate mapping is shown below.

Featuretools Variable	Woodwork Column Schema
Boolean	ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable)
Categorical	ColumnSchema(logical_type=Categorical)
CountryCode	ColumnSchema(logical_type=CountryCode)
Datetime	ColumnSchema(logical_type=Datetime)
DateOfBirth	ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})
DatetimeTimeIndex	ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'})
Discrete	ColumnSchema(semantic_tags={'category'})
EmailAddress	ColumnSchema(logical_type=EmailAddress)
FilePath	ColumnSchema(logical_type=Filepath)
FullName	ColumnSchema(logical_type=PersonFullName)
Id	ColumnSchema(semantic_tags={'foreign_key'})
Index	ColumnSchema(semantic_tags={'index'})
IPAddress	ColumnSchema(logical_type=IPAddress)
LatLong	ColumnSchema(logical_type=LatLong)
NaturalLanguage	ColumnSchema(logical_type=NaturalLanguage)
Numeric	ColumnSchema(semantic_tags={'numeric'})
NumericTimeIndex	ColumnSchema(semantic_tags={'numeric', 'time_index'})
Ordinal	ColumnSchema(logical_type=Ordinal)
PhoneNumber	ColumnSchema(logical_type=PhoneNumber)
SubRegionCode	ColumnSchema(logical_type=SubRegionCode)
Timedelta	ColumnSchema(logical_type=Timedelta)
TimeIndex	ColumnSchema(semantic_tags={'time_index'})
URL	ColumnSchema(logical_type=URL)
Unknown	ColumnSchema(logical_type=Unknown)
ZIPCode	ColumnSchema(logical_type=PostalCode)

Changes to Deep Feature Synthesis and Calculate Feature Matrix¶

The argument names for both featuretools.dfs and featuretools.calculate_feature_matrix have changed slightly in Featuretools 1.0. In prior versions, users could generate a list of features using the default primitives and options like this:

features = ft.dfs(entityset=es,
                  target_entity='items',
                  features_only=True)

In Featuretools 1.0, the target_entity argument has been renamed to target_dataframe_name, but otherwise this basic call remains the same.

[22]:

features = ft.dfs(entityset=es,
                  target_dataframe_name='items',
                  features_only=True)
features

[22]:

[<Feature: order_id>,
 <Feature: item_price>,
 <Feature: on_sale>,
 <Feature: orders.COUNT(items)>,
 <Feature: orders.MAX(items.item_price)>,
 <Feature: orders.MEAN(items.item_price)>,
 <Feature: orders.MIN(items.item_price)>,
 <Feature: orders.SKEW(items.item_price)>,
 <Feature: orders.STD(items.item_price)>,
 <Feature: orders.SUM(items.item_price)>,
 <Feature: orders.DAY(order_date)>,
 <Feature: orders.MONTH(order_date)>,
 <Feature: orders.WEEKDAY(order_date)>,
 <Feature: orders.YEAR(order_date)>]

In addition, the dfs argument ignore_entities was renamed to ignore_dataframes and ignore_variables was renamed to ignore_columns. Similarly, if specifying primitive options, all references to entities should be replaced with dataframes and references to variables should be replaced with columns. For example, the primitive option of include_groupby_entities is now include_groupby_dataframes and include_variables is now include_columns.
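
The renaming rules in this section can be sketched as a small keyword translation step. The `DFS_KWARG_RENAMES` dictionary and the two functions below are hypothetical helpers, not part of Featuretools, and the sketch assumes each entry in primitive_options maps a primitive name to a single dictionary of options.

```python
# Top-level dfs keyword renames named in this section.
DFS_KWARG_RENAMES = {
    'target_entity': 'target_dataframe_name',
    'ignore_entities': 'ignore_dataframes',
    'ignore_variables': 'ignore_columns',
}


def rename_option_key(key):
    """Replace 'entities' with 'dataframes' and 'variables' with 'columns'."""
    return key.replace('entities', 'dataframes').replace('variables', 'columns')


def translate_dfs_kwargs(old_kwargs):
    """Return a copy of old_kwargs using the Featuretools 1.0 keyword names."""
    new_kwargs = {}
    for key, value in old_kwargs.items():
        if key == 'primitive_options':
            # Assumption: every value here is a plain dict of options.
            value = {
                primitive: {rename_option_key(k): v for k, v in options.items()}
                for primitive, options in value.items()
            }
        new_kwargs[DFS_KWARG_RENAMES.get(key, key)] = value
    return new_kwargs


# Keyword arguments as they might have been passed to dfs before 1.0.
old_kwargs = {
    'target_entity': 'items',
    'ignore_variables': {'items': ['on_sale']},
    'primitive_options': {
        'mean': {'include_variables': {'items': ['item_price']}},
    },
}

new_kwargs = translate_dfs_kwargs(old_kwargs)
print(new_kwargs)
```

The dataframe and column names inside the option values are left untouched, since only the keyword and option names changed.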

The basic call to featuretools.calculate_feature_matrix remains unchanged if passing in an EntitySet along with a list of features to calculate. However, users calling calculate_feature_matrix by passing in entities and relationships should note that the entities argument has been renamed to dataframes, and the values in that dictionary should now include Woodwork logical types instead of Featuretools Variable classes.

[23]:

feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)
feature_matrix

[23]:

	order_id	item_price	on_sale	orders.COUNT(items)	orders.MAX(items.item_price)	orders.MEAN(items.item_price)	orders.MIN(items.item_price)	orders.SKEW(items.item_price)	orders.STD(items.item_price)	orders.SUM(items.item_price)	orders.DAY(order_date)	orders.MONTH(order_date)	orders.WEEKDAY(order_date)	orders.YEAR(order_date)
id
0	0	29.95	False	1	29.95	29.950	29.95	NaN	NaN	29.95	2	1	5	2021
1	1	4.99	True	2	10.25	7.620	4.99	NaN	3.719382	15.24	3	1	6	2021
2	1	10.25	False	2	10.25	7.620	4.99	NaN	3.719382	15.24	3	1	6	2021
3	2	20.50	True	2	20.50	18.245	15.99	NaN	3.189052	36.49	4	1	0	2021
4	2	15.99	False	2	20.50	18.245	15.99	NaN	3.189052	36.49	4	1	0	2021

In addition to the changes in argument names, there are a couple of other changes to the returned feature matrix that users should be aware of. First, because of slight differences in the way Woodwork defines column types compared to how the prior Featuretools implementation did, there can be some differences in the features that are generated between old and new versions. The most notable impact is in the way foreign key columns are handled. Previously, Featuretools treated all foreign key (previously Id) columns as categorical columns, and would generate appropriate features from these columns. Starting in version 1.0, foreign key columns are not constrained to be categorical, and if they are another type such as Integer, features will not be generated from these columns. Manually converting foreign key columns to Categorical as shown above will result in features much closer to those achieved with previous versions.

Also, because Woodwork’s type inference process differs from the previous Featuretools type inference process, an EntitySet may have column types identified differently. This difference in column types could impact the features that are generated. If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types if there are columns that have been inferred as unexpected types.

Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace as follows.

[24]:

feature_matrix.ww

[24]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_id	int64	Integer	['foreign_key', 'numeric']
item_price	float64	Double	['numeric']
on_sale	bool	Boolean	[]
orders.COUNT(items)	Int64	IntegerNullable	['numeric']
orders.MAX(items.item_price)	float64	Double	['numeric']
orders.MEAN(items.item_price)	float64	Double	['numeric']
orders.MIN(items.item_price)	float64	Double	['numeric']
orders.SKEW(items.item_price)	float64	Double	['numeric']
orders.STD(items.item_price)	float64	Double	['numeric']
orders.SUM(items.item_price)	float64	Double	['numeric']
orders.DAY(order_date)	category	Ordinal	['category']
orders.MONTH(order_date)	category	Ordinal	['category']
orders.WEEKDAY(order_date)	category	Ordinal	['category']
orders.YEAR(order_date)	category	Ordinal	['category']

Featuretools now labels features by whether they were originally in the dataframes, or whether they were created by Featuretools. This information is stored in the Woodwork origin attribute for the column. Columns that were in the original data will be labeled with base and features that were created by Featuretools will be labeled with engineered.

As a demonstration of how to access this information, let’s compare two features in the feature matrix: item_price and orders.MEAN(items.item_price). item_price was present in the original data, and orders.MEAN(items.item_price) was created by Featuretools.

[25]:

feature_matrix.ww['item_price'].ww.origin

[25]:

'base'

[26]:

feature_matrix.ww['orders.MEAN(items.item_price)'].ww.origin

[26]:

'engineered'

Other changes¶

In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 of which existing users should be aware.

Column ordering of a dataframe in an EntitySet might be different than it was before. Previously, Featuretools would reorder the columns such that the index column would always be the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column in the dataframe. Now the index column will remain in the position it was when the dataframe was added to the EntitySet.
For LatLong columns, older versions of Featuretools would replace single nan values in the columns with a tuple (nan, nan). This is no longer the case, and single nan values will now remain in the LatLong column. Based on the behavior in Woodwork, any values of (nan, nan) in a LatLong column will be replaced with a single nan value.
Since Featuretools no longer defines Variable objects with relationships between them, the featuretools.variable_types.graph_variable_types function has been removed.
The featuretools.variable_types.list_variable_types utility function has been deprecated and replaced with two corresponding Woodwork functions: woodwork.list_logical_types and woodwork.list_semantic_tags. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information on the logical types and semantic tags that can be applied to dataframe columns.
