Woodwork Typing in Featuretools
Featuretools relies on having consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system that contained objects called Variables. Now and moving forward, Featuretools will use an external data typing library for its typing: Woodwork.
Understanding the Woodwork types that exist and how Featuretools uses Woodwork’s type system will allow users to:
- build EntitySets that best represent their data
- understand the possible input and return types for Featuretools’ Primitives
- understand what features will get generated from a given set of data and primitives.
Read the Understanding Woodwork Logical Types and Semantic Tags guide for an in-depth walkthrough of the available Woodwork types that are outlined below.
For users who are familiar with the old Variable objects, the Transitioning to Featuretools Version 1.0 guide will be useful for converting Variable types to Woodwork types.
Physical Types
Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You might also see the physical type for a column referred to as the column’s dtype.
Knowing a Woodwork DataFrame’s physical types is important because Pandas, Dask, and Spark rely on these types when performing DataFrame operations. Each Woodwork LogicalType class has a single physical type associated with it.
Logical Types
Logical types add additional information about how data should be interpreted or parsed beyond what can be contained in a physical type. In fact, multiple logical types have the same physical type, each imparting a different meaning that’s not contained in the physical type alone.
In Featuretools, a column’s logical type informs how data is read into an EntitySet and how it gets used down the line in Deep Feature Synthesis.
Woodwork provides many different logical types, which can be seen with the list_logical_types function.
[1]:
import featuretools as ft
ft.list_logical_types()
[1]:
name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type | |
---|---|---|---|---|---|---|---|---|
0 | Address | address | Represents Logical Types that contain address ... | string | {} | True | True | None |
1 | Age | age | Represents Logical Types that contain whole nu... | int64 | {numeric} | True | True | Integer |
2 | AgeFractional | age_fractional | Represents Logical Types that contain non-nega... | float64 | {numeric} | True | True | Double |
3 | AgeNullable | age_nullable | Represents Logical Types that contain whole nu... | Int64 | {numeric} | True | True | IntegerNullable |
4 | Boolean | boolean | Represents Logical Types that contain binary v... | bool | {} | True | True | BooleanNullable |
5 | BooleanNullable | boolean_nullable | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
6 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
7 | CountryCode | country_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
8 | CurrencyCode | currency_code | Represents Logical Types that use the ISO-4217... | category | {category} | True | True | Categorical |
9 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
10 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
11 | EmailAddress | email_address | Represents Logical Types that contain email ad... | string | {} | True | True | None |
12 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | None |
13 | IPAddress | ip_address | Represents Logical Types that contain IP addre... | string | {} | True | True | None |
14 | Integer | integer | Represents Logical Types that contain positive... | int64 | {numeric} | True | True | IntegerNullable |
15 | IntegerNullable | integer_nullable | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
16 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
17 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
18 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
19 | PersonFullName | person_full_name | Represents Logical Types that may contain firs... | string | {} | True | True | None |
20 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | None |
21 | PostalCode | postal_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |
22 | SubRegionCode | sub_region_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
23 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | None |
24 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | None |
25 | Unknown | unknown | Represents Logical Types that cannot be inferr... | string | {} | True | True | None |
Featuretools will perform type inference to assign logical types to the data in EntitySets if none are provided, but it is also possible to specify which logical types should be set for any column (provided that the data in that column is compatible with the logical type).
To learn more about how logical types are used in EntitySets, see the Creating EntitySets guide.
To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on working with Logical Types.
Semantic Tags
Semantic tags provide additional information to columns about the meaning or potential uses of data. Columns can have many or no semantic tags. Some tags are added by Woodwork, some are added by Featuretools, and users can add additional tags as they see fit.
To learn more about setting semantic tags directly on a DataFrame, see the Woodwork guide on working with Semantic Tags.
Woodwork-defined Semantic Tags
Woodwork will add certain semantic tags to columns at initialization. These can be standard tags that may be associated with different sets of logical types or index tags. There are also tags that users can add to confer a suggested meaning to columns in Woodwork.
To get a list of these tags, you can use the list_semantic_tags function.
[2]:
ft.list_semantic_tags()
[2]:
name | is_standard_tag | valid_logical_types | |
---|---|---|---|
0 | numeric | True | [Age, AgeFractional, AgeNullable, Double, Inte... |
1 | category | True | [Categorical, CountryCode, CurrencyCode, Ordin... |
2 | index | False | Any LogicalType |
3 | time_index | False | [Datetime, Age, AgeFractional, AgeNullable, Do... |
4 | date_of_birth | False | [Datetime] |
5 | ignore | False | Any LogicalType |
6 | passthrough | False | Any LogicalType |
Above we see the semantic tags that are defined within Woodwork. These tags inform how Featuretools is able to interpret data. One example is the Age primitive, which requires that the date_of_birth semantic tag be present on a column.
The date_of_birth tag will not get automatically added by Woodwork, so in order for Featuretools to be able to use the Age primitive, the date_of_birth tag must be manually added to any columns to which it applies.
Featuretools-defined Semantic Tags
Just like Woodwork specifies semantic tags internally, Featuretools also defines a few tags of its own that allow the full set of Features to be generated. These tags have specific meanings when they are present on a column.
'last_time_index' - added by Featuretools to the last time index column of a DataFrame. Indicates that this column has been created by Featuretools.
'foreign_key' - used to indicate that this column is the child column of a relationship, meaning that this column is related to a corresponding index column of another dataframe in the EntitySet.
Woodwork Throughout Featuretools
Now that we’ve described the elements that make up Woodwork’s type system, let’s see them in action in Featuretools.
Woodwork in EntitySets
For more information on building EntitySets using Woodwork, see the EntitySet guide.
Let’s look at the Woodwork typing information as it’s stored in a demo EntitySet of retail data:
[3]:
es = ft.demo.load_retail()
es
[3]:
Entityset: demo_retail_data
DataFrames:
order_products [Rows: 401604, Columns: 8]
products [Rows: 3684, Columns: 4]
orders [Rows: 22190, Columns: 6]
customers [Rows: 4372, Columns: 3]
Relationships:
order_products.product_id -> products.product_id
order_products.order_id -> orders.order_id
orders.customer_name -> customers.customer_name
Woodwork typing information is not stored in the EntitySet object, but rather is stored in the individual DataFrames that make up the EntitySet. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet, and then access the Woodwork information via the ww namespace:
[4]:
df = es["products"]
df.head()
[4]:
product_id | description | first_order_products_time | _ft_last_time | |
---|---|---|---|---|
85123A | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2010-12-01 08:26:00 | 2011-12-09 11:34:00 |
71053 | 71053 | WHITE METAL LANTERN | 2010-12-01 08:26:00 | 2011-12-07 14:12:00 |
84406B | 84406B | CREAM CUPID HEARTS COAT HANGER | 2010-12-01 08:26:00 | 2011-12-05 14:30:00 |
84029G | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 2010-12-01 08:26:00 | 2011-12-09 11:26:00 |
84029E | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 2010-12-01 08:26:00 | 2011-12-09 09:07:00 |
[5]:
df.ww
[5]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
product_id | category | Categorical | ['index'] |
description | string | NaturalLanguage | [] |
first_order_products_time | datetime64[ns] | Datetime | ['time_index'] |
_ft_last_time | datetime64[ns] | Datetime | ['last_time_index'] |
Notice how the three columns showing this DataFrame’s typing information are the three elements of typing information outlined at the beginning of this guide. To reiterate: By defining physical types, logical types, and semantic tags for each column in a DataFrame, we’ve defined a DataFrame’s Woodwork schema, and with it, we can gain an understanding of the contents of each column.
This column-specific typing information that exists for every column in every DataFrame in an EntitySet is an integral part of Deep Feature Synthesis’ ability to generate features for an EntitySet.
Woodwork in DFS
As the units of computation in Featuretools, Primitives need to be able to specify the input types that they allow as well as have a predictable return type. For an in-depth explanation of Primitives in Featuretools, see the Feature Primitives guide. Here, we’ll look at how the Woodwork types come together into a ColumnSchema object to describe Primitive input and return types.
Below is a Woodwork ColumnSchema that we’ve obtained from the 'product_id' column in the products DataFrame in the retail EntitySet.
[6]:
products_df = es["products"]
product_ids_series = products_df.ww["product_id"]
column_schema = product_ids_series.ww.schema
column_schema
[6]:
<ColumnSchema (Logical Type = Categorical) (Semantic Tags = ['index'])>
This combination of logical type and semantic tag typing information is a ColumnSchema. In the case above, the ColumnSchema describes the type definition for a single column of data.
Notice that there is no physical type in a ColumnSchema. This is because a ColumnSchema is a collection of Woodwork types that doesn’t have any data tied to it and therefore has no physical representation. Because a ColumnSchema object is not tied to any data, it can also be used to describe a type space into which other columns may or may not fall.
This flexibility of the ColumnSchema class allows ColumnSchema objects to be used both as type definitions for every column in an EntitySet as well as input and return type spaces for every Primitive in Featuretools.
Let’s look at a different column in a different DataFrame to see how this works:
[7]:
order_products_df = es["order_products"]
order_products_df.head()
[7]:
order_product_id | order_id | product_id | quantity | order_date | unit_price | total | _ft_last_time | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 536365 | 85123A | 6 | 2010-12-01 08:26:00 | 4.2075 | 25.245 | 2010-12-01 08:26:00 |
1 | 1 | 536365 | 71053 | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
2 | 2 | 536365 | 84406B | 8 | 2010-12-01 08:26:00 | 4.5375 | 36.300 | 2010-12-01 08:26:00 |
3 | 3 | 536365 | 84029G | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
4 | 4 | 536365 | 84029E | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
[8]:
quantity_series = order_products_df.ww["quantity"]
column_schema = quantity_series.ww.schema
column_schema
[8]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
The ColumnSchema above has been pulled from the 'quantity' column in the order_products DataFrame in the retail EntitySet. This is a type definition.
If we look at the Woodwork typing information for the order_products DataFrame, we can see that there are several columns that will have similar ColumnSchema type definitions. If we wanted to describe subsets of those columns, we could define several ColumnSchema type spaces.
[9]:
es["order_products"].ww
[9]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
order_product_id | int64 | Integer | ['index'] |
order_id | category | Categorical | ['foreign_key', 'category'] |
product_id | category | Categorical | ['foreign_key', 'category'] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | ['time_index'] |
unit_price | float64 | Double | ['numeric'] |
total | float64 | Double | ['numeric'] |
_ft_last_time | datetime64[ns] | Datetime | ['last_time_index'] |
Below are several ColumnSchemas that would all include our quantity column, but each of them describes a different type space. These ColumnSchemas get more restrictive as we go down:
Entire DataFrame
No restrictions have been placed; any column falls into this definition. This would include the whole DataFrame.
[10]:
from woodwork.column_schema import ColumnSchema
ColumnSchema()
[10]:
<ColumnSchema>
An example of a Primitive with this ColumnSchema as its input type is the IsNull transform primitive.
By Semantic Tag
Only columns with the numeric tag fall into this definition. This can include columns with the Double, Integer, and Age logical types. It will not include the index column which, despite containing integers, has had its standard tags replaced by the 'index' tag.
[11]:
ColumnSchema(semantic_tags={"numeric"})
[11]:
<ColumnSchema (Semantic Tags = ['numeric'])>
[12]:
df = es["order_products"].ww.select(include="numeric")
df.ww
[12]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
quantity | int64 | Integer | ['numeric'] |
unit_price | float64 | Double | ['numeric'] |
total | float64 | Double | ['numeric'] |
An example of a Primitive with this ColumnSchema as its input type is the Mean aggregation primitive.
By Logical Type
Only columns with the Integer logical type are included in this definition. The numeric tag is not required, so an index column (which has its standard tags removed) would still apply.
[13]:
from woodwork.logical_types import Integer
ColumnSchema(logical_type=Integer)
[13]:
<ColumnSchema (Logical Type = Integer)>
[14]:
df = es["order_products"].ww.select(include="Integer")
df.ww
[14]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
order_product_id | int64 | Integer | ['index'] |
quantity | int64 | Integer | ['numeric'] |
By Logical Type and Semantic Tag
The column must have the Integer logical type and the numeric semantic tag, which excludes index columns.
[15]:
ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
[15]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
[16]:
df = es["order_products"].ww.select(include="numeric")
df = df.ww.select(include="Integer")
df.ww
[16]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
quantity | int64 | Integer | ['numeric'] |
In this way, a ColumnSchema can define a type space under which columns in a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid for a Primitive in building Features during DFS.
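The matching idea can be sketched in plain Python without any Woodwork objects. The ToySchema class and the falls_within helper below are hypothetical stand-ins written for this illustration, not part of the Woodwork or Featuretools API, and the rule they implement is a simplification (for example, it ignores any relationship between parent and child logical types). The column definitions are copied from the order_products typing table shown earlier.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ToySchema:
    """Toy stand-in for a ColumnSchema: an optional logical type plus a set of tags."""

    logical_type: Optional[str] = None
    semantic_tags: FrozenSet[str] = frozenset()


def falls_within(column: ToySchema, space: ToySchema) -> bool:
    """Return True if the column's type definition lies inside the type space."""
    if space.logical_type is not None and column.logical_type != space.logical_type:
        return False
    # Every tag the space asks for must be present on the column.
    return space.semantic_tags <= column.semantic_tags


# Type definitions copied from the order_products typing table.
order_products = {
    "order_product_id": ToySchema("Integer", frozenset({"index"})),
    "order_id": ToySchema("Categorical", frozenset({"foreign_key", "category"})),
    "product_id": ToySchema("Categorical", frozenset({"foreign_key", "category"})),
    "quantity": ToySchema("Integer", frozenset({"numeric"})),
    "order_date": ToySchema("Datetime", frozenset({"time_index"})),
    "unit_price": ToySchema("Double", frozenset({"numeric"})),
    "total": ToySchema("Double", frozenset({"numeric"})),
    "_ft_last_time": ToySchema("Datetime", frozenset({"last_time_index"})),
}


def select(space: ToySchema) -> list:
    """List the columns whose type definition falls within the given type space."""
    return [name for name, schema in order_products.items() if falls_within(schema, space)]


all_columns = select(ToySchema())
numeric_columns = select(ToySchema(semantic_tags=frozenset({"numeric"})))
integer_columns = select(ToySchema(logical_type="Integer"))
integer_numeric_columns = select(ToySchema("Integer", frozenset({"numeric"})))
print(numeric_columns, integer_columns, integer_numeric_columns)
```

Under these assumptions the four type spaces pick out the same columns as the four ww.select examples: every column, the three numeric columns, the two Integer columns, and quantity alone.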
Each Primitive has input_types and a return_type that are described by a Woodwork ColumnSchema. Every DataFrame in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the relevant columns in the DataFrame that are valid for the Primitive’s input_types. The resulting Feature has a column_schema property that describes the Feature’s type definition, which lets DFS stack features on top of one another.
In this way, Featuretools is able to leverage the base unit of Woodwork typing information, the ColumnSchema, and use it in concert with an EntitySet of Woodwork DataFrames in order to build Features with Deep Feature Synthesis.