NOTICE
The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:
pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip
For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.
A Variable is analogous to a column in a table in a relational database. When creating an Entity, Featuretools will attempt to infer the types of variables present. Featuretools also allows for explicitly specifying the variable types when creating the Entity.
It is important that datasets have appropriately defined variable types when using DFS because this will allow the correct primitives to be used to generate new features.
Note: When using Dask Entities, users must explicitly specify the variable types for all columns in the Entity dataframe.
To understand the different variable types in Featuretools, let’s first look at a graph of the variables:
[1]:
from featuretools.variable_types import graph_variable_types

graph_variable_types()
As we can see, there are multiple variable types, and some have subclassed variable types. For example, ZIPCode is a variable type that is a child of the Categorical type, which is in turn a child of the Discrete type.
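The parent/child relationship can be pictured as ordinary Python subclassing. The following is a minimal sketch that uses hypothetical stand-in classes (they are not the real Featuretools classes) and a hypothetical helper, ancestry, to walk from a variable type up to the base type:

```python
# Hypothetical stand-in classes that mirror part of the hierarchy described
# in the text. They are NOT the real featuretools classes.
class Variable:
    pass


class Discrete(Variable):
    pass


class Categorical(Discrete):
    pass


class Ordinal(Discrete):
    pass


class ZIPCode(Categorical):
    pass


def ancestry(variable_type):
    """Return the class names from variable_type up to the base Variable."""
    return [cls.__name__ for cls in variable_type.__mro__ if cls is not object]


print(ancestry(ZIPCode))
# ['ZIPCode', 'Categorical', 'Discrete', 'Variable']
print(issubclass(ZIPCode, Discrete))
# True
```

Because a ZIPCode is also a Categorical and a Discrete in this sketch, anything written against the parent types would accept it as well.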
Let’s explore some of the variable types and understand them in detail.
A Discrete variable type can only take certain values. It is a type of data that can be counted but cannot be measured. If the data can be classified into distinct buckets, then it is a Discrete variable type.
Discrete has two sub-variable types: Categorical and Ordinal. If the data has a certain ordering, it is of Ordinal type. If it cannot be ordered, then it is of Categorical type.
A Categorical variable type can take unordered discrete values, usually from a limited, fixed set of possible values. Categorical variable types can be represented as strings or integers.
Some examples of Categorical variable types:
Gender
Eye Color
Nationality
Hair Color
Spoken Language
An Ordinal variable type can take ordered discrete values. Similar to Categorical, it usually has a limited, fixed set of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings or integers.
Some examples of Ordinal variable types:
Educational Background (Elementary, High School, Undergraduate, Graduate)
Satisfaction Rating (“Not Satisfied”, “Satisfied”, “Very Satisfied”)
Spicy Level (Hot, Hotter, Hottest)
Student Grade (A, B, C, D, F)
Size (small, medium, large)
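The difference between unordered and ordered discrete values can be illustrated at the pandas level. This is a sketch using pandas.Categorical only; it is not a statement about how Featuretools stores Categorical or Ordinal variables internally, and the sample values are made up for illustration:

```python
import pandas as pd

# Unordered categories: the values have no inherent ranking.
eye_color = pd.Categorical(["brown", "blue", "green", "brown"])

# Ordered categories: the order of `categories` defines the ranking.
size = pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(eye_color.ordered)
print(size.ordered)

# Ranking-based operations are only meaningful for the ordered data.
print(size.min(), size.max())
# small large
print((size < "large").tolist())
# [True, True, False, True]
```

Comparisons such as "less than" and aggregations such as min and max only make sense when the ordering carries meaning, which is why the Ordinal distinction matters.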
There are also more distinctions within the Categorical variable type. These include CountryCode, Id, SubRegionCode, and ZIPCode.
It is important to make this distinction because certain operations apply to some Categorical types but not necessarily to all of them. For example, a custom primitive could extract the first 5 digits of a ZIPCode. This operation is not valid for all Categorical variable types, so it is appropriate to use the more specific ZIPCode variable type.
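The ZIP code operation mentioned above might look like the following plain Python function. The name first_five_digits is hypothetical, and the sketch assumes ZIP codes are stored as strings; in Featuretools such a function could be wrapped in a custom primitive whose input type is ZIPCode.

```python
def first_five_digits(zip_code):
    """Return the 5-digit prefix of a ZIP or ZIP+4 code given as a string."""
    return zip_code[:5]


zip_codes = ["02116", "60091-1234", "94103"]
print([first_five_digits(z) for z in zip_codes])
# ['02116', '60091', '94103']
```

Applying the same slicing to an arbitrary Categorical column, such as eye color, would produce meaningless values, which is the reason for restricting the operation to ZIPCode.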
A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings or integers. However, they should be in an interpretable format or properly cast before using DFS.
Some examples of Datetime include:
transaction time
flight departure time
pickup time
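One way to cast string timestamps before building an entity is pandas.to_datetime. The sketch below uses made-up departure times and an explicit format string; the column name is an assumption for illustration:

```python
import pandas as pd

# Raw timestamps stored as strings (illustrative values).
raw = pd.Series([
    "2021-03-01 08:15:00",
    "2021-03-02 17:40:00",
    "2021-03-05 06:05:00",
], name="flight_departure_time")

# Cast to a datetime dtype using an explicit, unambiguous format.
departure_time = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

print(pd.api.types.is_datetime64_any_dtype(departure_time))
# True
print(departure_time.dt.hour.tolist())
# [8, 17, 6]
```

Passing an explicit format avoids ambiguity between, for example, day-first and month-first date strings.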
A more specific type of Datetime is DateOfBirth. This is an important distinction because it allows additional primitives to be applied to the data to generate new features. For example, having a DateOfBirth variable type allows the Age primitive to be applied during DFS, producing a new Numeric feature.
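To illustrate why an age calculation is meaningful for a date of birth but not for an arbitrary datetime, here is a minimal sketch of computing an age in whole years. The helper age_in_years is hypothetical, and the built-in Age primitive may use a different convention (for example, fractional years):

```python
from datetime import date


def age_in_years(date_of_birth, as_of):
    """Whole years elapsed between date_of_birth and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


print(age_in_years(date(1990, 6, 15), as_of=date(2021, 6, 14)))
# 30
print(age_in_years(date(1990, 6, 15), as_of=date(2021, 6, 15)))
# 31
```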
Text is a long-form string that can be of any length. It is commonly used with NLP operations, such as TF-IDF. Featuretools supports NLP operations with the nlp-primitives add-on.
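For readers unfamiliar with TF-IDF, the following is a minimal standard-library sketch of one common formulation, term frequency multiplied by log(N / document frequency). It is not the implementation used by nlp-primitives, and the tokenization (lowercasing and splitting on whitespace) is a simplifying assumption:

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


reviews = [
    "the product arrived fast",
    "the price was great",
    "the shipping was slow",
]
scores = tf_idf(reviews)
print(scores[0]["the"])
# 0.0
print(scores[0]["fast"] > scores[1]["was"])
# True
```

A term that appears in every document, such as "the", scores zero, while a term unique to one document scores highest.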
A LatLong represents an ordered pair (Latitude, Longitude) that identifies a location on Earth. The order of the tuple is important. LatLongs can be represented as a tuple of floating point numbers.
To make a LatLong column in a dataframe, do the following:
[2]:
import pandas as pd

data = pd.DataFrame()
data['latitude'] = [51.52, 9.93, 37.38]
data['longitude'] = [-0.17, 76.25, -122.08]
data['latlong'] = data[['latitude', 'longitude']].apply(tuple, axis=1)
data['latlong']

0      (51.52, -0.17)
1       (9.93, 76.25)
2    (37.38, -122.08)
Name: latlong, dtype: object
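Because the order of the pair matters, a simple range check can catch some accidentally swapped tuples. The helper validate_latlong below is hypothetical and not part of Featuretools; it only checks that latitude lies in [-90, 90] and longitude in [-180, 180]:

```python
def validate_latlong(latlong):
    """Return latlong unchanged if it is a plausible (latitude, longitude) pair."""
    latitude, longitude = latlong
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    return latlong


print(validate_latlong((37.38, -122.08)))
# (37.38, -122.08)

# Swapping the order puts -122.08 in the latitude slot, which is rejected.
try:
    validate_latlong((-122.08, 37.38))
except ValueError as err:
    print("rejected:", err)
```

A range check cannot detect every swap: (51.52, -0.17) reversed is still within range, so the column order should be fixed when the tuples are built.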
We can also get all the variable types as a DataFrame.
[3]:
from featuretools.variable_types import list_variable_types

list_variable_types()
Users can define their own variable types. For example, to make a custom variable type called Age, run the following code:
[4]:
from featuretools.variable_types import Variable


class Age(Variable):
    _default_pandas_dtype = float
The _default_pandas_dtype attribute specifies the pandas dtype to use to represent the underlying data. A list of pandas dtypes can be found in the pandas documentation.
Age can now be used as a variable type when creating an entity. For example, let’s create an entity with a column called customer_age.
[5]:
import pandas as pd
import featuretools as ft

df = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                   "customer_age": [40, 50, 10, 20, 30]})

es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="customers",
                              dataframe=df,
                              index="customer_id",
                              variable_types={'customer_age': Age})
Age can also be used as a variable type to create a custom primitive. For example, let’s create a transform primitive that returns a boolean indicating whether the age is greater than 100.
[6]:
from featuretools.variable_types import Boolean
from featuretools.primitives.base import TransformPrimitive


class AgeOver100(TransformPrimitive):
    name = "age_over_100"
    input_types = [Age]
    return_type = Boolean

    def get_function(self):
        def age_over_100(x):
            return x > 100
        return age_over_100
This primitive can now be passed to ft.dfs as one of the transform primitives. DFS will generate a feature which uses the custom primitive (AgeOver100) with the custom variable type (Age).
[7]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      trans_primitives=[AgeOver100])
feature_defs
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/v0.27.1/lib/python3.7/site-packages/featuretools/synthesis/deep_feature_synthesis.py:152: UserWarning: Only one entity in entityset, changing max_depth to 1 since deeper features cannot be created
  warnings.warn("Only one entity in entityset, changing max_depth to "
[<Feature: AGE_OVER_100(customer_age)>]
[8]:
feature_matrix.head(5)
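To see what the wrapped function computes on the sample data, it can be applied directly to the column outside of DFS. The sketch below re-declares age_over_100 as a plain function and uses only pandas; it illustrates the comparison itself and is not the feature matrix returned by ft.dfs:

```python
import pandas as pd


def age_over_100(x):
    # Same comparison as the function returned by AgeOver100.get_function().
    return x > 100


# Same ages as the customers dataframe used earlier in this guide.
customer_age = pd.Series([40, 50, 10, 20, 30], name="customer_age")

print(age_over_100(customer_age).tolist())
# [False, False, False, False, False]
```

None of the sample customers is older than 100, so the comparison is False for every row.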