NOTICE

The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:

pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip

For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.

Variable Types¶

A Variable is analogous to a column in a table in a relational database. When creating an Entity, Featuretools will attempt to infer the types of variables present. Featuretools also allows for explicitly specifying the variable types when creating the Entity.

It is important that datasets have appropriately defined variable types when using DFS because this will allow the correct primitives to be used to generate new features.

Note: When using Dask Entities, users must explicitly specify the variable types for all columns in the Entity dataframe.

To understand the different variable types in Featuretools, let’s first look at a graph of the variables:

[1]:

from featuretools.variable_types import graph_variable_types
graph_variable_types()

[1]:

[Figure: graph of the variable type hierarchy (getting_started_variables_1_0.svg)]

As we can see, there are multiple variable types, and some are subclasses of others. For example, ZIPCode is a variable type that is a child of the Categorical type, which in turn is a child of the Discrete type.

Let’s explore some of the variable types and understand them in detail.

Discrete¶

A Discrete variable type can only take certain values. It describes data that can be counted but not measured. If the data can be classified into distinct buckets, then it is a discrete variable type.

Discrete has two sub-variable types: Categorical and Ordinal. If the data has a meaningful ordering, it is of Ordinal type. If it cannot be ordered, then it is of Categorical type.

Categorical¶

A Categorical variable type can take unordered discrete values, usually from a limited and fixed set of possible values. Categorical variable types can be represented as strings or integers.

Some examples of Categorical variable types:

Gender
Eye Color
Nationality
Hair Color
Spoken Language
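
As a rough illustration of what "unordered discrete values" means at the dataframe level, the sketch below stores a column with the pandas category dtype. The column name and values are hypothetical, and this shows pandas behavior only; it does not by itself set the Featuretools variable type, which is still inferred or specified when the Entity is created.

```python
import pandas as pd

# Hypothetical eye-color column; the values are illustrative only.
eye_color = pd.Series(["brown", "blue", "green", "brown", "blue"])

# The pandas "category" dtype keeps one entry per distinct value.
eye_color_cat = eye_color.astype("category")

# The distinct values, and whether pandas treats them as ordered.
print(list(eye_color_cat.cat.categories))
print(eye_color_cat.cat.ordered)
```

Because the categories are unordered, comparisons such as "greater than" are not meaningful for this column.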

Ordinal¶

An Ordinal variable type can take ordered discrete values. As with Categorical, there is usually a limited and fixed number of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings or integers.

Some examples of Ordinal variable types:

Educational Background (Elementary, High School, Undergraduate, Graduate)
Satisfaction Rating (“Not Satisfied”, “Satisfied”, “Very Satisfied”)
Spicy Level (Hot, Hotter, Hottest)
Student Grade (A, B, C, D, F)
Size (small, medium, large)
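
To make the role of ordering concrete, here is a minimal pandas sketch using the Size example above. It assumes an ordered pandas categorical is an acceptable stand-in for ordinal data; the variable names are hypothetical, and the code does not involve Featuretools itself.

```python
import pandas as pd

# Hypothetical size column using the values from the list above.
sizes = pd.Series(["medium", "small", "large", "small"])

# Declare the order explicitly: small < medium < large.
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"],
                                 ordered=True)
ordered_sizes = sizes.astype(size_dtype)

# With an order defined, min/max and comparisons become meaningful.
print(ordered_sizes.min(), ordered_sizes.max())
print((ordered_sizes > "small").tolist())
```

Without the explicit category order, sorting these strings would fall back to alphabetical order, which would place "large" before "small".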

Categorical SubTypes (CountryCode, Id, SubRegionCode, ZIPCode)¶

There are also more distinctions within the Categorical variable type. These include CountryCode, Id, SubRegionCode, and ZIPCode.

This distinction matters because some operations apply only to certain Categorical types. For example, a custom primitive could extract the first 5 digits of a ZIPCode. That operation is not valid for all Categorical variable types, so it is appropriate to use the ZIPCode variable type.
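
The kind of ZIPCode-specific operation described above could look like the plain-Python sketch below. The function name first_five_digits is hypothetical and is not part of Featuretools; it only illustrates why such logic should not be applied to arbitrary Categorical values.

```python
def first_five_digits(zip_code: str) -> str:
    """Return the 5-digit prefix of a US ZIP or ZIP+4 code given as a string."""
    prefix = zip_code.strip()[:5]
    # Reject anything whose first five characters are not ASCII digits.
    if len(prefix) != 5 or not (prefix.isascii() and prefix.isdigit()):
        raise ValueError(f"not a US ZIP code: {zip_code!r}")
    return prefix


print(first_five_digits("02139-4307"))

# The same operation is meaningless for other categorical values.
try:
    first_five_digits("brown")
except ValueError as err:
    print(err)
```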

Datetime¶

A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings or integers. However, they should be in an interpretable format or properly cast before using DFS.

Some examples of Datetime include:

transaction time
flight departure time
pickup time
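
One way to cast string or integer columns before running DFS is pandas.to_datetime, as in the sketch below. The column names and values are hypothetical, and the integer column is assumed to hold Unix timestamps in seconds; other encodings would need a different unit or format.

```python
import pandas as pd

# Hypothetical raw data: one string column and one integer epoch column.
raw = pd.DataFrame({
    "transaction_time": ["2021-03-01 09:15:00", "2021-03-01 17:42:10"],
    "pickup_epoch": [1614590100, 1614620530],
})

# Parse strings with an explicit format so ambiguous dates are not guessed.
raw["transaction_time"] = pd.to_datetime(raw["transaction_time"],
                                         format="%Y-%m-%d %H:%M:%S")

# Interpret integers as seconds since the Unix epoch (an assumption here).
raw["pickup_time"] = pd.to_datetime(raw["pickup_epoch"], unit="s")

print(raw.dtypes)
```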

DateOfBirth¶

A more specific type of Datetime is DateOfBirth. This is an important distinction because it allows additional primitives to be applied to the data to generate new features. For example, having a DateOfBirth variable type allows the Age primitive to be applied during DFS, which leads to a new Numeric feature.
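
To show why a date of birth yields a numeric feature, the sketch below computes an age in whole years with the standard library. The helper age_in_years is hypothetical and is not the Featuretools Age primitive, which may compute age differently (for example, as a fractional number of years).

```python
from datetime import date


def age_in_years(date_of_birth: date, as_of: date) -> int:
    """Whole years elapsed between date_of_birth and as_of."""
    # Has the birthday already occurred in the as_of year?
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month,
                                                date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


print(age_in_years(date(1990, 6, 15), date(2021, 6, 14)))
print(age_in_years(date(1990, 6, 15), date(2021, 6, 15)))
```

The calculation only makes sense for dates of birth, which is why a generic Datetime column such as a transaction time should not receive it.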

Text¶

Text is a long-form string that can be of any length. It is commonly used with NLP operations, such as TF-IDF. Featuretools supports NLP operations with the nlp-primitives add-on.

LatLong¶

A LatLong represents an ordered pair (Latitude, Longitude) that identifies a location on Earth. The order of the tuple is important. LatLongs can be represented as a tuple of floating point numbers.

To make a LatLong in a dataframe, do the following:

[2]:

import pandas as pd

data = pd.DataFrame()
data['latitude'] = [51.52, 9.93, 37.38]
data['longitude'] = [-0.17, 76.25, -122.08]
data['latlong'] = data[['latitude', 'longitude']].apply(tuple, axis=1)
data['latlong']

[2]:

0      (51.52, -0.17)
1       (9.93, 76.25)
2    (37.38, -122.08)
Name: latlong, dtype: object
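
The importance of the tuple order can be seen with a distance calculation. The sketch below uses the haversine formula with a mean Earth radius of 6371 km (an approximation); the function haversine_km is a hypothetical helper, not a Featuretools API, and it reuses the first two points from the dataframe above.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) tuples in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to guard against floating point results slightly above 1.
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, h)))


point_a = (51.52, -0.17)
point_b = (9.93, 76.25)

# Correct order: (latitude, longitude).
print(haversine_km(point_a, point_b))

# Swapping each pair to (longitude, latitude) silently gives a different answer.
print(haversine_km(point_a[::-1], point_b[::-1]))
```

Nothing in the swapped call raises an error, so a consistent (Latitude, Longitude) order has to be enforced when the column is built.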

List of Variable Types¶

We can also get all the variable types as a DataFrame.

[3]:

from featuretools.variable_types import list_variable_types
list_variable_types()

[3]:

	name	type_string	description
0	Unknown	unknown	None
1	Discrete	discrete	Superclass representing variables that take on...
2	Categorical	categorical	Represents variables that can take an unordere...
3	Id	id	Represents variables that identify another entity
4	ZIPCode	zip_code	Represents a postal address in the United Stat...
5	CountryCode	country_code	Represents an ISO-3166 standard country code.\...
6	SubRegionCode	sub_region_code	Represents an ISO-3166 standard sub-region cod...
7	Ordinal	ordinal	Represents variables that take on an ordered d...
8	Boolean	boolean	Represents variables that take on one of two v...
9	Numeric	numeric	Represents variables that contain numeric valu...
10	NumericTimeIndex	numeric_time_index	Represents time index of entity that is numeric
11	Index	index	Represents variables that uniquely identify an...
12	Datetime	datetime	Represents variables that are points in time\n...
13	DatetimeTimeIndex	datetime_time_index	Represents time index of entity that is a date...
14	DateOfBirth	date_of_birth	Represents a date of birth as a datetime
15	TimeIndex	time_index	Represents time index of entity
16	Timedelta	timedelta	Represents variables that are timedeltas\n\n ...
17	NaturalLanguage	natural_language	Represents variables that are arbitary strings
18	LatLong	lat_long	Represents an ordered pair (Latitude, Longitud...
19	IPAddress	ip_address	Represents a computer network address. Represe...
20	FullName	full_name	Represents a person's full name. May consist o...
21	EmailAddress	email_address	Represents an email box to which email message...
22	URL	url	Represents a valid web url (with or without ht...
23	PhoneNumber	phone_number	Represents any valid phone number.\n Can be...
24	FilePath	file_path	Represents a valid filepath, absolute or relative

Defining Custom Variable Types¶

Users can define their own variable types. For example, to make a custom variable type called Age, run the following code:

[4]:

from featuretools.variable_types import Variable

class Age(Variable):
    _default_pandas_dtype = float

The _default_pandas_dtype specifies the pandas dtype to use to represent the underlying data. A list of pandas dtypes can be found in the pandas documentation.

Age can now be used as a variable type when creating an entity. For example, let’s create an entity with a column called customer_age.

[5]:

import pandas as pd
import featuretools as ft

df = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                   "customer_age": [40, 50, 10, 20, 30]})

es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="customers",
                              dataframe=df,
                              index="customer_id",
                              variable_types={
                                  'customer_age': Age
                              })

Age can also be used as a variable type to create a custom primitive. For example, let’s create a transform primitive that returns a boolean indicating whether the age is greater than 100.

[6]:

from featuretools.variable_types import Boolean
from featuretools.primitives.base import TransformPrimitive

class AgeOver100(TransformPrimitive):
    name = "age_over_100"
    input_types = [Age]
    return_type = Boolean

    def get_function(self):
        def age_over_100(x):
            return x > 100
        return age_over_100

This primitive can now be passed to ft.dfs as one of the transform primitives. DFS will generate a feature which uses the custom primitive (AgeOver100) with the custom variable type (Age).

[7]:

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      trans_primitives=[AgeOver100])
feature_defs

/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/v0.27.1/lib/python3.7/site-packages/featuretools/synthesis/deep_feature_synthesis.py:152: UserWarning: Only one entity in entityset, changing max_depth to 1 since deeper features cannot be created
  warnings.warn("Only one entity in entityset, changing max_depth to "

[7]:

[<Feature: AGE_OVER_100(customer_age)>]

[8]:

feature_matrix.head(5)

[8]:

	AGE_OVER_100(customer_age)
customer_id
1	False
2	False
3	False
4	False
5	False
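
Every row above is False because no customer in the example is older than 100. To see the comparison produce True values, the inner function can be tried directly on a pandas Series, as in this sketch. It assumes the primitive function receives the column as a Series, and the ages are made up for illustration.

```python
import pandas as pd


def age_over_100(x):
    # Same comparison as the function returned by AgeOver100.get_function.
    return x > 100


# Hypothetical ages, including values at and above the threshold.
ages = pd.Series([40, 101, 100, 120], name="customer_age")
result = age_over_100(ages)
print(result.tolist())
```

Note that the comparison is strict, so an age of exactly 100 is not flagged.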
