Variable Types

A Variable is analogous to a column in a table in a relational database. When creating an Entity, Featuretools will attempt to infer the types of variables present. Featuretools also allows for explicitly specifying the variable types when creating the Entity.

Datasets should have appropriately defined variable types when using Deep Feature Synthesis (DFS), because the variable types determine which primitives can be used to generate new features.

Note: When using Dask Entities, users must explicitly specify the variable types for all columns in the Entity dataframe.

To understand the different variable types in Featuretools, let’s first look at a graph of the variables:

[1]:
from featuretools.variable_types import graph_variable_types
graph_variable_types()
[1]:
../_images/getting_started_variables_1_0.svg

As we can see, there are multiple variable types, and some are subclasses of others. For example, ZIPCode is a child of the Categorical type, which is in turn a child of the Discrete type.

Let’s explore some of the variable types and understand them in detail.

Discrete

A Discrete variable type can only take certain values. It describes data that can be counted but not measured. If the data can be classified into distinct buckets, then it is a discrete variable type.

Discrete has two subtypes: Categorical and Ordinal. If the data has a meaningful ordering, it is of Ordinal type. If it cannot be ordered, it is of Categorical type.

Categorical

A Categorical variable type can take unordered discrete values. It usually has a limited, fixed number of possible values. Categorical variable types can be represented as strings or integers.

Some examples of Categorical variable types:

  • Gender

  • Eye Color

  • Nationality

  • Hair Color

  • Spoken Language

Ordinal

An Ordinal variable type can take ordered discrete values. Like Categorical, it usually has a limited, fixed number of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings or integers.

Some examples of Ordinal variable types:

  • Educational Background (Elementary, High School, Undergraduate, Graduate)

  • Satisfaction Rating (“Not Satisfied”, “Satisfied”, “Very Satisfied”)

  • Spicy Level (Hot, Hotter, Hottest)

  • Student Grade (A, B, C, D, F)

  • Size (small, medium, large)
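
As a rough illustration of why the ordering matters, the sketch below uses plain Python rather than Featuretools. The names SIZE_ORDER and SIZE_RANK are hypothetical and exist only for this example. It shows that sorting ordinal labels as ordinary strings gives alphabetical order, while an explicit rank recovers the intended order.

```python
# Hypothetical ordinal labels, listed from lowest to highest.
SIZE_ORDER = ["small", "medium", "large"]

# Map each label to its position so comparisons follow the intended order.
SIZE_RANK = {label: position for position, label in enumerate(SIZE_ORDER)}

sizes = ["large", "small", "medium", "small"]

# Plain string sorting is alphabetical and ignores the ordinal meaning.
alphabetical = sorted(sizes)

# Sorting by rank respects small < medium < large.
by_rank = sorted(sizes, key=SIZE_RANK.__getitem__)

print(alphabetical)
print(by_rank)
```

A Categorical column such as Eye Color has no such rank, so only equality comparisons between its values are meaningful.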

Categorical SubTypes (CountryCode, Id, SubRegionCode, ZIPCode)

The Categorical variable type also has more specific subtypes: CountryCode, Id, SubRegionCode, and ZIPCode.

This distinction matters because some operations apply only to certain Categorical types. For example, a custom primitive could apply to the ZIPCode variable type and extract the first 5 digits of a ZIPCode. This operation is not valid for all Categorical variable types, so it is appropriate to use the more specific ZIPCode variable type.
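
The operation described above could look roughly like the following plain-Python function. This is only a sketch: first_five_digits is a hypothetical helper, not part of Featuretools, and it assumes ZIP codes are stored as strings such as "02139-4307".

```python
def first_five_digits(zip_code):
    """Return the first five characters of a ZIP code given as a string."""
    # Keep the value as a string so leading zeros (e.g. "02139") survive.
    return str(zip_code).strip()[:5]


# Illustrative values only.
zip_codes = ["02139-4307", "60091", "94105-1804"]
shortened = [first_five_digits(z) for z in zip_codes]
print(shortened)
```

Applying the same function to a Categorical column such as Eye Color would run without error but produce meaningless output, which is why restricting the operation to ZIPCode is useful.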

Datetime

A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings or integers. However, they should be in an interpretable format or be properly cast before using DFS.

Some examples of Datetime include:

  • transaction time

  • flight departure time

  • pickup time
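
One possible way to cast such columns before using DFS is with pandas, as sketched below. The column names and values are made up for illustration, and the integer column is assumed to hold Unix timestamps in seconds.

```python
import pandas as pd

# Hypothetical raw data: one datetime stored as strings, one as integer seconds.
raw = pd.DataFrame({
    "transaction_time": ["2019-01-05 10:30:00", "2019-02-11 08:15:00"],
    "pickup_epoch": [1546684200, 1549872900],
})

# Parse the strings with an explicit format so nothing is guessed.
raw["transaction_time"] = pd.to_datetime(
    raw["transaction_time"], format="%Y-%m-%d %H:%M:%S"
)

# Interpret the integers as seconds since the Unix epoch.
raw["pickup_time"] = pd.to_datetime(raw["pickup_epoch"], unit="s")

print(raw.dtypes)
```

After the cast, both columns have a pandas datetime dtype instead of object or integer.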

DateOfBirth

A more specific type of Datetime is DateOfBirth. This is an important distinction because it allows additional primitives to be applied to the data to generate new features. For example, a DateOfBirth variable type allows the Age primitive to be applied during DFS, producing a new Numeric feature.
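
To make the idea concrete, the sketch below computes an age in whole years from a date of birth using only the standard library. It is not the implementation of the Age primitive. The helper age_in_years is hypothetical, and the library primitive may compute age differently (for example, as a fractional number of years).

```python
from datetime import date


def age_in_years(date_of_birth, as_of):
    """Return the number of completed years between two dates."""
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Illustrative values only.
day_before_birthday = age_in_years(date(1990, 6, 15), date(2020, 6, 14))
on_birthday = age_in_years(date(1990, 6, 15), date(2020, 6, 15))
print(day_before_birthday, on_birthday)
```

The same calculation would be meaningless for an arbitrary Datetime such as a transaction time, which is why the more specific DateOfBirth type is useful.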

Text

Text is a long-form string that can be of any length. It is commonly used with NLP operations, such as TF-IDF. Featuretools supports NLP operations with the nlp-primitives add-on.

LatLong

A LatLong represents an ordered pair (Latitude, Longitude) that identifies a location on Earth. The order of the tuple is important. LatLongs can be represented as a tuple of floating point numbers.

To make a LatLong in a dataframe, do the following:

[2]:
import pandas as pd

data = pd.DataFrame()
data['latitude'] = [51.52, 9.93, 37.38]
data['longitude'] = [-0.17, 76.25, -122.08]
data['latlong'] = data[['latitude', 'longitude']].apply(tuple, axis=1)
data['latlong']
[2]:
0      (51.52, -0.17)
1       (9.93, 76.25)
2    (37.38, -122.08)
Name: latlong, dtype: object

List of Variable Types

We can also get all the variable types as a DataFrame.

[3]:
from featuretools.variable_types import list_variable_types
list_variable_types()
[3]:
    name               type_string          description
0   Unknown            unknown              None
1   Discrete           discrete             Superclass representing variables that take on...
2   Categorical        categorical          Represents variables that can take an unordere...
3   Id                 id                   Represents variables that identify another entity
4   ZIPCode            zip_code             Represents a postal address in the United Stat...
5   CountryCode        country_code         Represents an ISO-3166 standard country code.\...
6   SubRegionCode      sub_region_code      Represents an ISO-3166 standard sub-region cod...
7   Ordinal            ordinal              Represents variables that take on an ordered d...
8   Boolean            boolean              Represents variables that take on one of two v...
9   Numeric            numeric              Represents variables that contain numeric valu...
10  NumericTimeIndex   numeric_time_index   Represents time index of entity that is numeric
11  Index              index                Represents variables that uniquely identify an...
12  Datetime           datetime             Represents variables that are points in time\n...
13  DatetimeTimeIndex  datetime_time_index  Represents time index of entity that is a date...
14  DateOfBirth        date_of_birth        Represents a date of birth as a datetime
15  TimeIndex          time_index           Represents time index of entity
16  Timedelta          timedelta            Represents variables that are timedeltas\n\n ...
17  NaturalLanguage    natural_language     Represents variables that are arbitary strings
18  LatLong            lat_long             Represents an ordered pair (Latitude, Longitud...
19  IPAddress          ip_address           Represents a computer network address. Represe...
20  FullName           full_name            Represents a person's full name. May consist o...
21  EmailAddress       email_address        Represents an email box to which email message...
22  URL                url                  Represents a valid web url (with or without ht...
23  PhoneNumber        phone_number         Represents any valid phone number.\n Can be...
24  FilePath           file_path            Represents a valid filepath, absolute or relative

Defining Custom Variable Types

Users can define their own variable types. For example, to make a custom variable type called Age, run the following code:

[4]:
from featuretools.variable_types import Variable

class Age(Variable):
    _default_pandas_dtype = float

The _default_pandas_dtype specifies the pandas dtype to use to represent the underlying data. A list of pandas dtypes can be found in the pandas documentation.

Age can now be used as a variable type when creating an entity. For example, let’s create an entity with a column called customer_age.

[5]:
import pandas as pd
import featuretools as ft

df = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                   "customer_age": [40, 50, 10, 20, 30]})

es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="customers",
                              dataframe=df,
                              index="customer_id",
                              variable_types={
                                  'customer_age': Age
                              })

Age can also be used as a variable type to create a custom primitive. For example, let’s create a transform primitive that returns a boolean indicating whether the age is greater than 100.

[6]:
from featuretools.variable_types import Boolean
from featuretools.primitives.base import TransformPrimitive

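class AgeOver100(TransformPrimitive):
    name = "age_over_100"
    input_types = [Age]    # applies only to variables of the custom Age type
    return_type = Boolean  # the generated feature is a Boolean

    def get_function(self):
        # Return the function that is applied to the input column.
        def age_over_100(x):
            return x > 100
        return age_over_100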

This primitive can now be passed to ft.dfs as one of the transform primitives. DFS will generate a feature which uses the custom primitive (AgeOver100) with the custom variable type (Age).

[7]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      trans_primitives=[AgeOver100])
feature_defs
[7]:
[<Feature: AGE_OVER_100(customer_age)>]
[8]:
feature_matrix.head(5)
[8]:
             AGE_OVER_100(customer_age)
customer_id
1                                 False
2                                 False
3                                 False
4                                 False
5                                 False