Feature primitives#
Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because primitives only constrain the input and output data types, they can be applied across datasets and stacked to create new calculations.
Why primitives?#
The space of potential functions that humans use to create a feature is expansive. By breaking common feature engineering calculations down into primitive components, we are able to capture the underlying structure of the features humans create today.
Because a primitive only constrains the input and output data types, it can be used to transfer a calculation known in one domain to another. Consider a feature that data scientists often calculate for transactional or event log data: the average time between events. This feature is incredibly valuable in predicting fraudulent behavior or future customer engagement.
DFS achieves the same feature by stacking two primitives: "time_since_previous" and "mean".
[1]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean"],
    trans_primitives=["time_since_previous"],
    features_only=True,
)
feature_defs
[1]:
[<Feature: zip_code>,
<Feature: MEAN(transactions.amount)>,
<Feature: TIME_SINCE_PREVIOUS(join_date)>,
<Feature: MEAN(sessions.MEAN(transactions.amount))>,
<Feature: MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))>]
Note
The primitive arguments to DFS (e.g., agg_primitives and trans_primitives in the example above) accept snake_case, camelCase, or TitleCase strings of included Featuretools primitives (i.e., time_since_previous, timeSincePrevious, and TimeSincePrevious are all acceptable inputs).
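For example, all three spellings below request the same primitive (an illustrative sketch, not part of the original notebook):
# Each casing resolves to the included TimeSincePrevious primitive
for primitive_name in ["time_since_previous", "timeSincePrevious", "TimeSincePrevious"]:
    ft.dfs(
        entityset=es,
        target_dataframe_name="customers",
        agg_primitives=[],
        trans_primitives=[primitive_name],
        features_only=True,
    )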
Note
When dfs is called with features_only=True, only feature definitions are returned as output. By default this parameter is set to False. Setting it to True is a quick way to inspect the feature definitions before spending time calculating the feature matrix.
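Once the definitions look right, the feature matrix can be computed from them later. A minimal sketch using calculate_feature_matrix with the entityset and definitions from above:
# Turn the previously returned feature definitions into an actual feature matrix
feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es)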
A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. Deep Feature Synthesis uses this below to compute several different summaries of the time since the previous event.
[2]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "max", "min", "std", "skew"],
    trans_primitives=["time_since_previous"],
)
feature_matrix[
    [
        "MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "MAX(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "MIN(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "STD(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))",
    ]
]
[2]:
| customer_id | MEAN(sessions.TIME_SINCE_PREVIOUS(session_start)) | MAX(sessions.TIME_SINCE_PREVIOUS(session_start)) | MIN(sessions.TIME_SINCE_PREVIOUS(session_start)) | STD(sessions.TIME_SINCE_PREVIOUS(session_start)) | SKEW(sessions.TIME_SINCE_PREVIOUS(session_start)) |
|---|---|---|---|---|---|
| 5 | 1007.500000 | 1170.0 | 715.0 | 157.884451 | -1.507217 |
| 4 | 999.375000 | 1625.0 | 650.0 | 308.688904 | 1.065177 |
| 1 | 966.875000 | 1170.0 | 715.0 | 171.754341 | -0.254557 |
| 3 | 888.333333 | 1170.0 | 650.0 | 177.613813 | 0.434581 |
| 2 | 725.833333 | 975.0 | 520.0 | 194.638554 | 0.162631 |
Aggregation vs Transform Primitives#
In the example above, we use two types of primitives.
Aggregation primitives: These primitives take related instances as an input and output a single value. They are applied across a parent-child relationship in an EntitySet, e.g. "count", "sum", "avg_time_between".
![digraph "COUNT(sessions)" {
graph [bb="0,0,658,119",
rankdir=LR
];
node [label="\N",
shape=box
];
edge [arrowhead=none,
dir=forward,
style=dotted
];
customers [height=1.1389,
label=<
<TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="10">
<TR>
<TD colspan="1" bgcolor="#A9A9A9"><B>★ customers (target)</B></TD>
</TR>
<TR>
<TD ALIGN="LEFT" port="COUNT(sessions)" BGCOLOR="#D9EAD3">COUNT(sessions)</TD>
</TR>
</TABLE>>,
pos="574,59.5",
shape=plaintext,
width=2.3333];
sessions [height=1.6528,
label=<
<TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="10">
<TR>
<TD colspan="1" bgcolor="#A9A9A9"><B>sessions</B></TD>
</TR><TR><TD ALIGN="LEFT" port="session_id">session_id (index)</TD></TR>
<TR><TD ALIGN="LEFT" port="customer_id">customer_id</TD></TR>
</TABLE>>,
pos="69,59.5",
shape=plaintext,
width=1.9167];
"COUNT(sessions)_groupby_sessions--customer_id" [height=0.52778,
label="group by
customer_id",
pos="216,40.5",
width=1.1667];
sessions:session_id -> "COUNT(sessions)_groupby_sessions--customer_id" [arrowhead="",
pos="e,173.93,52.903 130,58.5 141.09,58.5 152.83,57.047 163.94,54.967",
style=solid];
sessions:customer_id -> "COUNT(sessions)_groupby_sessions--customer_id" [pos="130,21.5 144.53,21.5 160.11,24.117 173.98,27.408"];
"0_COUNT(sessions)_count" [height=0.94444,
label=<<FONT POINT-SIZE="12"><B>Aggregation</B><BR></BR></FONT>COUNT>,
pos="374,40.5",
shape=diamond,
width=2.2222];
"0_COUNT(sessions)_count" -> customers:"COUNT(sessions)" [arrowhead="",
pos="e,498,40.5 454.04,40.5 465.24,40.5 476.73,40.5 487.88,40.5",
style=solid];
"COUNT(sessions)_groupby_sessions--customer_id" -> "0_COUNT(sessions)_count" [arrowhead="",
pos="e,293.71,40.5 258.34,40.5 266.23,40.5 274.74,40.5 283.44,40.5",
style=solid];
}](../_images/graphviz-b16eabe4138463322f14255864360596a8ad2ffb.png)
Transform primitives: These primitives take one or more columns from a dataframe as an input and output a new column for that dataframe. They are applied to a single dataframe, e.g. "hour", "time_since_previous", "absolute".
![digraph "TIME_SINCE_PREVIOUS(join_date)" {
graph [bb="0,0,624,119",
rankdir=LR
];
node [label="\N",
shape=box
];
edge [arrowhead=none,
dir=forward,
style=dotted
];
customers [height=1.6528,
label=<
<TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="10">
<TR>
<TD colspan="1" bgcolor="#A9A9A9"><B>★ customers (target)</B></TD>
</TR><TR><TD ALIGN="LEFT" port="join_date">join_date</TD></TR>
<TR>
<TD ALIGN="LEFT" port="TIME_SINCE_PREVIOUS(join_date)" BGCOLOR="#D9EAD3">TIME_SINCE_PREVIOUS(join_date)</TD>
</TR>
</TABLE>>,
pos="125,59.5",
shape=plaintext,
width=3.4722];
"0_TIME_SINCE_PREVIOUS(join_date)_time_since_previous" [height=0.94444,
label=<<FONT POINT-SIZE="12"><B>Transform</B><BR></BR></FONT>TIME_SINCE_PREVIOUS>,
pos="455,40.5",
shape=diamond,
width=4.6944];
customers:join_date -> "0_TIME_SINCE_PREVIOUS(join_date)_time_since_previous" [arrowhead="",
pos="e,348.16,53.075 242,58.5 273.23,58.5 306.94,56.555 338.12,53.944",
style=solid];
"0_TIME_SINCE_PREVIOUS(join_date)_time_since_previous" -> customers:"TIME_SINCE_PREVIOUS(join_date)" [arrowhead="",
pos="e,242,21.5 350.62,27.46 319.34,24.455 284.66,22.002 252.05,21.568",
style=solid];
}](../_images/graphviz-cff844357207f247a9b51037b1a917d0d223be64.png)
The above graphs were generated using the graph_feature function. These feature lineage graphs visualize how primitives were stacked to generate a feature.
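As a rough sketch (assuming the optional graphviz dependency is installed), a lineage graph can be rendered for any feature definition returned by dfs:
# Visualize how primitives were stacked to build one of the features defined earlier
feature = feature_defs[0]  # any feature definition from the dfs output above
ft.graph_feature(feature)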
For a DataFrame that lists and describes each built-in primitive in Featuretools, call ft.list_primitives().
[3]:
ft.list_primitives().head(5)
[3]:
| | name | type | dask_compatible | spark_compatible | description | valid_inputs | return_type |
|---|---|---|---|---|---|---|---|
| 0 | num_consecutive_greater_mean | aggregation | False | False | Determines the length of the longest subsequen... | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Logical Type = IntegerNullable)... |
| 1 | entropy | aggregation | False | False | Calculates the entropy for a categorical column | <ColumnSchema (Semantic Tags = ['category'])> | <ColumnSchema (Semantic Tags = ['numeric'])> |
| 2 | max_consecutive_zeros | aggregation | False | False | Determines the maximum number of consecutive z... | <ColumnSchema (Logical Type = Double)>, <Colum... | <ColumnSchema (Logical Type = Integer) (Semant... |
| 3 | first | aggregation | False | False | Determines the first value in a list. | <ColumnSchema> | None |
| 4 | mean | aggregation | True | True | Computes the average for a list of values. | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Semantic Tags = ['numeric'])> |
For a DataFrame of metrics that summarizes various properties and capabilities of all of the built-in primitives in Featuretools, call ft.summarize_primitives().
[4]:
ft.summarize_primitives()
[4]:
| | Metric | Count |
|---|---|---|
| 0 | total_primitives | 177 |
| 1 | aggregation_primitives | 42 |
| 2 | transform_primitives | 135 |
| 3 | unique_input_types | 19 |
| 4 | unique_output_types | 20 |
| 5 | uses_multi_input | 46 |
| 6 | uses_multi_output | 3 |
| 7 | uses_external_data | 1 |
| 8 | are_controllable | 76 |
| 9 | uses_address_input | 0 |
| 10 | uses_age_input | 0 |
| 11 | uses_age_fractional_input | 0 |
| 12 | uses_age_nullable_input | 0 |
| 13 | uses_boolean_input | 14 |
| 14 | uses_boolean_nullable_input | 12 |
| 15 | uses_categorical_input | 0 |
| 16 | uses_country_code_input | 0 |
| 17 | uses_currency_code_input | 0 |
| 18 | uses_datetime_input | 57 |
| 19 | uses_double_input | 3 |
| 20 | uses_email_address_input | 2 |
| 21 | uses_filepath_input | 0 |
| 22 | uses_ip_address_input | 0 |
| 23 | uses_integer_input | 3 |
| 24 | uses_integer_nullable_input | 0 |
| 25 | uses_lat_long_input | 6 |
| 26 | uses_natural_language_input | 24 |
| 27 | uses_ordinal_input | 4 |
| 28 | uses_person_full_name_input | 0 |
| 29 | uses_phone_number_input | 0 |
| 30 | uses_postal_code_input | 2 |
| 31 | uses_sub_region_code_input | 0 |
| 32 | uses_timedelta_input | 0 |
| 33 | uses_url_input | 3 |
| 34 | uses_unknown_input | 0 |
| 35 | uses_numeric_tag_input | 75 |
| 36 | uses_category_tag_input | 6 |
| 37 | uses_index_tag_input | 1 |
| 38 | uses_time_index_tag_input | 25 |
| 39 | uses_date_of_birth_tag_input | 1 |
| 40 | uses_ignore_tag_input | 0 |
| 41 | uses_passthrough_tag_input | 0 |
| 42 | uses_foreign_key_tag_input | 1 |
Defining Custom Primitives#
The library of primitives in Featuretools is constantly expanding. Users can define their own primitives using the APIs below. To define a primitive, a user will:

- Specify the type of primitive, Aggregation or Transform
- Define the input and output data types
- Write a function in Python to do the calculation
- Annotate with attributes to constrain how it is applied

Once a primitive is defined, it can stack with existing primitives to generate complex patterns. This enables primitives known to be important for one domain to be automatically transferred to another.
[5]:
from featuretools.primitives import AggregationPrimitive, TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, NaturalLanguage
import pandas as pd
Simple Custom Primitives#
[6]:
class Absolute(TransformPrimitive):
    name = "absolute"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def absolute(column):
            return abs(column)

        return absolute
Above, we created a new transform primitive that can be used with Deep Feature Synthesis by deriving a new primitive class using TransformPrimitive as a base and overriding get_function to return a function that calculates the feature. Additionally, we set the input data types that the primitive applies to and the return data type. Input and return data types are defined using a Woodwork ColumnSchema. A full guide on Woodwork logical types and semantic tags can be found in the Woodwork Understanding Logical Types and Semantic Tags guide.

Similarly, we can make a new aggregation primitive using AggregationPrimitive.
[7]:
class Maximum(AggregationPrimitive):
    name = "maximum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def maximum(column):
            return max(column)

        return maximum
Because we defined an aggregation primitive, the function takes in a list of values but only returns one.
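To see this, the function can be pulled out with get_function and called directly (a hypothetical check, not part of the original notebook):
# The aggregation's function reduces a whole Series to a single scalar
maximum = Maximum().get_function()
maximum(pd.Series([12.5, 3.0, 47.1]))  # returns 47.1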
Now that we’ve defined two primitives, we can use them with the dfs function as if they were built-in primitives.
[8]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[Maximum],
    trans_primitives=[Absolute],
    max_depth=2,
)
feature_matrix.head(5)[
    [
        "customers.MAXIMUM(transactions.amount)",
        "MAXIMUM(transactions.ABSOLUTE(amount))",
    ]
]
[8]:
| session_id | customers.MAXIMUM(transactions.amount) | MAXIMUM(transactions.ABSOLUTE(amount)) |
|---|---|---|
| 1 | 146.81 | 141.66 |
| 2 | 149.02 | 135.25 |
| 3 | 149.95 | 147.73 |
| 4 | 139.43 | 129.00 |
| 5 | 149.95 | 139.20 |
Word Count Example#
Here we define a transform primitive, WordCount, which counts the number of words in each row of an input and returns a list of the counts.
[9]:
class WordCount(TransformPrimitive):
    """
    Counts the number of words in each row of the column. Returns a list
    of the counts for each row.
    """

    name = "word_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def word_count(column):
            word_counts = []
            for value in column:
                words = value.split(None)
                word_counts.append(len(words))
            return word_counts

        return word_count
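As a quick, hypothetical check of the behavior described above, the returned function can be called directly on a small pandas Series:
# word_count returns one count per row of the input column
word_count = WordCount().get_function()
word_count(pd.Series(["the quick brown fox", "hello world"]))  # -> [4, 2]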
[10]:
es = make_ecommerce_entityset()
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["sum", "mean", "std"],
    trans_primitives=[WordCount],
)
feature_matrix[
    [
        "customers.WORD_COUNT(favorite_quote)",
        "STD(log.WORD_COUNT(comments))",
        "SUM(log.WORD_COUNT(comments))",
        "MEAN(log.WORD_COUNT(comments))",
    ]
]
[10]:
| id | customers.WORD_COUNT(favorite_quote) | STD(log.WORD_COUNT(comments)) | SUM(log.WORD_COUNT(comments)) | MEAN(log.WORD_COUNT(comments)) |
|---|---|---|---|---|
| 0 | 9.0 | 540.436860 | 2500.0 | 500.0 |
| 1 | 9.0 | 583.702550 | 1732.0 | 433.0 |
| 2 | 9.0 | NaN | 246.0 | 246.0 |
| 3 | 6.0 | 883.883476 | 1256.0 | 628.0 |
| 4 | 6.0 | 0.000000 | 9.0 | 3.0 |
| 5 | 12.0 | 19.798990 | 68.0 | 34.0 |
By adding some aggregation primitives as well, Deep Feature Synthesis was able to make four new features from one new primitive.
Multiple Input Types#
If a primitive requires multiple features as input, input_types has multiple elements; e.g., [ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(semantic_tags={'numeric'})] would mean the primitive requires two columns with the semantic tag numeric as input. Below is an example of a primitive that takes multiple input features.
[11]:
class MeanSunday(AggregationPrimitive):
    """
    Finds the mean of non-null values of a feature that occurred on Sundays
    """

    name = "mean_sunday"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(logical_type=Datetime),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def mean_sunday(numeric, datetime):
            days = pd.DatetimeIndex(datetime).weekday.values
            df = pd.DataFrame({"numeric": numeric, "time": days})
            return df[df["time"] == 6]["numeric"].mean()

        return mean_sunday
[12]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[MeanSunday],
    trans_primitives=[],
    max_depth=1,
)
feature_matrix[
    [
        "MEAN_SUNDAY(log.value, datetime)",
        "MEAN_SUNDAY(log.value_2, datetime)",
    ]
]
[12]:
| id | MEAN_SUNDAY(log.value, datetime) | MEAN_SUNDAY(log.value_2, datetime) |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | 2.5 | 1.0 |
| 4 | 7.0 | 3.0 |
| 5 | NaN | NaN |