Advanced Custom Primitives Guide
Functions With Additional Arguments
One caveat with the make_primitive functions is that every required argument of function must be an input feature. Here we create the function for StringCount, a primitive that counts the number of occurrences of a string in a Text input. Since string is not a feature, it must be a keyword argument to string_count.
In [1]: def string_count(column, string=None):
   ...:     '''Count the number of times the value string occurs'''
   ...:     assert string is not None, "string to count needs to be defined"
   ...:     counts = [element.lower().count(string) for element in column]
   ...:     return counts
   ...:
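As a quick sanity check outside of featuretools, the function can be called directly on a plain list of strings. The sample sentences below are made up for illustration. The sketch also shows that only the column values are lowercased, so the search string itself should be given in lowercase.

```python
def string_count(column, string=None):
    '''Count the number of times the value string occurs'''
    assert string is not None, "string to count needs to be defined"
    counts = [element.lower().count(string) for element in column]
    return counts


# Hypothetical sample data, not taken from the ecommerce entityset.
sample = ["The cat and the hat", "Nothing to see here"]

lower_counts = string_count(sample, string="the")
# Only the elements are lowercased, so an uppercase search string never matches.
upper_counts = string_count(sample, string="The")

print(lower_counts)  # [2, 0]
print(upper_counts)  # [0, 0]
```

Calling string_count(sample) without the keyword argument trips the assertion, which is why the primitive has to be initialized with string set.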
So that the names of features defined with this primitive reflect which string is being counted, we define a custom generate_name function.
In [2]: def string_count_generate_name(self, base_feature_names):
   ...:     return u'STRING_COUNT(%s, "%s")' % (base_feature_names[0], self.kwargs['string'])
   ...:
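To preview the name this function produces without building an entityset, it can be exercised with a stand-in object. The SimpleNamespace below is a hypothetical substitute for a primitive instance. It assumes only that the keyword arguments are available as self.kwargs, which is what the function above relies on.

```python
from types import SimpleNamespace


def string_count_generate_name(self, base_feature_names):
    return u'STRING_COUNT(%s, "%s")' % (base_feature_names[0], self.kwargs['string'])


# Hypothetical stand-in for a StringCount instance created with string="the".
fake_primitive = SimpleNamespace(kwargs={"string": "the"})

name = string_count_generate_name(fake_primitive, ["comments"])
print(name)  # STRING_COUNT(comments, "the")
```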
Now that we have the function, we create the primitive using the make_trans_primitive function.
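In [3]: from featuretools.primitives import make_trans_primitive
   ...: from featuretools.variable_types import Text, Numeric
   ...:
   ...: StringCount = make_trans_primitive(function=string_count,
   ...:                                    input_types=[Text],
   ...:                                    return_type=Numeric,
   ...:                                    cls_attributes={"generate_name": string_count_generate_name})
   ...: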
Passing string="test" as a keyword argument when initializing the StringCount primitive makes "test" the value of string whenever string_count is called to calculate the feature values. Now we use this primitive, initialized with string="the", to define features and calculate the feature values.
In [4]: import featuretools as ft
   ...: from featuretools.tests.testing_utils import make_ecommerce_entityset
   ...:
In [5]: es = make_ecommerce_entityset()
In [6]: feature_matrix, features = ft.dfs(entityset=es,
   ...:                                   target_entity="sessions",
   ...:                                   agg_primitives=["sum", "mean", "std"],
   ...:                                   trans_primitives=[StringCount(string="the")])
   ...:
In [7]: feature_matrix.columns
Out[7]: Index(['device_name', 'customer_id', 'device_type', 'SUM(log.value_2)', 'SUM(log.value_many_nans)', 'SUM(log.value)', 'MEAN(log.value_2)', 'MEAN(log.value_many_nans)', 'MEAN(log.value)', 'STD(log.value_2)', 'STD(log.value_many_nans)', 'STD(log.value)', 'customers.cohort', 'customers.age', 'customers.région_id', 'customers.loves_ice_cream', 'customers.cancel_reason', 'customers.engagement_level', 'SUM(log.products.rating)', 'SUM(log.STRING_COUNT(comments, "the"))', 'MEAN(log.products.rating)', 'MEAN(log.STRING_COUNT(comments, "the"))', 'STD(log.products.rating)', 'STD(log.STRING_COUNT(comments, "the"))', 'customers.SUM(log.value_many_nans)', 'customers.SUM(log.value_2)', 'customers.SUM(log.value)', 'customers.MEAN(log.value_many_nans)', 'customers.MEAN(log.value_2)', 'customers.MEAN(log.value)', 'customers.STD(log.value_many_nans)', 'customers.STD(log.value_2)', 'customers.STD(log.value)', 'customers.STRING_COUNT(favorite_quote, "the")', 'customers.cohorts.cohort_name', 'customers.régions.language'], dtype='object')
In [8]: feature_matrix[['STD(log.STRING_COUNT(comments, "the"))', 'SUM(log.STRING_COUNT(comments, "the"))', 'MEAN(log.STRING_COUNT(comments, "the"))']]
Out[8]:
    STD(log.STRING_COUNT(comments, "the"))  SUM(log.STRING_COUNT(comments, "the"))  MEAN(log.STRING_COUNT(comments, "the"))
id
0                                47.124304                                     209                                    41.80
1                                36.509131                                     109                                    27.25
2                                      NaN                                      29                                    29.00
3                                49.497475                                      70                                    35.00
4                                 0.000000                                       0                                     0.00
5                                 1.414214                                       4                                     2.00
Features with Multiple Outputs
With the make_primitive functions, a single feature can output multiple columns. To do that, the function must return a list of arrays/series, where each item in the list corresponds to one output of the primitive. Each of these list items (either an array or a series) must contain one element for each input element.
Take, for example, a primitive called case_count. For each given string, this primitive outputs the number of uppercase letters and the number of lowercase letters. It must therefore return a list with 2 elements: the first holds the uppercase counts and the second holds the lowercase counts. Each element in the list is a series/array with the same number of elements as there are input strings. Below you can see this example in action, as well as the proper way to specify multiple outputs in the make_trans_primitive function.
In [9]: import re
   ...: import numpy as np
   ...:
   ...: def case_count(array):
   ...:     '''Return the count of upper case and lower case letters in text'''
   ...:     upper = np.array([len(re.findall('[A-Z]', i)) for i in array])
   ...:     lower = np.array([len(re.findall('[a-z]', i)) for i in array])
   ...:     ret = [upper, lower]
   ...:     return ret
   ...:
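The required output layout can be checked on a couple of strings before wiring the function into a primitive. This is a sketch that uses plain Python lists in place of NumPy arrays so that it runs with the standard library alone. The function name case_count_lists, the helper check_output_shape, and the sample strings are illustrative only and are not part of featuretools.

```python
import re


def case_count_lists(values):
    '''Return [upper_counts, lower_counts], one count per input string.'''
    upper = [len(re.findall('[A-Z]', v)) for v in values]
    lower = [len(re.findall('[a-z]', v)) for v in values]
    return [upper, lower]


def check_output_shape(outputs, values, n_outputs):
    '''True if there are n_outputs sequences, each as long as the input.'''
    return len(outputs) == n_outputs and all(len(o) == len(values) for o in outputs)


# Hypothetical sample strings for illustration.
sample = ["Hello World", "abc"]
result = case_count_lists(sample)

print(result)  # [[2, 0], [8, 3]]
print(check_output_shape(result, sample, n_outputs=2))  # True
```

The outer list has one entry per output (uppercase first, then lowercase), and each inner list has one entry per input string.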
We must use the number_output_features argument to specify the number of outputs when creating the primitive with the make_trans_primitive function.
In [10]: CaseCount = make_trans_primitive(function=case_count,
   ....:                                  input_types=[Text],
   ....:                                  return_type=Numeric,
   ....:                                  number_output_features=2)
   ....:
In [11]: es = make_ecommerce_entityset()
When we call dfs on this entityset, the two outputs of our primitive appear as separate columns in the feature matrix, suffixed [0] (uppercase count) and [1] (lowercase count). Each column holds 6 values, one for each row of the feature matrix.
In [12]: feature_matrix, features = ft.dfs(entityset=es,
   ....:                                   target_entity="sessions",
   ....:                                   agg_primitives=[],
   ....:                                   trans_primitives=[CaseCount])
   ....:
In [13]: feature_matrix.columns
Out[13]: Index(['device_name', 'customer_id', 'device_type', 'customers.cohort', 'customers.age', 'customers.région_id', 'customers.loves_ice_cream', 'customers.cancel_reason', 'customers.engagement_level', 'customers.cohorts.cohort_name', 'customers.régions.language', 'customers.CASE_COUNT(favorite_quote)[0]', 'customers.CASE_COUNT(favorite_quote)[1]'], dtype='object')
In [14]: feature_matrix[['customers.CASE_COUNT(favorite_quote)[0]', 'customers.CASE_COUNT(favorite_quote)[1]']]
Out[14]:
    customers.CASE_COUNT(favorite_quote)[0]  customers.CASE_COUNT(favorite_quote)[1]
id
0                                         1                                       44
1                                         1                                       44
2                                         1                                       44
3                                         1                                       41
4                                         1                                       41
5                                         1                                       57