Advanced Custom Primitives Guide

Functions with Additional Arguments

One caveat with the make_primitive functions is that the required arguments of the function must be input features. Here we create a function for StringCount, a primitive that counts the number of occurrences of a string in a Text input. Since string is not a feature, it needs to be a keyword argument to string_count.

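In [1]: # Imports used throughout this guide (module paths as in pre-1.0 Featuretools)
   ...: import featuretools as ft
   ...: from featuretools.primitives import make_trans_primitive
   ...: from featuretools.variable_types import Numeric, Text
   ...: 
   ...: def string_count(column, string=None):
   ...:     '''Count the number of times the value string occurs'''
   ...:     assert string is not None, "string to count needs to be defined"
   ...:     counts = [element.lower().count(string) for element in column]
   ...:     return counts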
   ...: 
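
Because string_count is an ordinary Python function, we can try it on a plain list before wrapping it in a primitive. The sketch below repeats the function so it runs on its own; the sample comments are hypothetical and are not taken from the ecommerce entityset. It also illustrates one consequence of the implementation: each element is lower-cased but string is not, so the search term should be given in lower case.

```python
def string_count(column, string=None):
    """Count occurrences of `string` in each lower-cased element of `column`."""
    assert string is not None, "string to count needs to be defined"
    return [element.lower().count(string) for element in column]


# Hypothetical sample data, used only for illustration.
comments = ["The cat and the hat", "Nothing here", "THE END"]

# A lower-case search term matches regardless of the case of the input.
lower_counts = string_count(comments, string="the")
print(lower_counts)  # [2, 0, 1]

# The elements are lower-cased but the search term is not, so a
# capitalised term can never match.
upper_counts = string_count(comments, string="The")
print(upper_counts)  # [0, 0, 0]
```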

So that the names of features built with this primitive show which string is being counted, we define a custom generate_name function.

In [2]: def string_count_generate_name(self, base_feature_names):
   ...:     return u'STRING_COUNT(%s, "%s")' % (base_feature_names[0], self.kwargs['string'])
   ...: 
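
To see what name this function produces, we can call it with a stand-in object. In the minimal sketch below, fake_primitive is a hypothetical placeholder that carries only the kwargs attribute the function reads; it is not a real Featuretools primitive instance.

```python
from types import SimpleNamespace


def string_count_generate_name(self, base_feature_names):
    return 'STRING_COUNT(%s, "%s")' % (base_feature_names[0], self.kwargs['string'])


# Hypothetical stand-in for a primitive instance: it only provides
# the `kwargs` attribute that the naming function looks up.
fake_primitive = SimpleNamespace(kwargs={"string": "the"})

name = string_count_generate_name(fake_primitive, ["comments"])
print(name)  # STRING_COUNT(comments, "the")
```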

Now that we have the function, we create the primitive using the make_trans_primitive function.

In [3]: StringCount = make_trans_primitive(function=string_count,
   ...:                                    input_types=[Text],
   ...:                                    return_type=Numeric,
   ...:                                    cls_attributes={"generate_name": string_count_generate_name})
   ...: 
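
As a rough mental model only (this is an analogy, not a description of how Featuretools is implemented), supplying a keyword argument when a primitive is initialized behaves much like binding that argument ahead of time with functools.partial. The sample strings below are hypothetical.

```python
from functools import partial


def string_count(column, string=None):
    """Count occurrences of `string` in each lower-cased element of `column`."""
    assert string is not None, "string to count needs to be defined"
    return [element.lower().count(string) for element in column]


# Bind the keyword argument once; later calls only supply the column.
count_test = partial(string_count, string="test")

result = count_test(["A test of the test", "no match"])
print(result)  # [2, 0]
```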

Passing string="test" as a keyword argument when initializing the StringCount primitive makes "test" the value of string whenever string_count is called to calculate feature values. Now we use this primitive to define features and calculate the feature values.

In [4]: from featuretools.tests.testing_utils import make_ecommerce_entityset

In [5]: es = make_ecommerce_entityset()

In [6]: feature_matrix, features = ft.dfs(entityset=es,
   ...:                                   target_entity="sessions",
   ...:                                   agg_primitives=["sum", "mean", "std"],
   ...:                                   trans_primitives=[StringCount(string="the")])
   ...: 

In [7]: feature_matrix.columns
Out[7]: Index(['device_name', 'customer_id', 'device_type', 'SUM(log.value_2)', 'SUM(log.value_many_nans)', 'SUM(log.value)', 'MEAN(log.value_2)', 'MEAN(log.value_many_nans)', 'MEAN(log.value)', 'STD(log.value_2)', 'STD(log.value_many_nans)', 'STD(log.value)', 'customers.cohort', 'customers.age', 'customers.région_id', 'customers.loves_ice_cream', 'customers.cancel_reason', 'customers.engagement_level', 'SUM(log.products.rating)', 'SUM(log.STRING_COUNT(comments, "the"))', 'MEAN(log.products.rating)', 'MEAN(log.STRING_COUNT(comments, "the"))', 'STD(log.products.rating)', 'STD(log.STRING_COUNT(comments, "the"))', 'customers.SUM(log.value_many_nans)', 'customers.SUM(log.value_2)', 'customers.SUM(log.value)', 'customers.MEAN(log.value_many_nans)', 'customers.MEAN(log.value_2)', 'customers.MEAN(log.value)', 'customers.STD(log.value_many_nans)', 'customers.STD(log.value_2)', 'customers.STD(log.value)', 'customers.STRING_COUNT(favorite_quote, "the")', 'customers.cohorts.cohort_name', 'customers.régions.language'], dtype='object')

In [8]: feature_matrix[['STD(log.STRING_COUNT(comments, "the"))', 'SUM(log.STRING_COUNT(comments, "the"))', 'MEAN(log.STRING_COUNT(comments, "the"))']]
Out[8]: 
    STD(log.STRING_COUNT(comments, "the"))  SUM(log.STRING_COUNT(comments, "the"))  MEAN(log.STRING_COUNT(comments, "the"))
id                                                                                                                         
0                                47.124304                                     209                                    41.80
1                                36.509131                                     109                                    27.25
2                                      NaN                                      29                                    29.00
3                                49.497475                                      70                                    35.00
4                                 0.000000                                       0                                     0.00
5                                 1.414214                                       4                                     2.00
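
The three columns can be cross-checked against each other. The sketch below assumes that STD here is the sample standard deviation (divisor n - 1), which is what the numbers above suggest; it uses only the standard library and does not call Featuretools. For session 5, SUM = 4 and MEAN = 2.00 imply two log rows, and counts of 1 and 3 are the only such pair with a sample standard deviation of 1.414214. For session 2, SUM and MEAN are both 29, which implies a single log row, for which the sample standard deviation is undefined. That is consistent with the NaN in the table.

```python
from statistics import StatisticsError, stdev

# Session 5: two log rows whose counts sum to 4 (inferred from SUM and MEAN).
session_5 = [1, 3]
std_5 = stdev(session_5)
print(round(std_5, 6))  # 1.414214

# Session 2: a single log row (inferred from SUM == MEAN == 29).
session_2 = [29]
try:
    stdev(session_2)
    std_2 = "defined"
except StatisticsError:
    # The sample standard deviation needs at least two data points.
    std_2 = "undefined"
print(std_2)  # undefined
```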

Features with Multiple Outputs

With the make_primitive functions, a single feature can output multiple columns. To do that, the output must be formatted as a list of arrays/series where each item in the list corresponds to one output from the primitive. Each of these list items (either arrays or series) must contain one element for each input element.

Take, for example, a primitive called case_count. For each input string, this primitive outputs the number of uppercase letters and the number of lowercase letters. The function must therefore return a list with 2 elements: the first holds the uppercase counts and the second holds the lowercase counts. Each element in the list is a series/array with one entry per input string. Below you can see this example in action, as well as the proper way to specify multiple outputs in the make_trans_primitive function.

In [9]: import re
   ...: import numpy as np
   ...: 
   ...: def case_count(array):
   ...:     '''Return the count of upper case and lower case letters in text'''
   ...:     upper = np.array([len(re.findall('[A-Z]', i)) for i in array])
   ...:     lower = np.array([len(re.findall('[a-z]', i)) for i in array])
   ...:     ret = [upper, lower]
   ...:     return ret
   ...: 
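
The required output layout is easiest to see on a small input. The sketch below is a dependency-free variant of case_count that uses plain lists in place of NumPy arrays (an assumption made only to keep the example self-contained), applied to hypothetical sample strings. The result has one inner sequence per output, and each inner sequence has one entry per input string.

```python
import re


def case_count(strings):
    """Return [uppercase counts, lowercase counts], one entry per input string."""
    upper = [len(re.findall('[A-Z]', s)) for s in strings]
    lower = [len(re.findall('[a-z]', s)) for s in strings]
    return [upper, lower]


# Hypothetical sample data, used only for illustration.
quotes = ["Hello World", "abc", "ABC def"]

outputs = case_count(quotes)
print(outputs)  # [[2, 0, 3], [8, 3, 3]]

# Two outputs, each as long as the input.
print(len(outputs), [len(o) for o in outputs])  # 2 [3, 3]

# Viewed per input string: (uppercase, lowercase) pairs.
per_string = list(zip(*outputs))
print(per_string)  # [(2, 8), (0, 3), (3, 3)]
```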

We must use the number_output_features argument to specify the number of outputs when creating the primitive using the make_trans_primitive function.

In [10]: CaseCount = make_trans_primitive(function=case_count,
   ....:                                    input_types=[Text],
   ....:                                    return_type=Numeric,
   ....:                                    number_output_features=2)
   ....: 

In [11]: es = make_ecommerce_entityset()

When we call dfs on this entityset, the resulting feature matrix contains our two created features, each with 6 values, one for each row of the sessions target entity.

In [12]: feature_matrix, features = ft.dfs(entityset=es,
   ....:                                   target_entity="sessions",
   ....:                                   agg_primitives=[],
   ....:                                   trans_primitives=[CaseCount])
   ....: 

In [13]: feature_matrix.columns
Out[13]: Index(['device_name', 'customer_id', 'device_type', 'customers.cohort', 'customers.age', 'customers.région_id', 'customers.loves_ice_cream', 'customers.cancel_reason', 'customers.engagement_level', 'customers.cohorts.cohort_name', 'customers.régions.language', 'customers.CASE_COUNT(favorite_quote)[0]', 'customers.CASE_COUNT(favorite_quote)[1]'], dtype='object')

In [14]: feature_matrix[['customers.CASE_COUNT(favorite_quote)[0]', 'customers.CASE_COUNT(favorite_quote)[1]']]
Out[14]: 
    customers.CASE_COUNT(favorite_quote)[0]  customers.CASE_COUNT(favorite_quote)[1]
id                                                                                  
0                                         1                                       44
1                                         1                                       44
2                                         1                                       44
3                                         1                                       41
4                                         1                                       41
5                                         1                                       57