NOTICE
The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:
pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip
For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.
Note
Support for Dask EntitySets is still in Beta. While the key functionality has been implemented, development is ongoing to add the remaining functionality.
All planned improvements to the Featuretools/Dask integration are documented on Github. If you see an open issue that is important for your application, please let us know by upvoting or commenting on the issue. If you encounter any errors using Dask entities, or find missing functionality that does not yet have an open issue, please create a new issue on Github.
Creating a feature matrix from a very large dataset can be problematic if the underlying pandas dataframes that make up the entities cannot easily fit in memory. To help get around this issue, Featuretools supports creating Entity and EntitySet objects from Dask dataframes. A Dask EntitySet can then be passed to featuretools.dfs or featuretools.calculate_feature_matrix to create a feature matrix, which will be returned as a Dask dataframe. In addition to working on larger than memory datasets, this approach also allows users to take advantage of the parallel and distributed processing capabilities offered by Dask.
Entity
EntitySet
featuretools.dfs
featuretools.calculate_feature_matrix
This guide will provide an overview of how to create a Dask EntitySet and then generate a feature matrix from it. If you are already familiar with creating a feature matrix starting from pandas dataframes, this process will seem quite familiar, as there are no differences in the process. There are, however, some limitations when using Dask dataframes, and those limitations are reviewed in more detail below.
For this example, we will create a very small pandas dataframe and then convert this into a Dask dataframe to use in the remainder of the process. Normally when using Dask, you would just read your data directly into a Dask dataframe without the intermediate step of using pandas.
In [1]: import featuretools as ft In [2]: import pandas as pd In [3]: import dask.dataframe as dd In [4]: id = [0, 1, 2, 3, 4] In [5]: values = [12, -35, 14, 103, -51] In [6]: df = pd.DataFrame({"id": id, "values": values}) In [7]: dask_df = dd.from_pandas(df, npartitions=2) In [8]: dask_df Out[8]: Dask DataFrame Structure: id values npartitions=2 0 int64 int64 3 ... ... 4 ... ... Dask Name: from_pandas, 2 tasks
Now that we have our Dask dataframe, we can start to create the EntitySet. The current implementation does not support variable type inference for Dask entities, so we must pass a dictionary of variable types using the variable_types parameter when calling es.entity_from_dataframe(). Aside from needing to supply the variable types, the rest of the process of creating an EntitySet is the same as if we were using pandas dataframes.
variable_types
es.entity_from_dataframe()
In [9]: es = ft.EntitySet(id="dask_es") In [10]: es = es.entity_from_dataframe(entity_id="dask_entity", ....: dataframe=dask_df, ....: index="id", ....: variable_types={"id": ft.variable_types.Id, ....: "values": ft.variable_types.Numeric}) ....: In [11]: es Out[11]: Entityset: dask_es Entities: dask_entity [Rows: Delayed('int-003fdf70-f49f-4a29-bc42-3d4c2a9d8313'), Columns: 2] Relationships: No relationships
Notice that when we print our EntitySet, the number of rows for the dask_entity entity is returned as a Dask Delayed object. This is because obtaining the length of a Dask dataframe may require an expensive compute operation to sum up the lengths of all the individual partitions that make up the dataframe and that operation is not performed by default.
dask_entity
Delayed
We can pass the EntitySet we created above to featuretools.dfs in order to create a feature matrix. If the EntitySet we pass to dfs is made of Dask entities, the feature matrix we get back will be a Dask dataframe.
dfs
In [12]: feature_matrix, features = ft.dfs(entityset=es, ....: target_entity="dask_entity", ....: trans_primitives=["negate"], ....: max_depth=1) ....: In [13]: feature_matrix Out[13]: Dask DataFrame Structure: values -(values) id npartitions=2 0 int64 int64 int64 3 ... ... ... 4 ... ... ... Dask Name: getitem, 21 tasks
This feature matrix can be saved to disk or computed and brought into memory, using the appropriate Dask dataframe methods.
In [14]: fm_computed = feature_matrix.compute() In [15]: fm_computed Out[15]: values -(values) id 0 12 -12 0 1 -35 35 1 2 14 -14 2 3 103 -103 3 4 -51 51 4
While this is a simple example to illustrate the process of using Dask dataframes with Featuretools, this process will also work with an EntitySet containing multiple entities, as well as with aggregation primitives.
The key functionality of Featuretools is available for use with a Dask EntitySet, and work is ongoing to add the remaining functionality that is available when using a pandas EntitySet. There are, however, some limitations to be aware of when creating a Dask Entityset and then using it to generate a feature matrix. The most significant limitations are reviewed in more detail in this section.
Entityset
If the limitations of using a Dask EntitySet are problematic for your problem, you may still be able to compute a larger-than-memory feature matrix by partitioning your data as described in Improving Computational Performance.
When creating a feature matrix from a Dask EntitySet, only certain primitives can be used. Primitives that rely on the order of the entire dataframe or require an entire column for computation are currently not supported when using a Dask EntitySet. Multivariable and time-dependent aggregation primitives also are not currently supported.
To obtain a list of the primitives that can be used with a Dask EntitySet, you can call featuretools.list_primitives(). This will return a table of all primitives. Any primitive that can be used with a Dask EntitySet will have a value of True in the dask_compatible column.
featuretools.list_primitives()
True
dask_compatible
In [16]: primitives_df = ft.list_primitives() In [17]: dask_compatible_df = primitives_df[primitives_df["dask_compatible"] == True] In [18]: dask_compatible_df.head() Out[18]: name type dask_compatible koalas_compatible description valid_inputs return_type 4 sum aggregation True True Calculates the total addition, ignoring `NaN`. Numeric Numeric 5 any aggregation True False Determines if any value is 'True' in a list. Boolean Boolean 7 count aggregation True True Determines the total number of values, excludi... Index Numeric 8 num_unique aggregation True True Determines the number of distinct values, igno... Discrete Numeric 9 min aggregation True True Calculates the smallest value, ignoring `NaN` ... Numeric Numeric In [19]: dask_compatible_df.tail() Out[19]: name type dask_compatible koalas_compatible description valid_inputs return_type 78 minute transform True True Determines the minutes value of a datetime. Datetime Numeric 79 subtract_numeric_scalar transform True True Subtract a scalar from each element in the list. Numeric Numeric 81 modulo_by_feature transform True True Return the modulo of a scalar by each element ... Numeric Numeric 82 equal transform True True Determines if values in one list are equal to ... Variable Boolean 83 hour transform True True Determines the hour value of a datetime. Datetime Ordinal
At this time, custom primitives created with featuretools.primitives.make_trans_primitive() or featuretools.primitives.make_agg_primitive() cannot be used for running deep feature synthesis on a Dask EntitySet. While it is possible to create custom primitives for use with a Dask EntitySet by extending the proper primitive class, there are several potential problems in doing so, and those issues are beyond the scope of this guide.
featuretools.primitives.make_trans_primitive()
featuretools.primitives.make_agg_primitive()
When creating a Featuretools Entity from Dask dataframes, variable type inference is not performed as it is when creating entities from pandas dataframes. This is done to improve speed as sampling the data to infer the variable types would require an expensive compute operation on the underlying Dask dataframe. As a consequence, users must define the variable types for each column in the supplied Dataframe. This step is needed so that the deep feature synthesis process can build the proper features based on the column types. A list of available variable types can be obtained by running featuretools.variable_types.find_variable_types().
featuretools.variable_types.find_variable_types()
By default, Featuretools checks that entities created from pandas dataframes have unique index values. Because performing this same check with Dask would require an expensive compute operation, this check is not performed when creating an entity from a Dask dataframe. When using Dask dataframes, users must ensure that the supplied index values are unique.
When an Entity is created from a pandas dataframe, the ordering of the underlying dataframe rows is maintained. For a Dask Entity, the ordering of the dataframe rows is not guaranteed, and Featuretools does not attempt to maintain row order in a Dask Entity. If ordering is important, close attention must be paid to any output to avoid issues.
The Entity.add_interesting_values() method is not supported when using a Dask Entity. If needed, users can manually set interesting_values on entities by assigning them directly with syntax similar to this: es["entity_name"]["variable_name"].interesting_values = ["Value 1", "Value 2"].
Entity.add_interesting_values()
interesting_values
es["entity_name"]["variable_name"].interesting_values = ["Value 1", "Value 2"]
When creating a Featuretools EntitySet that will be made of Dask entities, all of the entities used to create the EntitySet must be of the same type, either all Dask entities or all pandas entities. Featuretools does not support creating an EntitySet containing a mix of Dask and pandas entities.
Additionally, the EntitySet.add_interesting_values() method is not supported when using a Dask EntitySet. Users can manually set interesting_values on entities, as described above.
EntitySet.add_interesting_values()
There are a few key limitations when generating a feature matrix from a Dask EntitySet.
If a cutoff_time parameter is passed to featuretools.dfs() it should be a single cutoff time value, or a pandas dataframe. The current implementation will still work if a Dask dataframe is supplied for cutoff times, but a .compute() call will be made on the dataframe to convert it into a pandas dataframe. This conversion will result in a warning, and the process could take a considerable amount of time to complete depending on the size of the supplied dataframe.
cutoff_time
featuretools.dfs()
.compute()
Additionally, Featuretools does not currently support the use of the approximate or training_window parameters when working with Dask entitiysets, but should in future releases.
approximate
training_window
Finally, if the output feature matrix contains a boolean column with NaN values included, the column type may have a different datatype than the same feature matrix generated from a pandas EntitySet. If feature matrix column data types are critical, the feature matrix should be inspected to make sure the types are of the proper types, and recast as necessary.
NaN
In some instances, generating a feature matrix with a large number of features has resulted in memory issues on Dask workers. The underlying reason for this is that the partition size of the feature matrix grows too large for Dask to handle as the number of feature columns grows large. This issue is most prevalent when the feature matrix contains a large number of columns compared to the dataframes that make up the entities. Possible solutions to this problem include reducing the partition size used when creating the entity dataframes or increasing the memory available on Dask workers.
Currently featuretools.encode_features() does not work with a Dask dataframe as input. This will hopefully be resolved in a future release of Featuretools.
featuretools.encode_features()
The utility function featuretools.make_temporal_cutoffs() will not work properly with Dask inputs for instance_ids or cutoffs. However, as noted above, if a cutoff_time dataframe is supplied to dfs, the supplied dataframe should be a pandas dataframe, and this can be generated by supplying pandas inputs to make_temporal_cutoffs().
featuretools.make_temporal_cutoffs()
instance_ids
cutoffs
make_temporal_cutoffs()
The use of featuretools.remove_low_information_features() cannot currently be used with a Dask feature matrix.
featuretools.remove_low_information_features()
When manually defining a Feature, the use_previous parameter cannot be used if this feature will be applied to calculate a feature matrix from a Dask EntitySet.
Feature
use_previous