Using Spark EntitySets (BETA)¶
Note
Support for Spark EntitySets is still in Beta. While the key functionality has been implemented, development is ongoing to add the remaining functionality.
All planned improvements to the Featuretools/Spark integration are documented on GitHub. If you see an open issue that is important for your application, please let us know by upvoting or commenting on the issue. If you encounter any errors using Spark dataframes in EntitySets, or find missing functionality that does not yet have an open issue, please create a new issue on GitHub.
Note
Featuretools does not currently support Spark for Python 3.10.
Creating a feature matrix from a very large dataset can be problematic if the underlying pandas dataframes that make up the EntitySet cannot easily fit in memory. To help get around this issue, Featuretools supports creating EntitySet objects from Spark dataframes. A Spark EntitySet can then be passed to featuretools.dfs or featuretools.calculate_feature_matrix to create a feature matrix, which will be returned as a Spark dataframe. In addition to working on larger-than-memory datasets, this approach also allows users to take advantage of the parallel and distributed processing capabilities offered by Spark.
This guide provides an overview of how to create a Spark EntitySet and then generate a feature matrix from it. If you are already familiar with creating a feature matrix starting from pandas dataframes, this process will seem quite familiar, as the steps are the same. There are, however, some limitations when using Spark dataframes, and those limitations are reviewed in more detail below.
Creating EntitySets¶
Spark EntitySets require PySpark. Featuretools and PySpark can be installed together with pip install "featuretools[spark]". Java is also required for PySpark and may need to be installed separately; see the Spark documentation for more details. We will create a very small Spark dataframe for this example. The pandas-on-Spark dataframes used here (from pyspark.pandas) can also be created from pandas dataframes, from native Spark dataframes, or by reading directly from a file.
[2]:
import featuretools as ft
import pyspark.pandas as ps

# Small example data; the name "ids" avoids shadowing the built-in id()
ids = [0, 1, 2, 3, 4]
values = [12, -35, 14, 103, -51]
spark_df = ps.DataFrame({"id": ids, "values": values})
spark_df
[2]:
| | id | values |
|---|---|---|
| 0 | 0 | 12 |
| 1 | 1 | -35 |
| 2 | 2 | 14 |
| 3 | 3 | 103 |
| 4 | 4 | -51 |
Now that we have our Spark dataframe, we can start to create the EntitySet. Inferring Woodwork logical types for the columns in a Spark dataframe can be computationally expensive. To avoid this expense, logical type inference can be skipped by supplying a dictionary of logical types using the logical_types parameter when calling es.add_dataframe(). Logical types can be specified as Woodwork LogicalType classes, or their equivalent string representation. For more information on using Woodwork types, refer to the Woodwork Typing in Featuretools guide.
Aside from supplying the logical types, the rest of the process of creating an EntitySet is the same as if we were using pandas DataFrames.
[3]:
from woodwork.logical_types import Double, Integer

es = ft.EntitySet(id="spark_es")
es = es.add_dataframe(
    dataframe_name="spark_input_df",
    dataframe=spark_df,
    index="id",
    logical_types={"id": Integer, "values": Double},
)
es
[3]:
Entityset: spark_es
DataFrames:
spark_input_df [Rows: 5, Columns: 2]
Relationships:
No relationships
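Because logical types can also be given as strings, the logical_types dictionary can be assembled without importing the Woodwork classes. The following is only a sketch: DTYPE_TO_LOGICAL_TYPE and build_logical_types are hypothetical names, not part of Featuretools or Woodwork, and the mapping is an assumption that would need to be adapted to your data. The dtype strings would typically be read from the dataframe's dtypes.

```python
# Hypothetical mapping from dtype names to Woodwork logical type names.
# This is an assumption for illustration; extend it to match your data.
DTYPE_TO_LOGICAL_TYPE = {
    "int64": "Integer",
    "float64": "Double",
    "bool": "Boolean",
    "datetime64[ns]": "Datetime",
}


def build_logical_types(column_dtypes, overrides=None):
    """Return a {column: logical type name} dict for es.add_dataframe()."""
    logical_types = {}
    for column, dtype in column_dtypes.items():
        if str(dtype) not in DTYPE_TO_LOGICAL_TYPE:
            raise KeyError(f"No logical type configured for dtype {dtype!r}")
        logical_types[column] = DTYPE_TO_LOGICAL_TYPE[str(dtype)]
    # Explicit choices take precedence over the dtype-based defaults.
    logical_types.update(overrides or {})
    return logical_types


# dtypes as they might be reported for the example dataframe above
logical_types = build_logical_types(
    {"id": "int64", "values": "int64"},
    overrides={"values": "Double"},
)
print(logical_types)
```

The resulting dictionary could then be passed as the logical_types argument in place of the class-based dictionary used above.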
Running DFS¶
We can pass the EntitySet we created above to featuretools.dfs in order to create a feature matrix. If the EntitySet we pass to dfs is made of Spark dataframes, the feature matrix we get back will be a Spark dataframe.
[4]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="spark_input_df",
    trans_primitives=["negate"],
    max_depth=1,
)
feature_matrix
22/06/10 15:34:18 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
[4]:
| | values | -(values) | id |
|---|---|---|---|
| 0 | 12.0 | -12.0 | 0 |
| 1 | -35.0 | 35.0 | 1 |
| 2 | 14.0 | -14.0 | 2 |
| 3 | -51.0 | 51.0 | 4 |
| 4 | 103.0 | -103.0 | 3 |
This feature matrix can be saved to disk or converted to a pandas dataframe and brought into memory, using the appropriate Spark dataframe methods.
While this is a simple example to illustrate the process of using Spark dataframes with Featuretools, this process will also work with an EntitySet containing multiple dataframes, as well as with aggregation primitives.
Limitations¶
The key functionality of Featuretools is available for use with a Spark EntitySet, and work is ongoing to add the remaining functionality that is available when using a pandas EntitySet. There are, however, some limitations to be aware of when creating a Spark EntitySet and then using it to generate a feature matrix. The most significant limitations are reviewed in more detail in this section.
Note
If the limitations of using a Spark EntitySet are prohibitive for your use case, you may still be able to compute a larger-than-memory feature matrix by partitioning your data as described in Improving Computational Performance.
Supported Primitives¶
When creating a feature matrix from a Spark EntitySet, only certain primitives can be used. Primitives that rely on the order of the entire dataframe or require an entire column for computation are currently not supported when using a Spark EntitySet. Multivariable and time-dependent aggregation primitives also are not currently supported.
To obtain a list of the primitives that can be used with a Spark EntitySet, you can call featuretools.list_primitives(). This will return a table of all primitives. Any primitive that can be used with a Spark EntitySet will have a value of True in the spark_compatible column.
[5]:
primitives_df = ft.list_primitives()
spark_compatible_df = primitives_df[primitives_df["spark_compatible"] == True]
spark_compatible_df.head()
[5]:
| | name | type | dask_compatible | spark_compatible | description | valid_inputs | return_type |
|---|---|---|---|---|---|---|---|
| 1 | count | aggregation | True | True | Determines the total number of values, excludi... | <ColumnSchema (Semantic Tags = ['index'])> | None |
| 8 | num_unique | aggregation | True | True | Determines the number of distinct values, igno... | <ColumnSchema (Semantic Tags = ['category'])> | None |
| 10 | min | aggregation | True | True | Calculates the smallest value, ignoring `NaN` ... | <ColumnSchema (Semantic Tags = ['numeric'])> | None |
| 12 | mean | aggregation | True | True | Computes the average for a list of values. | <ColumnSchema (Semantic Tags = ['numeric'])> | None |
| 16 | sum | aggregation | True | True | Calculates the total addition, ignoring `NaN`. | <ColumnSchema (Semantic Tags = ['numeric'])> | None |
[6]:
spark_compatible_df.tail()
[6]:
| | name | type | dask_compatible | spark_compatible | description | valid_inputs | return_type |
|---|---|---|---|---|---|---|---|
| 113 | is_weekend | transform | True | True | Determines if a date falls on a weekend. | <ColumnSchema (Logical Type = Datetime)> | None |
| 115 | divide_numeric_scalar | transform | True | True | Divide each element in the list by a scalar. | <ColumnSchema (Semantic Tags = ['numeric'])> | None |
| 116 | sine | transform | True | True | Computes the sine of a number. | <ColumnSchema (Semantic Tags = ['numeric'])> | None |
| 117 | subtract_numeric_scalar | transform | True | True | Subtract a scalar from each element in the list. | <ColumnSchema (Semantic Tags = ['numeric'])> | None |
| 118 | num_characters | transform | True | True | Calculates the number of characters in a string. | <ColumnSchema (Logical Type = NaturalLanguage)> | None |
DataFrame Limitations¶
Featuretools stores the DataFrames that make up an EntitySet as Woodwork DataFrames, which include additional typing information about the columns that are in the DataFrame. When adding a DataFrame to an EntitySet, Woodwork will attempt to infer the logical types for any columns that do not have a logical type defined. This inference process can be quite expensive for Spark DataFrames. In order to skip type inference and speed up the process of adding a Spark DataFrame to an EntitySet, users can specify the logical type to use for each column in the DataFrame. A list of available logical types can be obtained by running featuretools.list_logical_types(). To learn more about the limitations of a Spark dataframe with Woodwork typing, see the Woodwork guide on Spark dataframes.
By default, Woodwork checks that pandas dataframes have unique index values. Because performing this same check with Spark could be computationally expensive, this check is not performed when adding a Spark dataframe to an EntitySet. When using Spark dataframes, users must ensure that the supplied index values are unique.
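One way to verify uniqueness yourself is to compare the number of distinct index values with the number of rows. The sketch below uses a hypothetical helper, index_is_unique, that is not part of Featuretools. It is demonstrated on pandas dataframes so that it runs without a Spark installation, on the assumption that a pandas-on-Spark dataframe would be passed in practice.

```python
import pandas as pd


def index_is_unique(df, index_column):
    """Return True if every value in index_column is present and distinct.

    Relies only on len() and Series.nunique(), which pandas and
    pandas-on-Spark dataframes both provide. On Spark this triggers a full
    pass over the column, so it can be slow for large data.
    """
    return bool(df[index_column].nunique() == len(df))


# pandas stand-ins for Spark dataframes, so the sketch runs without a cluster
unique_df = pd.DataFrame({"id": [0, 1, 2], "values": [12, -35, 14]})
duplicated_df = pd.DataFrame({"id": [0, 1, 1], "values": [12, -35, 14]})

print(index_is_unique(unique_df, "id"))
print(index_is_unique(duplicated_df, "id"))
```

Since the check itself costs a Spark job, it may be preferable to run it once when the data is prepared rather than each time an EntitySet is built.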
When using pandas DataFrames, the ordering of the underlying DataFrame rows is maintained by Featuretools. For a Spark DataFrame, the ordering of the DataFrame rows is not guaranteed, and Featuretools does not attempt to maintain row order in a Spark DataFrame. If ordering is important, sort or otherwise verify any output explicitly rather than relying on the order in which rows are returned.
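As a minimal sketch of handling this, the rows can be sorted on the index column once the feature matrix is small enough to bring into memory. The dataframe below is a pandas stand-in for a converted feature matrix, filled with the rows shown in the DFS output earlier in this guide; the variable names are illustrative.

```python
import pandas as pd

# pandas stand-in for a feature matrix converted to pandas, using the rows
# from the DFS output shown earlier (note that ids 3 and 4 arrive swapped)
feature_matrix_pd = pd.DataFrame(
    {
        "values": [12.0, -35.0, 14.0, -51.0, 103.0],
        "-(values)": [-12.0, 35.0, -14.0, 51.0, -103.0],
        "id": [0, 1, 2, 4, 3],
    }
)

# Sort on the index column explicitly instead of relying on the returned order.
ordered = feature_matrix_pd.sort_values("id").reset_index(drop=True)
print(ordered)
```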
EntitySet Limitations¶
When creating a Featuretools EntitySet that will be made of Spark dataframes, all of the dataframes used to create the EntitySet must be of the same type, either all Spark dataframes, all Dask dataframes, or all pandas dataframes. Featuretools does not support creating an EntitySet containing a mix of Spark, Dask, and pandas dataframes.
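A quick check before building the EntitySet can catch accidental mixing early. The helper below, check_single_dataframe_type, is a hypothetical sketch and not a Featuretools function; the dataframe names are made up, and OtherFrame merely stands in for a dataframe from a different library.

```python
import pandas as pd


def check_single_dataframe_type(dataframes):
    """Raise TypeError if the named dataframes are not all of one type."""
    type_names = {name: type(df).__name__ for name, df in dataframes.items()}
    if len({type(df) for df in dataframes.values()}) > 1:
        raise TypeError(f"Mixed dataframe types: {type_names}")


class OtherFrame:
    """Stand-in for a dataframe from a different library."""


customers = pd.DataFrame({"id": [0, 1]})
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [0, 1]})

# All pandas: passes silently.
check_single_dataframe_type({"customers": customers, "orders": orders})

# Mixed types: raises before any dataframe is added to an EntitySet.
try:
    check_single_dataframe_type({"customers": customers, "orders": OtherFrame()})
except TypeError as error:
    print(error)
```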
Additionally, EntitySet.add_interesting_values() cannot be used in Spark EntitySets to find interesting values; however, it can be used to set a column’s interesting values with the values parameter.
[7]:
values_dict = {'values': [12, 103]}
es.add_interesting_values(dataframe_name='spark_input_df', values=values_dict)
es['spark_input_df'].ww.columns['values'].metadata
[7]:
{'dataframe_name': 'spark_input_df',
'entityset_id': 'spark_es',
'interesting_values': [12, 103]}
DFS Limitations¶
There are a few key limitations when generating a feature matrix from a Spark EntitySet.
If a cutoff_time parameter is passed to featuretools.dfs(), it should be a single cutoff time value or a pandas dataframe. The current implementation will still work if a Spark dataframe is supplied for cutoff times, but a .to_pandas() call will be made on the dataframe to convert it into a pandas dataframe. This conversion will result in a warning, and the process could take a considerable amount of time to complete depending on the size of the supplied dataframe.
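To avoid the implicit conversion and its warning, the cutoff times can be prepared as a pandas dataframe up front. The sketch below is illustrative: as_pandas_cutoff_time is a hypothetical helper, the dates are invented, and the column layout (an instance id column plus a "time" column) is an assumption about the expected cutoff time format that should be checked against the Featuretools documentation on cutoff times.

```python
import pandas as pd


def as_pandas_cutoff_time(cutoff_time):
    """Return a cutoff time dataframe as pandas, converting if necessary."""
    if isinstance(cutoff_time, pd.DataFrame):
        return cutoff_time
    # pandas-on-Spark dataframes expose to_pandas(); this collects all rows
    # onto the driver, so it is only sensible for small cutoff tables.
    return cutoff_time.to_pandas()


# Hypothetical cutoff times: one row per instance id, plus a "time" column.
cutoff_time = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "time": pd.to_datetime(["2022-01-01", "2022-01-15", "2022-02-01"]),
    }
)
cutoff_time = as_pandas_cutoff_time(cutoff_time)
print(cutoff_time.dtypes)
```

Doing the conversion explicitly makes the cost visible at the point where it happens, rather than inside the call to dfs.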
Additionally, Featuretools does not currently support the use of the approximate or training_window parameters when working with Spark EntitySets, although support is expected in future releases.
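In a pipeline that builds its dfs arguments dynamically, a small guard can reject these parameters before any Spark work starts. This is a sketch only: check_spark_dfs_kwargs and UNSUPPORTED_WITH_SPARK are hypothetical names, not part of Featuretools, and the list reflects only the two parameters named above.

```python
# Parameters named above as unsupported with Spark EntitySets.
UNSUPPORTED_WITH_SPARK = ("approximate", "training_window")


def check_spark_dfs_kwargs(dfs_kwargs):
    """Raise ValueError if dfs_kwargs sets a parameter unsupported with Spark."""
    used = [
        name for name in UNSUPPORTED_WITH_SPARK if dfs_kwargs.get(name) is not None
    ]
    if used:
        raise ValueError(f"Not supported with a Spark EntitySet: {', '.join(used)}")
    return dfs_kwargs


kwargs = check_spark_dfs_kwargs(
    {
        "target_dataframe_name": "spark_input_df",
        "trans_primitives": ["negate"],
        "max_depth": 1,
    }
)
# The validated arguments could then be passed on, e.g. ft.dfs(entityset=es, **kwargs)
```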
Finally, if the output feature matrix contains a boolean column with NaN values included, the column may have a different data type than it would in the same feature matrix generated from a pandas EntitySet. If feature matrix column data types are critical, the feature matrix should be inspected to make sure the columns have the expected types, and recast as necessary.
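As an illustration of inspecting and recasting, the sketch below works on a pandas stand-in for a converted feature matrix. The column name and the object dtype are assumptions made for the example; the dtype actually returned may differ depending on the library versions in use.

```python
import numpy as np
import pandas as pd

# pandas stand-in for a converted feature matrix; the column name is hypothetical.
# A boolean feature with missing values may arrive as a generic object column.
feature_matrix_pd = pd.DataFrame(
    {"IS_WEEKEND(date)": pd.Series([True, np.nan, False], dtype="object")}
)
print(feature_matrix_pd.dtypes)

# Recast to pandas' nullable boolean dtype, which keeps missing values as <NA>.
feature_matrix_pd["IS_WEEKEND(date)"] = feature_matrix_pd["IS_WEEKEND(date)"].astype(
    "boolean"
)
print(feature_matrix_pd.dtypes)
```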
Other Limitations¶
Currently, featuretools.encode_features() does not work with a Spark dataframe as input. This will hopefully be resolved in a future release of Featuretools.
The utility function featuretools.make_temporal_cutoffs() will not work properly with Spark inputs for instance_ids or cutoffs. However, as noted above, if a cutoff_time dataframe is supplied to dfs, the supplied dataframe should be a pandas dataframe, and this can be generated by supplying pandas inputs to make_temporal_cutoffs().
featuretools.remove_low_information_features() cannot currently be used with a Spark feature matrix.
When manually defining a Feature, the use_previous parameter cannot be used if this feature will be applied to calculate a feature matrix from a Spark EntitySet.