{ "cells": [ { "cell_type": "markdown", "id": "99052370", "metadata": {}, "source": [ "# Using Dask EntitySets (BETA)" ] }, { "cell_type": "raw", "id": "84275fe9", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " Support for Dask EntitySets is still in Beta. While the key functionality has been implemented, development is ongoing to add the remaining functionality.\n", "\n", " All planned improvements to the Featuretools/Dask integration are `documented on GitHub `_. If you see an open issue that is important for your application, please let us know by upvoting or commenting on the issue. If you encounter any errors using Dask dataframes in EntitySets, or find missing functionality that does not yet have an open issue, please create a `new issue on GitHub `_." ] }, { "cell_type": "markdown", "id": "49c496cb", "metadata": {}, "source": [ "Creating a feature matrix from a very large dataset can be problematic if the underlying pandas dataframes that make up the EntitySet cannot easily fit in memory. To help get around this issue, Featuretools supports creating `EntitySet` objects from Dask dataframes. A Dask `EntitySet` can then be passed to `featuretools.dfs` or `featuretools.calculate_feature_matrix` to create a feature matrix, which will be returned as a Dask dataframe. In addition to working on larger-than-memory datasets, this approach also allows users to take advantage of the parallel and distributed processing capabilities offered by Dask.\n", "\n", "This guide will provide an overview of how to create a Dask `EntitySet` and then generate a feature matrix from it. If you are already familiar with creating a feature matrix starting from pandas DataFrames, the workflow will seem familiar, because the steps are the same. 
There are, however, some limitations when using Dask dataframes, and those limitations are reviewed in more detail below.\n", "\n", "## Creating EntitySets\n", "\n", "For this example, we will create a very small pandas DataFrame and then convert it into a Dask DataFrame to use in the remainder of the process. Normally when using Dask, you would just read your data directly into a Dask DataFrame without the intermediate step of using pandas." ] }, { "cell_type": "code", "execution_count": null, "id": "96a8f65a", "metadata": {}, "outputs": [], "source": [ "import featuretools as ft\n", "import pandas as pd\n", "import dask.dataframe as dd\n", "\n", "# Named `ids` to avoid shadowing the built-in `id` function\n", "ids = [0, 1, 2, 3, 4]\n", "values = [12, -35, 14, 103, -51]\n", "df = pd.DataFrame({\"id\": ids, \"values\": values})\n", "# Split the small pandas DataFrame into two Dask partitions\n", "dask_df = dd.from_pandas(df, npartitions=2)\n", "\n", "dask_df" ] }, { "cell_type": "markdown", "id": "e0c3d410", "metadata": {}, "source": [ "Now that we have our Dask DataFrame, we can start to create the `EntitySet`. Inferring Woodwork logical types for the columns in a Dask dataframe can be computationally expensive. To avoid this expense, logical type inference can be skipped by supplying a dictionary of logical types using the `logical_types` parameter when calling `es.add_dataframe()`. Logical types can be specified as Woodwork LogicalType classes, or as their equivalent string representations. For more information, refer to the [Woodwork Typing in Featuretools](../getting_started/woodwork_types.ipynb) guide.\n", "\n", "Aside from supplying the logical types, the rest of the process of creating an `EntitySet` is the same as if we were using pandas DataFrames." 
] }, { "cell_type": "code", "execution_count": null, "id": "ffe671d9", "metadata": {}, "outputs": [], "source": [ "from woodwork.logical_types import Double, Integer\n", "\n", "es = ft.EntitySet(id=\"dask_es\")\n", "es = es.add_dataframe(\n", " dataframe_name=\"dask_input_df\",\n", " dataframe=dask_df,\n", " index=\"id\",\n", " logical_types={\"id\": Integer, \"values\": Double},\n", ")\n", "\n", "es" ] }, { "cell_type": "markdown", "id": "b2175c84", "metadata": {}, "source": [ "Notice that when we print our `EntitySet`, the number of rows for the DataFrame named `dask_input_df` is returned as a Dask `Delayed` object. This is because obtaining the length of a Dask DataFrame may require an expensive compute operation to sum up the lengths of all the individual partitions that make up the DataFrame and that operation is not performed by default.\n", "\n", "\n", "## Running DFS\n", "We can pass the `EntitySet` we created above to `featuretools.dfs` in order to create a feature matrix. If the `EntitySet` we pass to `dfs` is made of Dask DataFrames, the feature matrix we get back will be a Dask DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "a90e3640", "metadata": {}, "outputs": [], "source": [ "feature_matrix, features = ft.dfs(\n", " entityset=es,\n", " target_dataframe_name=\"dask_input_df\",\n", " trans_primitives=[\"negate\"],\n", " max_depth=1,\n", ")\n", "feature_matrix" ] }, { "cell_type": "markdown", "id": "03e03d97", "metadata": {}, "source": [ "This feature matrix can be saved to disk or computed and brought into memory, using the appropriate Dask DataFrame methods." 
] }, { "cell_type": "code", "execution_count": null, "id": "19457b84", "metadata": {}, "outputs": [], "source": [ "fm_computed = feature_matrix.compute()\n", "fm_computed" ] }, { "cell_type": "markdown", "id": "af511e72", "metadata": {}, "source": [ "While this is a simple example to illustrate the process of using Dask DataFrames with Featuretools, this process will also work with an `EntitySet` containing multiple dataframes, as well as with aggregation primitives.\n", "\n", "## Limitations\n", "\n", "The key functionality of Featuretools is available for use with a Dask `EntitySet`, and work is ongoing to add the remaining functionality that is available when using a pandas `EntitySet`. There are, however, some limitations to be aware of when creating a Dask `EntitySet` and then using it to generate a feature matrix. The most significant limitations are reviewed in more detail in this section." ] }, { "cell_type": "raw", "id": "a2212141", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " If the limitations of using a Dask ``EntitySet`` are problematic for your use case, you may still be able to compute a larger-than-memory feature matrix by partitioning your data as described in :doc:`performance`." ] }, { "cell_type": "markdown", "id": "7f99e3d0", "metadata": {}, "source": [ "### Supported Primitives\n", "\n", "When creating a feature matrix from a Dask `EntitySet`, only certain primitives can be used. Primitives that rely on the order of the entire DataFrame or require an entire column for computation are currently not supported when using a Dask `EntitySet`. Multivariable and time-dependent aggregation primitives are also not currently supported.\n", "\n", "To obtain a list of the primitives that can be used with a Dask `EntitySet`, you can call `featuretools.list_primitives()`. This will return a table of all primitives. 
Any primitive that can be used with a Dask `EntitySet` will have a value of `True` in the `dask_compatible` column." ] }, { "cell_type": "code", "execution_count": null, "id": "c7410cef", "metadata": {}, "outputs": [], "source": [ "primitives_df = ft.list_primitives()\n", "dask_compatible_df = primitives_df[primitives_df[\"dask_compatible\"] == True]\n", "dask_compatible_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "bc1c6d6b", "metadata": {}, "outputs": [], "source": [ "dask_compatible_df.tail()" ] }, { "cell_type": "markdown", "id": "07aaee73", "metadata": {}, "source": [ "### DataFrame Limitations\n", "\n", "Featuretools stores the DataFrames that make up an EntitySet as Woodwork DataFrames which include additional typing information about the columns that are in the DataFrame. When adding a DataFrame to an `EntitySet`, Woodwork will attempt to infer the logical types for any columns that do not have a logical type defined. This inference process can be quite expensive for Dask DataFrames. In order to skip type inference and speed up the process of adding a Dask DataFrame to an `EntitySet`, users can specify the logical type to use for each column in the DataFrame. A list of available logical types can be obtained by running ``featuretools.list_logical_types()``. To learn more about the limitations of a Dask dataframe with Woodwork typing, see the [Woodwork guide on Dask dataframes](https://woodwork.alteryx.com/en/stable/guides/using_woodwork_with_dask_and_spark.html#Dask-DataFrame-Example).\n", "\n", "By default, Woodwork checks that pandas DataFrames have unique index values. Because performing this same check with Dask would require an expensive compute operation, this check is not performed when adding a Dask DataFrame to an `EntitySet`. 
When using Dask DataFrames, users must ensure that the supplied index values are unique.\n", "\n", "When using pandas DataFrames, the ordering of the underlying DataFrame rows is maintained by Featuretools. For a Dask DataFrame, the ordering of the DataFrame rows is not guaranteed, and Featuretools does not attempt to maintain row order. If ordering is important, close attention must be paid to any output to avoid issues.\n", "\n", "### EntitySet Limitations\n", "\n", "When creating a Featuretools `EntitySet`, all of the DataFrames used to create the `EntitySet` must be of the same type, either all Dask DataFrames or all pandas DataFrames. Featuretools does not support creating an `EntitySet` containing a mix of Dask and pandas DataFrames.\n", "\n", "Additionally, `EntitySet.add_interesting_values()` cannot be used in Dask EntitySets to find interesting values; however, it can be used to set a column's interesting values with the `values` parameter." ] }, { "cell_type": "code", "execution_count": null, "id": "a3c6b9b8", "metadata": {}, "outputs": [], "source": [ "values_dict = {\"values\": [12, 103]}\n", "es.add_interesting_values(dataframe_name=\"dask_input_df\", values=values_dict)\n", "\n", "es[\"dask_input_df\"].ww.columns[\"values\"].metadata" ] }, { "cell_type": "markdown", "id": "35d1b5c0", "metadata": {}, "source": [ "\n", "### DFS Limitations\n", "\n", "There are a few key limitations when generating a feature matrix from a Dask `EntitySet`.\n", "\n", "If a `cutoff_time` parameter is passed to `featuretools.dfs()`, it should be a single cutoff time value or a pandas DataFrame. The current implementation will still work if a Dask DataFrame is supplied for cutoff times, but a `.compute()` call will be made on the DataFrame to convert it into a pandas DataFrame. 
This conversion will result in a warning, and the process could take a considerable amount of time to complete depending on the size of the supplied DataFrame.\n", "\n", "Additionally, Featuretools does not currently support the use of the `approximate` or `training_window` parameters when working with Dask EntitySets, but is expected to support them in future releases.\n", "\n", "Finally, if the output feature matrix contains a boolean column with `NaN` values included, the column may have a different data type than the same column in a feature matrix generated from a pandas `EntitySet`. If feature matrix column data types are critical, the feature matrix should be inspected to confirm that the columns have the expected types, and recast as necessary.\n", "\n", "### Other Limitations\n", "\n", "In some instances, generating a feature matrix with a large number of features has resulted in memory issues on Dask workers. The underlying reason for this is that the partition size of the feature matrix grows too large for Dask to handle as the number of feature columns grows. This issue is most prevalent when the feature matrix contains a large number of columns compared to the DataFrames in the EntitySet. Possible solutions to this problem include reducing the partition size used when creating the DataFrames or increasing the memory available on Dask workers.\n", "\n", "Currently, `featuretools.encode_features()` does not work with a Dask DataFrame as input. This will hopefully be resolved in a future release of Featuretools.\n", "\n", "The utility function `featuretools.make_temporal_cutoffs()` will not work properly with Dask inputs for `instance_ids` or `cutoffs`. 
However, as noted above, if a `cutoff_time` DataFrame is supplied to `dfs`, the supplied DataFrame should be a pandas DataFrame, and this can be generated by supplying pandas inputs to `make_temporal_cutoffs()`.\n", "\n", "`featuretools.remove_low_information_features()` cannot currently be used with a Dask feature matrix.\n", "\n", "When manually defining a `Feature`, the `use_previous` parameter cannot be used if this feature will be applied to calculate a feature matrix from a Dask `EntitySet`." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 5 }