{ "cells": [ { "cell_type": "markdown", "id": "6004844f", "metadata": {}, "source": [ "# Transitioning to Featuretools Version 1.0\n", "\n", "Featuretools version 1.0 incorporates many significant changes that impact the way EntitySets are created, how primitives are defined, and in some cases the resulting feature matrix that is created. This document will provide an overview of the significant changes, helping existing Featuretools users transition to version 1.0.\n", "\n", "## Background and Introduction\n", "\n", "### Why make these changes?\n", "The lack of a unified type system across libraries makes sharing information between libraries more difficult. This problem led to the development of [Woodwork](https://woodwork.alteryx.com/en/stable/). Updating Featuretools to use Woodwork for managing column typing information enables easy sharing of feature matrix column types with other libraries without costly conversions between custom type systems. As an example, [EvalML](https://evalml.alteryx.com/en/stable/), which has also adopted Woodwork, can now use Woodwork typing information on a feature matrix directly to create machine learning models, without first inferring or redefining column types.\n", "\n", "Other benefits of using Woodwork for managing typing in Featuretools include:\n", "\n", "- Simplified code - custom type management code has been removed\n", "- Seamless integration of new types and improvements to type integration as Woodwork improves\n", "- Easy and flexible storage of additional information about columns. For example, we can now store whether a feature was engineered by Featuretools or present in the original data." ] }, { "cell_type": "markdown", "id": "4a9bfede", "metadata": {}, "source": [ "### What has changed?\n", "- The legacy Featuretools custom typing system has been replaced with Woodwork for managing column types\n", "- Both the `Entity` and `Variable` classes have been removed from Featuretools\n", "- Several key Featuretools methods have been moved or updated\n", "\n", "#### Comparison between legacy typing system and Woodwork typing systems\n", "| Featuretools < 1.0 | Featuretools 1.0 | Description |\n", "| ---- | ---- | ---- |\n", "| Entity | Woodwork DataFrame | stores typing information for all columns |\n", "| Variable | ColumnSchema | stores typing information for a single column |\n", "| Variable subclass | LogicalType and semantic_tags | elements used to define a column type |\n", "\n", "#### Summary of significant method changes\n", "\n", "The table below outlines the most significant changes that have occurred. In Summary: In some cases, the method arguments have also changed, and those changes are outlined in more detail throughout this document.\n", "\n", "| Older Versions | Featuretools 1.0 |\n", "| ---- | ---- |\n", "| EntitySet.entity_from_dataframe | EntitySet.add_dataframe |\n", "| EntitySet.normalize_entity | EntitySet.normalize_dataframe |\n", "| EntitySet.update_data | EntitySet.replace_dataframe |\n", "| Entity.variable_types | es['dataframe_name'].ww |\n", "| es['entity_id']['variable_name'] | es['dataframe_name'].ww.columns['column_name'] |\n", "| Entity.convert_variable_type | es['dataframe_name'].ww.set_types |\n", "| Entity.add_interesting_values | es.add_interesting_values(dataframe_name='df_name', ...) |\n", "| Entity.set_secondary_time_index | es.set_secondary_time_index(dataframe_name='df_name', ...) |\n", "| Feature(es['entity_id']['variable_name']) | Feature(es['dataframe_name'].ww['column_name']) |\n", "| dfs(target_entity='entity_id', ...) | dfs(target_dataframe_name='dataframe_name', ...) |" ] }, { "cell_type": "markdown", "id": "c3b1e217", "metadata": {}, "source": [ "For more information on how Woodwork manages typing information, refer to the [Woodwork Understanding Types and Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide." ] }, { "cell_type": "markdown", "id": "a8453248", "metadata": {}, "source": [ "### What do these changes mean for users?\n", "Removing these classes required moving several methods from the `Entity` to the `EntitySet` object. This change also impacts the way relationships, features and primitives are defined, requiring different parameters than were previously required. Also, because the Woodwork typing system is not identical to the old Featuretools typing system, in some cases the feature matrix that is returned can be slightly different as a result of columns being identified as different types.\n", "\n", "All of these changes, and more, will be reviewed in detail throughout this document, providing examples of both the old and new API where possible." ] }, { "cell_type": "markdown", "id": "de402e3b", "metadata": {}, "source": [ "## Removal of `Entity` Class and Updates to `EntitySet`\n", "\n", "In previous versions of Featuretools an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are now created by adding multiple dataframes and defining relationships between columns in the dataframes. While conceptually similar, there are some minor differences in the process.\n", "\n", "### Adding dataframes to an EntitySet\n", "\n", "When adding dataframes to an EntitySet, users can pass in a Woodwork dataframe or a regular dataframe without Woodwork typing information. As before, Featuretools supports creating EntitySets from pandas, Dask and Spark dataframes. If users supply a dataframe that has Woodwork typing information initialized, Featuretools will simply use this typing information directly. If users supply a dataframe without Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified.\n", "\n", "Below are some examples to illustrate this process. First we will create two small dataframes to use for the example." ] }, { "cell_type": "code", "execution_count": null, "id": "5bea1bd4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import featuretools as ft" ] }, { "cell_type": "code", "execution_count": null, "id": "b094ca23", "metadata": {}, "outputs": [], "source": [ "orders_df = pd.DataFrame(\n", " {\"order_id\": [0, 1, 2], \"order_date\": [\"2021-01-02\", \"2021-01-03\", \"2021-01-04\"]}\n", ")\n", "items_df = pd.DataFrame(\n", " {\n", " \"id\": [0, 1, 2, 3, 4],\n", " \"order_id\": [0, 1, 1, 2, 2],\n", " \"item_price\": [29.95, 4.99, 10.25, 20.50, 15.99],\n", " \"on_sale\": [False, True, False, True, False],\n", " }\n", ")" ] }, { "cell_type": "markdown", "id": "db705814", "metadata": {}, "source": [ "With older versions of Featuretools, users would first create an EntitySet object, and then add dataframes to the EntitySet, by calling `entity_from_dataframe` as shown below.\n", "\n", "```python\n", "es = ft.EntitySet('old_es')\n", "\n", "es.entity_from_dataframe(dataframe=orders_df,\n", " entity_id='orders',\n", " index='order_id',\n", " time_index='order_date')\n", "es.entity_from_dataframe(dataframe=items_df,\n", " entity_id='items',\n", " index='id')\n", "```\n", "\n", "```\n", "Entityset: old_es\n", " Entities:\n", " orders [Rows: 3, Columns: 2]\n", " items [Rows: 5, Columns: 3]\n", " Relationships:\n", " No relationships\n", "```" ] }, { "cell_type": "markdown", "id": "f6f95f35", "metadata": {}, "source": [ "With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same, but some of the details have changed. First, create an EntitySet as before. To add the dataframe call `EntitySet.add_dataframe` in place of the previous `EntitySet.entity_from_dataframe` call. Note that the name of the dataframe is specified in the `dataframe_name` argument, which was previously called `entity_id`." ] }, { "cell_type": "code", "execution_count": null, "id": "b1fdffe4", "metadata": {}, "outputs": [], "source": [ "es = ft.EntitySet(\"new_es\")\n", "\n", "es.add_dataframe(\n", " dataframe=orders_df,\n", " dataframe_name=\"orders\",\n", " index=\"order_id\",\n", " time_index=\"order_date\",\n", ")" ] }, { "cell_type": "markdown", "id": "1c983744", "metadata": {}, "source": [ "You can also define the name, index, and time index by first [initializing Woodwork](https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.init.html#woodwork.table_accessor.WoodworkTableAccessor.init) on the dataframe and then passing the Woodwork initialized dataframe directly to the `add_dataframe` call. For this example we will initialize Woodwork on `items_df`, setting the dataframe name as `items` and specifying that the index should be the `id` column." ] }, { "cell_type": "code", "execution_count": null, "id": "0d5ad8e5", "metadata": {}, "outputs": [], "source": [ "items_df.ww.init(name=\"items\", index=\"id\")\n", "items_df.ww" ] }, { "cell_type": "markdown", "id": "07f5f27c", "metadata": {}, "source": [ "With Woodwork initialized, we no longer need to specify values for the `dataframe_name` or `index` arguments when calling `add_dataframe` as Featuretools will simply use the values that were already specified when Woodwork was initialized." ] }, { "cell_type": "code", "execution_count": null, "id": "5f4ab39a", "metadata": {}, "outputs": [], "source": [ "es.add_dataframe(dataframe=items_df)" ] }, { "cell_type": "markdown", "id": "93814387", "metadata": {}, "source": [ "### Accessing column typing information\n", "\n", "Previously, column variable type information could be accessed for an entire Entity through `Entity.variable_types` or for an individual column by selecting the individual column first through `es['entity_id']['col_id']`.\n", "\n", "```python\n", "es['items'].variable_types\n", "```\n", "```\n", "{'id': featuretools.variable_types.variable.Index,\n", " 'order_id': featuretools.variable_types.variable.Numeric,\n", " 'item_price': featuretools.variable_types.variable.Numeric}\n", "```\n", "```python\n", "es['items']['item_price']\n", "```\n", "```\n", "\n", "```\n", "\n", "With the updated version of Featuretools, the logical types and semantic tags for all of the columns in a single dataframe can be viewed through the `.ww` namespace on the dataframe. First, select the dataframe from the EntitySet with `es['dataframe_name']` and then access the typing information by chaining a `.ww` call on the end as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "6abb9b10", "metadata": {}, "outputs": [], "source": [ "es[\"items\"].ww" ] }, { "cell_type": "markdown", "id": "72775903", "metadata": {}, "source": [ "The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, returning a `Woodwork.ColumnSchema` object that stores the typing information:" ] }, { "cell_type": "code", "execution_count": null, "id": "da516642", "metadata": {}, "outputs": [], "source": [ "es[\"items\"].ww.columns[\"item_price\"]" ] }, { "cell_type": "markdown", "id": "50f9f70a", "metadata": {}, "source": [ "### Type inference and updating column types\n", "\n", "Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine what variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the `Entity.variable_types` dictionary.\n", "\n", "Starting in Featuretools 1.0, column type inference is being handled by Woodwork. Any columns that do not have a logical type assigned by the user when adding a dataframe to an EntitySet will have their logical types inferred by Woodwork. As before, type inference can be skipped for any columns in a dataframe by passing the appropriate logical types in a dictionary when calling `EntitySet.add_dataframe`.\n", "\n", "As an example, we can create a new dataframe and add it to an EntitySet, specifying the logical type for the user's full name as the Woodwork `PersonFullName` logical type." ] }, { "cell_type": "code", "execution_count": null, "id": "a34016b5", "metadata": {}, "outputs": [], "source": [ "users_df = pd.DataFrame(\n", " {\"id\": [0, 1, 2], \"name\": [\"John Doe\", \"Rita Book\", \"Teri Dactyl\"]}\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "d999e022", "metadata": {}, "outputs": [], "source": [ "es.add_dataframe(\n", " dataframe=users_df,\n", " dataframe_name=\"users\",\n", " index=\"id\",\n", " logical_types={\"name\": \"PersonFullName\"},\n", ")\n", "\n", "es[\"users\"].ww" ] }, { "cell_type": "markdown", "id": "d2eff5e1", "metadata": {}, "source": [ "Looking at the typing information above, we can see that the logical type for the `name` column was set to `PersonFullName` as we specified.\n", "\n", "Situations will occur where type inference identifies a column as having the incorrect logical type. In these situations, the logical type can be updated using the Woodwork `set_types` method. Let's say we want the `order_id` column of the `orders` dataframe to have a `Categorical` logical type instead of the `Integer` type that was inferred. Previously, this would have accomplished through the `Entity.convert_variable_type` method.\n", "\n", "```python\n", "from featuretools.variable_types import Categorical\n", "\n", "es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)\n", "```\n", "\n", "Now, we can perform this same update using Woodwork:" ] }, { "cell_type": "code", "execution_count": null, "id": "a6c095b5", "metadata": {}, "outputs": [], "source": [ "es[\"items\"].ww.set_types(logical_types={\"order_id\": \"Categorical\"})\n", "es[\"items\"].ww" ] }, { "cell_type": "markdown", "id": "d9d84e08", "metadata": {}, "source": [ "For additional information on Woodwork typing and how it is used in Featuretools, refer to [Woodwork Typing in Featuretools](../getting_started/woodwork_types.ipynb)." ] }, { "cell_type": "markdown", "id": "bf3dfea2", "metadata": {}, "source": [ "### Adding interesting values\n", "\n", "Interesting values can be added to all dataframes in an EntitySet, a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet.\n", "\n", "To add interesting values for all of the dataframes in an EntitySet, simply call `EntitySet.add_interesting_values`, optionally specifying the maximum number of values to add for each column. This remains unchanged from older versions of Featuretools to the 1.0 release.\n", "\n", "Adding values for a single dataframe or for a single column has changed. Previously to add interesting values for an Entity, users would call `Entity.add_interesting_values()`:\n", "```python\n", "es['items'].add_interesting_values()\n", "```\n", "\n", "Now, in order to specify interesting values for a single dataframe, you call `add_interesting_values` on the EntitySet, and pass the name of the dataframe for which you want interesting values added:" ] }, { "cell_type": "code", "execution_count": null, "id": "c058d2ed", "metadata": {}, "outputs": [], "source": [ "es.add_interesting_values(dataframe_name=\"items\")" ] }, { "cell_type": "markdown", "id": "c3e0a247", "metadata": {}, "source": [ "Previously, to manually add interesting values for a column, you would simply assign them to the attribute of the variable:\n", "\n", "```python\n", "es['items']['order_id'].interesting_values = [1, 2]\n", "```\n", "\n", "Now, this is done through `EntitySet.add_interesting_values`, passing in the name of the dataframe and a dictionary mapping column names to the interesting values to assign for that column. For example, to assign the interesting values of `[1, 2]` to the `order_id` column of the `items` dataframe, use the following approach:" ] }, { "cell_type": "code", "execution_count": null, "id": "8276114b", "metadata": {}, "outputs": [], "source": [ "es.add_interesting_values(dataframe_name=\"items\", values={\"order_id\": [1, 2]})" ] }, { "cell_type": "markdown", "id": "22e70b84", "metadata": {}, "source": [ "Interesting values for multiple columns in the same dataframe can be assigned by adding more entries to the dictionary passed to the `values` parameter.\n", "\n", "Accessing interesting values has changed as well. Previously interesting values could be viewed from the variable:\n", "```python\n", "es['items']['order_id'].interesting_values\n", "```\n", "\n", "Interesting values are now stored in the Woodwork metadata for the columns in a dataframe:" ] }, { "cell_type": "code", "execution_count": null, "id": "8461c4f7", "metadata": {}, "outputs": [], "source": [ "es[\"items\"].ww.columns[\"order_id\"].metadata[\"interesting_values\"]" ] }, { "cell_type": "markdown", "id": "cb23501f", "metadata": {}, "source": [ "### Setting a secondary time index\n", "\n", "In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling `Entity.set_secondary_time_index`. \n", "```python\n", "es_flight = ft.demo.load_flight(nrows=100)\n", "\n", "arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',\n", " 'national_airspace_delay', 'security_delay',\n", " 'late_aircraft_delay', 'canceled', 'diverted',\n", " 'taxi_in', 'taxi_out', 'air_time', 'dep_time']\n", "es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})\n", "```\n", "\n", "Since the `Entity` class has been removed in Featuretools 1.0, this now needs to be done through the `EntitySet` instead:" ] }, { "cell_type": "code", "execution_count": null, "id": "b80b1f6a", "metadata": {}, "outputs": [], "source": [ "es_flight = ft.demo.load_flight(nrows=100)\n", "\n", "arr_time_columns = [\n", " \"arr_delay\",\n", " \"dep_delay\",\n", " \"carrier_delay\",\n", " \"weather_delay\",\n", " \"national_airspace_delay\",\n", " \"security_delay\",\n", " \"late_aircraft_delay\",\n", " \"canceled\",\n", " \"diverted\",\n", " \"taxi_in\",\n", " \"taxi_out\",\n", " \"air_time\",\n", " \"dep_time\",\n", "]\n", "es_flight.set_secondary_time_index(\n", " dataframe_name=\"trip_logs\", secondary_time_index={\"arr_time\": arr_time_columns}\n", ")" ] }, { "cell_type": "markdown", "id": "2ebee2e6", "metadata": {}, "source": [ "Previously, the secondary time index could be accessed directly from the Entity with `es_flight['trip_logs'].secondary_time_index`. Starting in Featuretools 1.0 the secondary time index and the associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "3ea95fdb", "metadata": {}, "outputs": [], "source": [ "es_flight[\"trip_logs\"].ww.metadata[\"secondary_time_index\"]" ] }, { "cell_type": "markdown", "id": "f2f9b64c", "metadata": {}, "source": [ "### Normalizing Entities/DataFrames\n", "\n", "`EntitySet.normalize_entity` has been renamed to `EntitySet.normalize_dataframe` in Featuretools 1.0. The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. When calling this method, the new parameter names need to be used.\n", "\n", "| Old Parameter Name | New Parameter Name |\n", "| --- | --- |\n", "| base_entity_id | base_dataframe_name |\n", "| new_entity_id | new_dataframe_name |\n", "| additional_variables | additional_columns |\n", "| copy_variables | copy_columns |\n", "| new_entity_time_index | new_dataframe_time_index |\n", "| new_entity_secondary_time_index | new_dataframe_secondary_time_index |" ] }, { "cell_type": "markdown", "id": "ca81708b", "metadata": {}, "source": [ "### Defining and adding relationships\n", "\n", "In earlier versions of Featuretools, relationships were defined by creating a `Relationship` object, which took two `Variables` as inputs. To define a relationship between the orders Entity and the items Entity, we would first create a `Relationship` and then add it to the EntitySet:\n", "\n", "```python\n", "relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])\n", "es.add_relationship(relationship)\n", "```\n", "\n", "With Featuretools 1.0, the process is similar, but there are two different ways to add the relationship to the EntitySet. One way is to pass the dataframe and column names to `EntitySet.add_relationship`, and another is to pass a previously created `Relationship` object to the `relationship` keyword argument. Both approaches are demonstrated below." ] }, { "cell_type": "code", "execution_count": null, "id": "7d738807", "metadata": { "nbshpinx": "hidden" }, "outputs": [], "source": [ "# Undo change from above and change child column logical type to match parent and prevent warning\n", "# NOTE: This cell is hidden in the docs build\n", "es[\"items\"].ww.set_types(logical_types={\"order_id\": \"Integer\"})" ] }, { "cell_type": "code", "execution_count": null, "id": "97c04dd4", "metadata": {}, "outputs": [], "source": [ "es.add_relationship(\n", " parent_dataframe_name=\"orders\",\n", " parent_column_name=\"order_id\",\n", " child_dataframe_name=\"items\",\n", " child_column_name=\"order_id\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "26643d04", "metadata": { "nbshpinx": "hidden" }, "outputs": [], "source": [ "# Reset the relationship so we can add it again\n", "# NOTE: This cell is hidden in the docs build\n", "es.relationships = []" ] }, { "cell_type": "markdown", "id": "317e5657", "metadata": {}, "source": [ "Alternatively, we can first create a `Relationship` and pass that to `EntitySet.add_relationship`. When defining a `Relationship` we need to pass in the EntitySet to which it belongs along with the names for the parent dataframe and parent column and the name of the child dataframe and child column." ] }, { "cell_type": "code", "execution_count": null, "id": "47e54c72", "metadata": {}, "outputs": [], "source": [ "relationship = ft.Relationship(\n", " entityset=es,\n", " parent_dataframe_name=\"orders\",\n", " parent_column_name=\"order_id\",\n", " child_dataframe_name=\"items\",\n", " child_column_name=\"order_id\",\n", ")\n", "es.add_relationship(relationship=relationship)" ] }, { "cell_type": "markdown", "id": "7a49ba91", "metadata": {}, "source": [ "### Updating data for a dataframe in an EntitySet\n", "\n", "Previously to update (replace) the data associated with an Entity, users could call `Entity.update_data` and pass in the new dataframe. As an example, let's update the data in our `users` Entity:\n", "```python\n", "new_users_df = pd.DataFrame({\n", " 'id': [3, 4],\n", " 'name': ['Anne Teak', 'Art Decco']\n", "})\n", "\n", "es['users'].update_data(df=new_users_df)\n", "```\n", "\n", "To accomplish this task with Featuretools 1.0, we will use the `EntitySet.replace_dataframe` method instead:" ] }, { "cell_type": "code", "execution_count": null, "id": "b45a81d5", "metadata": {}, "outputs": [], "source": [ "new_users_df = pd.DataFrame({\"id\": [0, 1], \"name\": [\"Anne Teak\", \"Art Decco\"]})\n", "\n", "es.replace_dataframe(dataframe_name=\"users\", df=new_users_df)\n", "es[\"users\"]" ] }, { "cell_type": "markdown", "id": "679af861", "metadata": {}, "source": [ "## Defining features\n", "\n", "The syntax for defining features has changed slightly in Featuretools 1.0. Previously, identity features could be defined simply by passing in the variable that should be used to build the feature.\n", "\n", "```python\n", "feature = ft.Feature(es['items']['item_price'])\n", "```\n", "\n", "Starting with Featuretools 1.0, a similar syntax can be used, but because `es['items']` will now return a Woodwork dataframe instead of an `Entity`, we need to update the syntax slightly to access the Woodwork column. To update, simply add `.ww` between the dataframe name selector and the column selector as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "88902f6b", "metadata": {}, "outputs": [], "source": [ "feature = ft.Feature(es[\"items\"].ww[\"item_price\"])" ] }, { "cell_type": "markdown", "id": "0faf41e4", "metadata": {}, "source": [ "## Defining primitives\n", "\n", "In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate `Variable` class. Starting in version 1.0, the input and return types are defined by Woodwork `ColumnSchema` objects. \n", "\n", "To illustrate this change, let's look closer at the `Age` transform primitive. This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person's age. In previous versions of Featuretools, the input type was defined by specifying the `DateOfBirth` variable type and the return type was specified by the `Numeric` variable type:\n", "\n", "```python\n", "input_types = [DateOfBirth]\n", "return_type = Numeric\n", "```\n", "\n", "Woodwork does not have a specific `DateOfBirth` logical type, but rather identifies a column as a date of birth column by specifying the logical type as `Datetime` with a semantic tag of `date_of_birth`. There is also no `Numeric` logical type in Woodwork, but rather Woodwork identifies all columns that can be used for numeric operations with the semantic tag of `numeric`. Furthermore, we know the `Age` primitive will return a floating point number, which would correspond to a Woodwork logical type of `Double`. With these items in mind, we can redefine the `Age` input types and return types with `ColumnSchema` objects as follows:\n", "\n", "```python\n", "input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]\n", "return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})\n", "```\n", "\n", "Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged." ] }, { "cell_type": "markdown", "id": "ebcd6d9e", "metadata": {}, "source": [ "### Mapping from old Featuretools variable types to Woodwork ColumnSchemas\n", "\n", "Types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. While there is not a direct mapping from the old variable types to the new Woodwork types defined by `ColumnSchema` objects, the approximate mapping is shown below.\n", "\n", "\n", "| Featuretools Variable | Woodwork Column Schema |\n", "| --- | --- |\n", "| Boolean | ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable) |\n", "| Categorical | ColumnSchema(logical_type=Categorical) |\n", "| CountryCode | ColumnSchema(logical_type=CountryCode) |\n", "| Datetime | ColumnSchema(logical_type=Datetime) |\n", "| DateOfBirth | ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'}) |\n", "| DatetimeTimeIndex | ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'}) |\n", "| Discrete | ColumnSchema(semantic_tags={'category'}) |\n", "| EmailAddress | ColumnSchema(logical_type=EmailAddress) |\n", "| FilePath | ColumnSchema(logical_type=Filepath) |\n", "| FullName | ColumnSchema(logical_type=PersonFullName) |\n", "| Id | ColumnSchema(semantic_tags={'foreign_key'}) |\n", "| Index | ColumnSchema(semantic_tags={'index'}) |\n", "| IPAddress | ColumnSchema(logical_type=IPAddress) |\n", "| LatLong | ColumnSchema(logical_type=LatLong) |\n", "| NaturalLanguage | ColumnSchema(logical_type=NaturalLanguage) |\n", "| Numeric | ColumnSchema(semantic_tags={'numeric'}) |\n", "| NumericTimeIndex | ColumnSchema(semantic_tags={'numeric', 'time_index'}) |\n", "| Ordinal | ColumnSchema(logical_type=Ordinal) |\n", "| PhoneNumber | ColumnSchema(logical_type=PhoneNumber) |\n", "| SubRegionCode | ColumnSchema(logical_type=SubRegionCode) |\n", "| Timedelta | ColumnSchema(logical_type=Timedelta) |\n", "| TimeIndex | ColumnSchema(semantic_tags={'time_index'}) |\n", "| URL | ColumnSchema(logical_type=URL) |\n", "| Unknown | ColumnSchema(logical_type=Unknown) |\n", "| ZIPCode | ColumnSchema(logical_type=PostalCode) |" ] }, { "cell_type": "markdown", "id": "fec87370", "metadata": {}, "source": [ "## Changes to Deep Feature Synthesis and Calculate Feature Matrix\n", "\n", "The argument names for both `featuretools.dfs` and `featuretools.calculate_feature_matrix` have changed slightly in Featuretools 1.0. In prior versions, users could generate a list of features using the default primitives and options like this:\n", "\n", "```python\n", "features = ft.dfs(entityset=es,\n", " target_entity='items',\n", " features_only=True)\n", "```\n", "\n", "In Featuretools 1.0, the `target_entity` argument has been renamed to `target_dataframe_name`, but otherwise this basic call remains the same.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5428949c", "metadata": {}, "outputs": [], "source": [ "features = ft.dfs(entityset=es, target_dataframe_name=\"items\", features_only=True)\n", "features" ] }, { "cell_type": "markdown", "id": "3154734d", "metadata": {}, "source": [ "In addition, the `dfs` argument `ignore_entities` was renamed to `ignore_dataframes` and `ignore_variables` was renamed to `ignore_columns`. Similarly, if specifying primitive options, all references to `entities` should be replaced with `dataframes` and references to `variables` should be replaced with columns. For example, the primitive option of `include_groupby_entities` is now `include_groupby_dataframes` and `include_variables` is now `include_columns`.\n", "\n", "The basic call to `featuretools.calculate_feature_matrix` remains unchanged if passing in an EntitySet along with a list of features to caluculate. However, users calling `calculate_feature_matrix` by passing in a list of `entities` and `relationships` should note that the `entities` argument has been renamed to `dataframes` and the values in the dictionary values should now include Woodwork logical types instead of Featuretools `Variable` classes." ] }, { "cell_type": "code", "execution_count": null, "id": "456da22e", "metadata": {}, "outputs": [], "source": [ "feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)\n", "feature_matrix" ] }, { "cell_type": "markdown", "id": "b87489cf", "metadata": {}, "source": [ "In addition to the changes in argument names, there are a couple other changes to the returned feature matrix that users should be aware of. First, because of slight differences in the way Woodwork defines column types compared to how the prior Featuretools implementation did, there can be some differences in the features that are generated between old and new versions. The most notable impact is in the way foreign key columns are handled. Previously, Featuretools treated all foreign key (previously `Id`) columns as categorical columns, and would generate appropriate features from these columns. Starting in version 1.0, foreign key columns are not constrained to be categorical, and if they are another type such as `Integer`, features will not be generated from these columns. Manually converting foreign key columns to `Categorical` as shown above will result in features much closer to those achieved with previous versions.\n", "\n", "Also, because Woodwork's type inference process differs from the previous Featuretools type inference process, an EntitySet may have column types identified differently. This difference in column types could impact the features that are generated. If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types if there are columns that have been inferred as unexpected types.\n", "\n", "Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace as follows." ] }, { "cell_type": "code", "execution_count": null, "id": "cdb45cc9", "metadata": {}, "outputs": [], "source": [ "feature_matrix.ww" ] }, { "cell_type": "markdown", "id": "68910d73", "metadata": {}, "source": [ "Featuretools now labels features by whether they were originally in the dataframes, or whether they were created by Featuretools. This information is stored in the Woodwork `origin` attribute for the column. Columns that were in the original data will be labeled with `base` and features that were created by Featuretools will be labeled with `engineered`.\n", "\n", "As a demonstration of how to access this information, let's compare two features in the feature matrix: `item_price` and `orders.MEAN(items.item_price)`. `item_price` was present in the original data, and `orders.MEAN(items.item_price)` was created by Featuretools." ] }, { "cell_type": "code", "execution_count": null, "id": "f3e143fe", "metadata": {}, "outputs": [], "source": [ "feature_matrix.ww[\"item_price\"].ww.origin" ] }, { "cell_type": "code", "execution_count": null, "id": "12cf8260", "metadata": {}, "outputs": [], "source": [ "feature_matrix.ww[\"orders.MEAN(items.item_price)\"].ww.origin" ] }, { "cell_type": "markdown", "id": "4c429c75", "metadata": {}, "source": [ "## Other changes\n", "\n", "In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 of which existing users should be aware.\n", "\n", "- Column ordering of an dataframe in an EntitySet might be different than it was before. Previously, Featuretools would reorder the columns such that the index column would always be the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column in the dataframe. Now the index column will remain in the position it was when the dataframe was added to the EntitySet.\n", "\n", "- For `LatLong` columns, older versions of Featuretools would replace single `nan` values in the columns with a tuple `(nan, nan)`. This is no longer the case, and single `nan` values will now remain in the `LatLong` column. Based on the behavior in Woodwork, any values of `(nan, nan)` in a `LatLong` column will be replaced with a single `nan` value.\n", "\n", "- Since Featuretools no longer defines `Variable` objects with relationships between them, the `featuretools.variable_types.graph_variable_types` function has been removed.\n", "\n", "- The `featuretools.variable_types.list_variable_types` utility function has been removed and replaced with two corresponding Woodwork functions: `woodwork.list_logical_types` and `woodwork.list_semantic_tags`. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information on the logical types and semantic tags that can be applied to dataframe columns." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 }