Feature Selection#
Featuretools lets users remove features that are unlikely to be useful in building an effective machine learning model. Reducing the number of features in the feature matrix can both improve model results and reduce the computational cost of prediction.
Featuretools enables users to perform feature selection on the results of Deep Feature Synthesis with three functions:
ft.selection.remove_highly_null_features
ft.selection.remove_single_value_features
ft.selection.remove_highly_correlated_features
We will describe each of these functions in depth, but first we must create an EntitySet on which we can run ft.dfs.
[1]:
import pandas as pd
import featuretools as ft
from featuretools.selection import (
    remove_highly_correlated_features,
    remove_highly_null_features,
    remove_single_value_features,
)
from featuretools.demo.flight import load_flight
es = load_flight(nrows=50)
es
Downloading data ...
[1]:
Entityset: Flight Data
DataFrames:
trip_logs [Rows: 50, Columns: 21]
flights [Rows: 6, Columns: 9]
airlines [Rows: 1, Columns: 1]
airports [Rows: 4, Columns: 3]
Relationships:
trip_logs.flight_id -> flights.flight_id
flights.carrier -> airlines.carrier
flights.dest -> airports.dest
Remove Highly Null Features#
A dataset may contain columns with many null values. Deep Feature Synthesis can build features from those columns, producing even more highly null features. In that case, we may want to remove any feature whose proportion of null values exceeds a chosen threshold. The feature matrix below shows such a case:
[2]:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="trip_logs",
    cutoff_time=pd.DataFrame(
        {
            "trip_log_id": [30, 1, 2, 3, 4],
            "time": pd.to_datetime(["2016-09-22 00:00:00"] * 5),
        }
    ),
    trans_primitives=[],
    agg_primitives=[],
    max_depth=2,
)
fm
[2]:
trip_log_id | flight_id | dep_delay | taxi_out | taxi_in | arr_delay | diverted | air_time | distance | carrier_delay | weather_delay | national_airspace_delay | security_delay | late_aircraft_delay | canceled | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | NaN | NaN | NaN | NaN | <NA> | NaN | 600.0 | NaN | NaN | NaN | NaN | NaN | <NA> | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Looking at the feature matrix above, we decide to remove the highly null features:
[3]:
ft.selection.remove_highly_null_features(fm)
[3]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Notice that calling remove_highly_null_features didn’t remove every feature that contains a null value. By default, only features whose percentage of null values in the calculated feature matrix is above 95% are removed. To lower that threshold, we can set the pct_null_threshold parameter ourselves. In the example below, a threshold of 0.2 removes every feature: row 4 is entirely null, so each feature is at least 20% null.
[4]:
remove_highly_null_features(fm, pct_null_threshold=0.2)
[4]:
trip_log_id |
---|
30 |
1 |
2 |
3 |
4 |
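To make the threshold concrete, here is a plain-Python sketch of the idea. It is not Featuretools' implementation: the helper names `is_null`, `null_fraction`, and `drop_highly_null` are hypothetical, and the rule that a column is dropped once its null fraction reaches the threshold is an assumption chosen to match the outputs shown above, where distance (one null in five rows) survives the default threshold but not a threshold of 0.2.

```python
import math


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_fraction(values):
    """Fraction of entries in a column that are null."""
    return sum(is_null(v) for v in values) / len(values)


def drop_highly_null(columns, pct_null_threshold=0.95):
    """Keep columns whose null fraction is below the threshold (assumed rule)."""
    return {
        name: values
        for name, values in columns.items()
        if null_fraction(values) < pct_null_threshold
    }


nan = float("nan")
# Three columns copied from the feature matrix above (rows 30, 1, 2, 3, 4).
columns = {
    "distance": [600.0, 1773.0, 1773.0, 1773.0, nan],
    "dep_delay": [nan, nan, nan, nan, nan],
    "flights.carrier": ["AA", "AA", "AA", "AA", None],
}

kept_default = drop_highly_null(columns)
kept_strict = drop_highly_null(columns, pct_null_threshold=0.2)
print(sorted(kept_default))  # ['distance', 'flights.carrier']
print(sorted(kept_strict))   # []
```

Only the fully null dep_delay column is dropped at the default threshold, while the stricter threshold also drops the two columns whose single null comes from row 4.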
Remove Single Value Features#
Another situation we might run into is one where our calculated features don’t have any variance. In those cases, we are likely to want to remove the uninteresting features. For that, we use remove_single_value_features.
Let’s see what happens when we remove the single value features of the feature matrix below.
[5]:
fm
[5]:
trip_log_id | flight_id | dep_delay | taxi_out | taxi_in | arr_delay | diverted | air_time | distance | carrier_delay | weather_delay | national_airspace_delay | security_delay | late_aircraft_delay | canceled | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | NaN | NaN | NaN | NaN | <NA> | NaN | 600.0 | NaN | NaN | NaN | NaN | NaN | <NA> | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Note
A list of feature definitions such as those created by dfs can be provided to the feature selection functions. Doing this will change the outputs to include an updated list of feature definitions.
[6]:
new_fm, new_features = remove_single_value_features(fm, features=features)
new_fm
[6]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
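As a rough illustration of how a list of definitions can be kept in sync with a filtered matrix, the sketch below uses a hypothetical `FeatureDef` class and a hypothetical `filter_definitions` helper. Neither is part of Featuretools; real feature definitions are richer objects, and this only assumes that each definition can be matched to a column by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDef:
    """Hypothetical stand-in for a feature definition; it only carries a name."""

    name: str


def filter_definitions(kept_columns, features):
    """Return the definitions whose names survive in the filtered matrix."""
    kept = set(kept_columns)
    return [feature for feature in features if feature.name in kept]


features = [FeatureDef("flight_id"), FeatureDef("dep_delay"), FeatureDef("distance")]
kept_columns = ["flight_id", "distance"]  # columns left after selection

new_features = filter_definitions(kept_columns, features)
removed = set(features) - set(new_features)
print([f.name for f in new_features])   # ['flight_id', 'distance']
print(sorted(f.name for f in removed))  # ['dep_delay']
```

Because the stand-in class is frozen and therefore hashable, the removed definitions can be found with the same set difference used in this guide.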
Now that we have the feature definitions for the updated feature matrix, we can see which features were removed:
[7]:
set(features) - set(new_features)
[7]:
{<Feature: air_time>,
<Feature: arr_delay>,
<Feature: canceled>,
<Feature: carrier_delay>,
<Feature: dep_delay>,
<Feature: diverted>,
<Feature: flights.carrier>,
<Feature: flights.flight_num>,
<Feature: late_aircraft_delay>,
<Feature: national_airspace_delay>,
<Feature: security_delay>,
<Feature: taxi_in>,
<Feature: taxi_out>,
<Feature: weather_delay>}
With the function used as it is above, null values are not considered when counting a feature’s unique values. If we’d like to consider NaN its own value, we can set count_nan_as_value to True, and we’ll see flights.carrier and flights.flight_num back in the matrix.
[8]:
new_fm, new_features = remove_single_value_features(
    fm, features=features, count_nan_as_value=True
)
new_fm
[8]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The features that were removed are:
[9]:
set(features) - set(new_features)
[9]:
{<Feature: air_time>,
<Feature: arr_delay>,
<Feature: canceled>,
<Feature: carrier_delay>,
<Feature: dep_delay>,
<Feature: diverted>,
<Feature: late_aircraft_delay>,
<Feature: national_airspace_delay>,
<Feature: security_delay>,
<Feature: taxi_in>,
<Feature: taxi_out>,
<Feature: weather_delay>}
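The effect of count_nan_as_value can be sketched in plain Python. This is an illustration of the counting rule described above, not Featuretools' implementation, and the helper names `count_unique` and `drop_single_value` are hypothetical. A column is kept only if it has more than one distinct value, and null counts as a value only when requested.

```python
import math


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_unique(values, count_nan_as_value=False):
    """Count distinct non-null values; optionally count null as one more value."""
    non_null = {v for v in values if not is_null(v)}
    has_null = any(is_null(v) for v in values)
    return len(non_null) + (1 if count_nan_as_value and has_null else 0)


def drop_single_value(columns, count_nan_as_value=False):
    """Keep only columns with more than one distinct value."""
    return {
        name: values
        for name, values in columns.items()
        if count_unique(values, count_nan_as_value) > 1
    }


nan = float("nan")
# Three columns copied from the feature matrix above (rows 30, 1, 2, 3, 4).
columns = {
    "distance": [600.0, 1773.0, 1773.0, 1773.0, nan],
    "dep_delay": [nan, nan, nan, nan, nan],
    "flights.carrier": ["AA", "AA", "AA", "AA", None],
}

kept_default = drop_single_value(columns)
kept_with_nan = drop_single_value(columns, count_nan_as_value=True)
print(sorted(kept_default))   # ['distance']
print(sorted(kept_with_nan))  # ['distance', 'flights.carrier']
```

Under this rule flights.carrier returns once null is counted, since it then holds two values ("AA" and null), while the entirely null dep_delay still has only one and stays removed, which matches the two result sets shown above.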