Feature Selection
Featuretools lets users remove features that are unlikely to be useful in building an effective machine learning model. Reducing the number of features in the feature matrix can both improve the model’s results and reduce the computational cost of prediction.
Featuretools enables users to perform feature selection on the results of Deep Feature Synthesis with three functions:
ft.selection.remove_highly_null_features
ft.selection.remove_single_value_features
ft.selection.remove_highly_correlated_features
We will describe each of these functions in depth, but first we must create an entity set with which we can run ft.dfs.
[1]:
import pandas as pd
import featuretools as ft
from featuretools.demo.flight import load_flight
from featuretools.selection import (
    remove_highly_correlated_features,
    remove_highly_null_features,
    remove_single_value_features,
)
es = load_flight(nrows=50)
es
Downloading data ...
[1]:
Entityset: Flight Data
DataFrames:
trip_logs [Rows: 50, Columns: 21]
flights [Rows: 6, Columns: 9]
airlines [Rows: 1, Columns: 1]
airports [Rows: 4, Columns: 3]
Relationships:
trip_logs.flight_id -> flights.flight_id
flights.carrier -> airlines.carrier
flights.dest -> airports.dest
Remove Highly Null Features
We might have a dataset with columns that have many null values. Deep Feature Synthesis might build features off of those null columns, creating even more highly null features. In this case, we might want to remove any features whose null values pass a certain threshold. Below is our feature matrix with such a case:
[2]:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="trip_logs",
    cutoff_time=pd.DataFrame(
        {
            "trip_log_id": [30, 1, 2, 3, 4],
            "time": pd.to_datetime(["2016-09-22 00:00:00"] * 5),
        }
    ),
    trans_primitives=[],
    agg_primitives=[],
    max_depth=2,
)
fm
[2]:
trip_log_id | flight_id | dep_delay | taxi_out | taxi_in | arr_delay | diverted | air_time | distance | carrier_delay | weather_delay | national_airspace_delay | security_delay | late_aircraft_delay | canceled | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | NaN | NaN | NaN | NaN | <NA> | NaN | 600.0 | NaN | NaN | NaN | NaN | NaN | <NA> | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Looking at the feature matrix above, we decide to remove the highly null features.
[3]:
ft.selection.remove_highly_null_features(fm)
[3]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Notice that calling remove_highly_null_features didn’t remove every feature that contains a null value. By default, we only remove features where the percentage of null values in the calculated feature matrix is at least 95%. If we want to lower that threshold, we can set the pct_null_threshold parameter ourselves. With a threshold of 0.2, every feature is removed, because each one is null in at least one of the five rows.
[4]:
remove_highly_null_features(fm, pct_null_threshold=0.2)
[4]:
trip_log_id |
---|
30 |
1 |
2 |
3 |
4 |
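To make the threshold behavior concrete, here is a minimal plain-Python sketch of the rule. It is not the Featuretools implementation. It assumes a column is dropped once its null fraction reaches the threshold, which is consistent with the outputs above: the columns that survive the default call are null in one of five rows (20%), and pct_null_threshold=0.2 removes all of them. The helper names (is_null, null_fraction, drop_highly_null) and the toy columns dictionary are hypothetical.

```python
import math


def is_null(value):
    """Treat None and float NaN as null, as a stand-in for missing entries."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_fraction(values):
    """Fraction of entries in a column that are null."""
    return sum(is_null(v) for v in values) / len(values)


def drop_highly_null(columns, pct_null_threshold=0.95):
    """Keep only columns whose null fraction is below the threshold (assumed rule)."""
    return {
        name: values
        for name, values in columns.items()
        if null_fraction(values) < pct_null_threshold
    }


# Toy stand-in for three columns of the feature matrix shown earlier.
columns = {
    "flight_id": ["AA-494:RSW->CLT", "AA-494:CLT->PHX", "AA-494:CLT->PHX",
                  "AA-494:CLT->PHX", None],
    "distance": [600.0, 1773.0, 1773.0, 1773.0, float("nan")],
    "dep_delay": [float("nan")] * 5,
}

default_kept = drop_highly_null(columns)
strict_kept = drop_highly_null(columns, pct_null_threshold=0.2)
print(sorted(default_kept))  # columns that survive the default threshold
print(sorted(strict_kept))   # columns that survive a 0.2 threshold
```

In this sketch the fully null dep_delay column is the only one dropped at the default threshold, while the 0.2 threshold leaves nothing, mirroring the empty feature matrix in the output above.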
Remove Single Value Features
Another situation we might run into is one where our calculated features don’t have any variance. In those cases, we are likely to want to remove the uninteresting features. For that, we use remove_single_value_features.
Let’s see what happens when we remove the single value features of the feature matrix below.
[5]:
fm
[5]:
trip_log_id | flight_id | dep_delay | taxi_out | taxi_in | arr_delay | diverted | air_time | distance | carrier_delay | weather_delay | national_airspace_delay | security_delay | late_aircraft_delay | canceled | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | NaN | NaN | NaN | NaN | <NA> | NaN | 600.0 | NaN | NaN | NaN | NaN | NaN | <NA> | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | NaN | NaN | NaN | NaN | <NA> | NaN | 1773.0 | NaN | NaN | NaN | NaN | NaN | <NA> | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Note
A list of feature definitions such as those created by dfs can be provided to the feature selection functions. Doing this will change the outputs to include an updated list of feature definitions.
[6]:
new_fm, new_features = remove_single_value_features(fm, features=features)
new_fm
[6]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Now that we have the feature definitions for the updated feature matrix, we can see which features were removed:
[7]:
set(features) - set(new_features)
[7]:
{<Feature: air_time>,
<Feature: arr_delay>,
<Feature: canceled>,
<Feature: carrier_delay>,
<Feature: dep_delay>,
<Feature: diverted>,
<Feature: flights.carrier>,
<Feature: flights.flight_num>,
<Feature: late_aircraft_delay>,
<Feature: national_airspace_delay>,
<Feature: security_delay>,
<Feature: taxi_in>,
<Feature: taxi_out>,
<Feature: weather_delay>}
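The set difference above works because the selection function returns a feature list that matches the columns it kept. The following plain-Python sketch illustrates that idea under one assumption this page does not state explicitly: that feature definitions are matched to feature matrix columns by name. FeatureStub and select_features are hypothetical stand-ins, not Featuretools classes or functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureStub:
    """Hypothetical stand-in for a feature definition; it only carries a name."""

    name: str

    def get_name(self):
        return self.name


def select_features(features, kept_columns):
    """Keep the feature definitions whose names appear among the kept columns."""
    kept = set(kept_columns)
    return [f for f in features if f.get_name() in kept]


features = [
    FeatureStub(n) for n in ["flight_id", "dep_delay", "distance", "flights.carrier"]
]
kept_columns = ["flight_id", "distance"]

new_features = select_features(features, kept_columns)

# Same set-difference idiom as in the cell above.
removed = set(features) - set(new_features)
print(sorted(f.get_name() for f in removed))
```

Keeping the definitions aligned with the matrix matters when the selected features are later recomputed on new data, since only the surviving definitions need to be calculated.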
When remove_single_value_features is called with its defaults, as above, null values are not considered when counting a feature’s unique values. If we’d like to consider NaN its own value, we can set count_nan_as_value to True, and we’ll see flights.carrier and flights.flight_num back in the matrix.
[8]:
new_fm, new_features = remove_single_value_features(
    fm, features=features, count_nan_as_value=True
)
new_fm
[8]:
trip_log_id | flight_id | distance | flights.origin | flights.origin_city | flights.origin_state | flights.dest | flights.distance_group | flights.carrier | flights.flight_num | flights.airports.dest_city | flights.airports.dest_state |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | AA-494:RSW->CLT | 600.0 | RSW | Fort Myers, FL | FL | CLT | 3 | AA | 494 | Charlotte, NC | NC |
1 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
2 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
3 | AA-494:CLT->PHX | 1773.0 | CLT | Charlotte, NC | NC | PHX | 8 | AA | 494 | Phoenix, AZ | AZ |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The features that were removed are:
[9]:
set(features) - set(new_features)
[9]:
{<Feature: air_time>,
<Feature: arr_delay>,
<Feature: canceled>,
<Feature: carrier_delay>,
<Feature: dep_delay>,
<Feature: diverted>,
<Feature: late_aircraft_delay>,
<Feature: national_airspace_delay>,
<Feature: security_delay>,
<Feature: taxi_in>,
<Feature: taxi_out>,
<Feature: weather_delay>}
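The effect of count_nan_as_value can be illustrated with a small plain-Python sketch of the counting rule. This is a simplified stand-in rather than the library’s code: it assumes a feature is kept when it has more than one unique value, and that nulls add one extra value only when count_nan_as_value is True. The helper names and the toy columns dictionary are hypothetical.

```python
import math


def is_null(value):
    """Treat None and float NaN as null, as a stand-in for missing entries."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_unique(values, count_nan_as_value=False):
    """Count distinct non-null values; optionally count null as one more value."""
    non_null = {v for v in values if not is_null(v)}
    has_null = any(is_null(v) for v in values)
    return len(non_null) + (1 if count_nan_as_value and has_null else 0)


def drop_single_value(columns, count_nan_as_value=False):
    """Keep columns with more than one unique value under the chosen rule."""
    return {
        name: values
        for name, values in columns.items()
        if count_unique(values, count_nan_as_value) > 1
    }


# Toy stand-in for three columns of the feature matrix shown earlier.
columns = {
    "flights.carrier": ["AA", "AA", "AA", "AA", None],
    "flights.dest": ["CLT", "PHX", "PHX", "PHX", None],
    "dep_delay": [float("nan")] * 5,
}

ignoring_nulls = drop_single_value(columns)
counting_nulls = drop_single_value(columns, count_nan_as_value=True)
print(sorted(ignoring_nulls))
print(sorted(counting_nulls))
```

Under these assumptions flights.carrier is dropped when nulls are ignored, because "AA" is its only value, and kept when the null counts as a second value. The fully null dep_delay column is dropped either way, which matches the removed-feature lists above.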