NOTICE

The upcoming release of Featuretools 1.0.0 contains several breaking changes. Users are encouraged to test this version prior to release by installing from GitHub:

pip install https://github.com/alteryx/featuretools/archive/woodwork-integration.zip

For details on migrating to the new version, refer to Transitioning to Featuretools Version 1.0. Please report any issues in the Featuretools GitHub repo or by messaging in Alteryx Open Source Slack.


featuretools.selection.remove_highly_correlated_features

featuretools.selection.remove_highly_correlated_features(feature_matrix, features=None, pct_corr_threshold=0.95, features_to_check=None, features_to_keep=None)

Removes columns in feature matrix that are highly correlated with another column.

Note

We make the assumption that, for a pair of features, the feature that is further right in the feature matrix produced by dfs is the more complex one. The assumption does not hold if the order of columns in the feature matrix has changed from what dfs produces.

Parameters
  • feature_matrix (pd.DataFrame) – DataFrame whose columns are feature names and rows are instances. If Woodwork is not initialized on the DataFrame, this function initializes it, which may result in slightly different types than those in the original feature matrix created by Featuretools.

  • features (list[featuretools.FeatureBase] or list[str], optional) – List of features to select.

  • pct_corr_threshold (float) – The correlation threshold to be considered highly correlated. Defaults to 0.95.

  • features_to_check (list[str], optional) – List of column names to check for highly correlated pairs. No other columns are checked, so only columns in this list can be removed. If None, all columns are checked.

  • features_to_keep (list[str], optional) – List of column names to keep even if they are correlated with another column. If None, all columns are candidates for removal.
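
The interaction between column order, pct_corr_threshold, features_to_check and features_to_keep can be illustrated with a small standalone sketch. This is not the library's implementation: sketch_remove_correlated and pearson are hypothetical helpers, a plain dict of lists stands in for the DataFrame, and the sketch assumes that the right-most (more complex) feature of a highly correlated pair is the one removed, as the note above implies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # sketch choice: treat constant columns as uncorrelated
    return cov / sqrt(var_x * var_y)


def sketch_remove_correlated(columns, threshold=0.95, to_check=None, to_keep=None):
    """Return the column names that survive, in their original order.

    columns: dict mapping column name -> list of values, ordered as dfs
    would order them (simpler features first, more complex ones further right).
    """
    names = list(columns)
    # Only columns listed in to_check are compared (all columns if None).
    check = [n for n in names if to_check is None or n in to_check]
    keep = set(to_keep or [])
    dropped = set()
    # Walk right to left and compare each column with the columns to its left.
    for i in range(len(check) - 1, 0, -1):
        right = check[i]
        for j in range(i - 1, -1, -1):
            left = check[j]
            if abs(pearson(columns[right], columns[left])) >= threshold:
                dropped.add(right)  # assumed: the right-most of the pair goes
                break
    # Columns in to_keep survive even if they were flagged as correlated.
    return [n for n in names if n in keep or n not in dropped]


columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with a and b
}

default_result = sketch_remove_correlated(columns)
kept_result = sketch_remove_correlated(columns, to_keep=["b"])
unchecked_result = sketch_remove_correlated(columns, to_check=["a", "c"])

# Same data with "b" placed first: now "a" is the right-most of the pair.
reordered = {"b": columns["b"], "a": columns["a"], "c": columns["c"]}
reordered_result = sketch_remove_correlated(reordered)
```

In this sketch, reordering the columns changes which member of the correlated pair survives, which is why the order produced by dfs should be left untouched.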

Returns

The feature matrix and the list of generated feature definitions, matching the output of dfs. If no feature list is provided as input, only the feature matrix is returned. For consistent results, do not change the order of the features output by dfs.

Return type

pd.DataFrame, list[FeatureBase]