featuretools.calculate_feature_matrix
- featuretools.calculate_feature_matrix(features, entityset=None, cutoff_time=None, instance_ids=None, dataframes=None, relationships=None, cutoff_time_in_index=False, training_window=None, approximate=None, save_progress=None, verbose=False, chunk_size=None, n_jobs=1, dask_kwargs=None, progress_callback=None, include_cutoff_time=True)
Calculates a matrix for a given set of instance ids and calculation times.
- Parameters:
features (list[FeatureBase]) – Feature definitions to be calculated.
entityset (EntitySet) – An already initialized entityset. Required if dataframes and relationships are not provided.
cutoff_time (pd.DataFrame or Datetime) – Specifies times at which to calculate the features for each instance. The resulting feature matrix will use data up to and including the cutoff_time. Can either be a DataFrame or a single value. If a DataFrame is passed the instance ids for which to calculate features must be in a column with the same name as the target dataframe index or a column named instance_id. The cutoff time values in the DataFrame must be in a column with the same name as the target dataframe time index or a column named time. If the DataFrame has more than two columns, any additional columns will be added to the resulting feature matrix. If a single value is passed, this value will be used for all instances.
instance_ids (list) – List of instances to calculate features on. Only used if cutoff_time is a single datetime.
dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]) – Dictionary of DataFrames. Entries take the format {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}. Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters will be ignored.
relationships (list[(str, str, str, str)]) – list of relationships between dataframes. List items are a tuple with the format (parent dataframe name, parent column, child dataframe name, child column).
cutoff_time_in_index (bool) – If True, return a DataFrame with a MultiIndex where the second index is the cutoff time (first is instance id). DataFrame will be sorted by (time, instance_id).
training_window (Timedelta or str, optional) – Window defining how much time before the cutoff time data can be used when calculating features. If None, all data before cutoff time is used. Defaults to None.
approximate (Timedelta or str) – Frequency to group instances with similar cutoff times by for features with costly calculations. For example, if bucket is 24 hours, all instances with cutoff times on the same day will use the same calculation for expensive features.
verbose (bool, optional) – Print progress info. The time granularity is per chunk.
chunk_size (int or float or None) – Maximum number of rows of the output feature matrix to calculate at a time. If passed an integer greater than 0, will try to use that many rows per chunk. If passed a float value between 0 and 1, sets the chunk size to that fraction of all rows. If None and n_jobs > 1, it will be set to 1/n_jobs.
n_jobs (int, optional) – number of parallel processes to use when calculating feature matrix.
dask_kwargs (dict, optional) –
Dictionary of keyword arguments to be passed when creating the dask client and scheduler. Even if n_jobs is not set, using dask_kwargs will enable multiprocessing. Main parameters:
- cluster (str or dask.distributed.LocalCluster):
cluster or address of cluster to send tasks to. If unspecified, a cluster will be created.
- diagnostics_port (int):
port number to use for web dashboard. If left unspecified, web interface will not be enabled.
Valid keyword arguments for LocalCluster will also be accepted.
save_progress (str, optional) – path to save intermediate computational results.
progress_callback (callable) –
function to be called with incremental progress updates. Has the following parameters:
- update: percentage change (float between 0 and 100) in progress since last call
- progress_percent: percentage (float between 0 and 100) of total computation completed
- time_elapsed: total time in seconds that has elapsed since start of call
include_cutoff_time (bool) – Include data at cutoff times in feature calculations. Defaults to True.
- Returns:
The feature matrix.
- Return type:
pd.DataFrame
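Example

The following is a minimal sketch of how the cutoff_time and progress_callback parameters described above might be used together. It assumes that `features` and `entityset` have already been built elsewhere; the helper names `CUTOFF_ROWS`, `log_progress`, and `build_feature_matrix`, as well as the extra `label` column, are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Rows for the cutoff_time DataFrame. "instance_id" and "time" are the
# documented fallback column names; "label" is a hypothetical extra column
# that, per the cutoff_time description, is added to the feature matrix.
CUTOFF_ROWS = [
    {"instance_id": 1, "time": datetime(2024, 1, 10), "label": True},
    {"instance_id": 2, "time": datetime(2024, 1, 15), "label": False},
]

# Collected (update, progress_percent, time_elapsed) tuples.
progress_log = []


def log_progress(update, progress_percent, time_elapsed):
    """Record one progress update using the documented callback parameters."""
    progress_log.append((update, progress_percent, time_elapsed))


def build_feature_matrix(features, entityset):
    """Calculate each instance's features at its own cutoff time.

    `features` and `entityset` are assumed to exist already.
    """
    # Imported lazily so the helpers above work without these packages.
    import pandas as pd
    import featuretools as ft

    cutoff_time = pd.DataFrame(CUTOFF_ROWS)
    return ft.calculate_feature_matrix(
        features,
        entityset=entityset,
        cutoff_time=cutoff_time,
        cutoff_time_in_index=True,  # MultiIndex of (instance id, cutoff time)
        include_cutoff_time=True,   # use data up to and including each cutoff
        progress_callback=log_progress,
    )
```

Calling `build_feature_matrix(features, entityset)` should return a pd.DataFrame with one row per entry in `CUTOFF_ROWS`, and `progress_log` can be inspected afterwards to see how the computation advanced.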