API Reference#

Demo Datasets#

load_retail([id, nrows, return_single_table])

Returns the retail entityset example.

load_mock_customer([n_customers, ...])

Return dataframes of mock customer data

load_flight([month_filter, ...])

Download, clean, and filter flight data from 2017.

load_weather([nrows, return_single_table])

Load the Australian daily-min-temperatures weather dataset.

Deep Feature Synthesis#

dfs([dataframes, relationships, entityset, ...])

Calculates a feature matrix and features given a dictionary of dataframes and a list of relationships.

get_valid_primitives(entityset, ...[, ...])

Returns two lists of primitives (transform and aggregation) containing primitives that can be applied to the specific target dataframe to create features.

Wrappers#

scikit-learn (BETA)#

wrappers.DFSTransformer([...])

Transformer using Scikit-Learn interface for Pipeline uses.

Timedelta#

Timedelta(value[, unit, delta_obj])

Represents differences in time.

Time utils#

make_temporal_cutoffs(instance_ids, cutoffs)

Makes a set of equally spaced cutoff times prior to a set of input cutoffs and instance ids.
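The idea of "equally spaced cutoff times prior to a cutoff" can be illustrated without the library. The following is a minimal standard-library sketch, not the implementation of `make_temporal_cutoffs`; the helper name `spaced_cutoffs` and its parameters `num_windows` and `window_size` are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def spaced_cutoffs(cutoff, num_windows, window_size):
    """Hypothetical helper: return `num_windows` times ending at `cutoff`,
    each separated by `window_size`, in ascending order."""
    return [cutoff - window_size * i for i in reversed(range(num_windows))]


# Three cutoffs, two days apart, ending at the input cutoff.
cutoffs = spaced_cutoffs(datetime(2024, 1, 10), num_windows=3,
                         window_size=timedelta(days=2))
for c in cutoffs:
    print(c.date())
```

In the library, this kind of expansion is applied per instance id, so each id ends up with its own series of cutoff times.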

Feature Primitives#

Primitive Types#

TransformPrimitive()

Feature for a dataframe that is based on one or more other features in that dataframe.

AggregationPrimitive()

Aggregation Primitives#

All()

Calculates if all values are 'True' in a list.

Any()

Determines if any value is 'True' in a list.

AvgTimeBetween([unit])

Computes the average number of seconds between consecutive events.
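As a rough illustration of what "average number of seconds between consecutive events" means, here is a plain-Python sketch. It is not the primitive's implementation; the function name `avg_seconds_between` is hypothetical, and returning NaN for fewer than two events is an assumption.

```python
from datetime import datetime


def avg_seconds_between(times):
    """Hypothetical helper: mean gap, in seconds, between sorted timestamps."""
    ordered = sorted(times)
    if len(ordered) < 2:
        return float("nan")  # assumption: no gap can be computed
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


events = [
    datetime(2024, 1, 1, 0, 0, 0),
    datetime(2024, 1, 1, 0, 0, 30),
    datetime(2024, 1, 1, 0, 2, 0),
]
avg_gap = avg_seconds_between(events)  # gaps of 30 s and 90 s
print(avg_gap)
```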

Count()

Determines the total number of values, excluding NaN.

CountAboveMean([skipna])

Calculates the number of values that are above the mean.

CountBelowMean([skipna])

Determines the number of values that are below the mean.

CountGreaterThan([threshold])

Determines the number of values greater than a controllable threshold.

CountInsideNthSTD([n])

Determines the count of observations that lie inside the first n standard deviations.

CountInsideRange([lower, upper, skipna])

Determines the number of values that fall within a certain range.

CountLessThan([threshold])

Determines the number of values less than a controllable threshold.

CountOutsideNthSTD([n])

Determines the number of observations that lie outside the first n standard deviations.
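The pair CountInsideNthSTD and CountOutsideNthSTD can be pictured as counting values by their distance from the mean, measured in standard deviations. The sketch below uses only the standard library and is not the library's code; the function names are hypothetical, and both the use of the sample standard deviation and the inclusive boundary are assumptions.

```python
import statistics


def count_inside_nth_std(values, n=1):
    """Hypothetical helper: count values within n standard deviations of the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    return sum(1 for v in values if abs(v - mean) <= n * std)


def count_outside_nth_std(values, n=1):
    """Hypothetical helper: complement of count_inside_nth_std."""
    return len(values) - count_inside_nth_std(values, n)


data = [1, 2, 3, 4, 100]
print(count_inside_nth_std(data, n=1), count_outside_nth_std(data, n=1))
```

With this data, the single large value sits more than one standard deviation from the mean, so it is the only observation counted as outside for `n=1`.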

CountOutsideRange([lower, upper, skipna])

Determines the number of values that fall outside a certain range.

Entropy([dropna, base])

Calculates the entropy for a categorical column
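For readers unfamiliar with entropy over categories, the following standard-library sketch computes Shannon entropy from value frequencies. It is an illustration under stated assumptions rather than the primitive itself: the name `shannon_entropy` is hypothetical, the natural logarithm is assumed when `base` is not given, and missing-value handling (the `dropna` option) is not modeled.

```python
import math
from collections import Counter


def shannon_entropy(values, base=None):
    """Hypothetical helper: Shannon entropy of the empirical distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    # Convert from nats to the requested base, if any.
    return h / math.log(base) if base else h


balanced = shannon_entropy(["a", "b", "a", "b"], base=2)
constant = shannon_entropy(["a", "a", "a"])
print(balanced, constant)
```

A column with two equally likely categories carries one bit of entropy, while a constant column carries none.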

First()

Determines the first value in a list.

Last()

Determines the last value in a list.

Max()

Calculates the highest value, ignoring NaN values.

MaxConsecutiveFalse()

Determines the maximum number of consecutive False values in the input

MaxConsecutiveNegatives([skipna])

Determines the maximum number of consecutive negative values in the input

MaxConsecutivePositives([skipna])

Determines the maximum number of consecutive positive values in the input

MaxConsecutiveTrue()

Determines the maximum number of consecutive True values in the input

MaxConsecutiveZeros([skipna])

Determines the maximum number of consecutive zero values in the input

Mean([skipna])

Computes the average for a list of values.

Median()

Determines the middlemost number in a list of values.

Min()

Calculates the smallest value, ignoring NaN values.

Mode()

Determines the most commonly repeated value.

NMostCommon([n])

Determines the n most common elements.

NumConsecutiveGreaterMean([skipna])

Determines the length of the longest subsequence above the mean.

NumConsecutiveLessMean([skipna])

Determines the length of the longest subsequence below the mean.

NumTrue()

Counts the number of True values.

NumUnique()

Determines the number of distinct values, ignoring NaN values.

PercentTrue()

Determines the percent of True values.

Skew()

Computes the extent to which a distribution differs from a normal distribution.

Std()

Computes the dispersion relative to the mean value, ignoring NaN.

Sum()

Calculates the total addition, ignoring NaN.

TimeSinceFirst([unit])

Calculates the time elapsed since the first datetime (default in seconds).

TimeSinceLast([unit])

Calculates the time elapsed since the last datetime (default in seconds).

TimeSinceLastFalse()

Calculates the time since the last False value.

TimeSinceLastMax()

Calculates the time since the maximum value occurred.

TimeSinceLastMin()

Calculates the time since the minimum value occurred.

TimeSinceLastTrue()

Calculates the time since the last True value.

Trend()

Calculates the trend of a column over time.
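One common way to quantify a trend is the slope of a least-squares line fitted to values against time. The sketch below shows that calculation in plain Python; treating the trend as a regression slope, the function name `linear_trend`, and the use of numeric `xs` in place of datetimes are assumptions made for illustration.

```python
def linear_trend(xs, ys):
    """Hypothetical helper: slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Values rise by 2 for every unit of time.
slope = linear_trend([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])
print(slope)
```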

Transform Primitives#

Binary Transform Primitives#

AddNumeric()

Performs element-wise addition of two lists.

AddNumericScalar([value])

Adds a scalar to each value in the list.

DivideByFeature([value])

Divides a scalar by each value in the list.

DivideNumericScalar([value])

Divides each element in the list by a scalar.

Equal()

Determines if values in one list are equal to another list.

EqualScalar([value])

Determines if values in a list are equal to a given scalar.

GreaterThan()

Determines if values in one list are greater than another list.

GreaterThanEqualTo()

Determines if values in one list are greater than or equal to another list.

GreaterThanEqualToScalar([value])

Determines if values are greater than or equal to a given scalar.

GreaterThanScalar([value])

Determines if values are greater than a given scalar.

LessThan()

Determines if values in one list are less than another list.

LessThanEqualTo()

Determines if values in one list are less than or equal to another list.

LessThanEqualToScalar([value])

Determines if values are less than or equal to a given scalar.

LessThanScalar([value])

Determines if values are less than a given scalar.

ModuloByFeature([value])

Computes the modulo of a scalar by each element in a list.

ModuloNumeric()

Performs element-wise modulo of two lists.

ModuloNumericScalar([value])

Computes the modulo of each element in the list by a given scalar.

MultiplyBoolean()

Performs element-wise multiplication of two lists of boolean values.

MultiplyNumericBoolean()

Performs element-wise multiplication of a numeric list with a boolean list.

MultiplyNumericScalar([value])

Multiplies each element in the list by a scalar.

NotEqual()

Determines if values in one list are not equal to another list.

NotEqualScalar([value])

Determines if values in a list are not equal to a given scalar.

ScalarSubtractNumericFeature([value])

Subtracts each value in the list from a given scalar.

SubtractNumeric([commutative])

Performs element-wise subtraction of two lists.

SubtractNumericScalar([value])

Subtracts a scalar from each element in the list.

Combine features#

IsIn([list_of_outputs])

Determines whether a value is present in a provided list.

And()

Performs element-wise logical AND of two lists.

Or()

Performs element-wise logical OR of two lists.

Not()

Negates a boolean value.

Cumulative Transform Primitives#

Diff([periods])

Computes the difference between the value in a list and the previous value in that list.
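The difference-from-previous calculation can be sketched in a few lines of plain Python. This is an illustration only: the function name `diff_previous` is hypothetical, and using `None` for positions that have no earlier value is an assumption (a dataframe library would typically use NaN).

```python
def diff_previous(values, periods=1):
    """Hypothetical helper: value minus the value `periods` positions earlier."""
    return [
        None if i < periods else values[i] - values[i - periods]
        for i in range(len(values))
    ]


print(diff_previous([1, 4, 9, 16]))
print(diff_previous([1, 4, 9, 16], periods=2))
```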

DiffDatetime([periods])

Computes the timedelta between a datetime in a list and the previous datetime in that list.

TimeSincePrevious([unit])

Computes the time since the previous entry in a list.

CumCount()

Calculates the cumulative count.

CumSum()

Calculates the cumulative sum.

CumMean()

Calculates the cumulative mean.

CumMin()

Calculates the cumulative minimum.

CumMax()

Calculates the cumulative maximum.
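The cumulative primitives above all follow the same running-aggregate pattern. As a hedged illustration using only the standard library (not the library's own code), `itertools.accumulate` reproduces each one on a small list; the variable names are chosen here for illustration.

```python
from itertools import accumulate

values = [3, 1, 4, 1, 5]

cum_count = list(range(1, len(values) + 1))           # running number of values seen
cum_sum = list(accumulate(values))                    # running total
cum_min = list(accumulate(values, min))               # smallest value so far
cum_max = list(accumulate(values, max))               # largest value so far
cum_mean = [s / c for s, c in zip(cum_sum, cum_count)]  # running average

print(cum_sum, cum_min, cum_max, cum_mean)
```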

Datetime Transform Primitives#

Age()

Calculates the age in years as a floating point number given a date of birth.

DateToHoliday([country])

Transforms time of an instance into the holiday name, if there is one.

DateToTimeZone()

Determines the timezone of a datetime.

Day()

Determines the day of the month from a datetime.

DayOfYear()

Determines the ordinal day of the year from the given datetime

DaysInMonth()

Determines the number of days in the month of a given datetime.

DistanceToHoliday([holiday, country])

Computes the number of days before or after a given holiday.

Hour()

Determines the hour value of a datetime.

IsFederalHoliday([country])

Determines if a given datetime is a federal holiday.

IsLeapYear()

Determines the is_leap_year attribute of a datetime column.

IsLunchTime([lunch_hour])

Determines if a datetime falls during configurable lunch hour, on a 24-hour clock.

IsMonthEnd()

Determines the is_month_end attribute of a datetime column.

IsMonthStart()

Determines the is_month_start attribute of a datetime column.

IsQuarterEnd()

Determines the is_quarter_end attribute of a datetime column.

IsQuarterStart()

Determines the is_quarter_start attribute of a datetime column.

IsWeekend()

Determines if a date falls on a weekend.

IsWorkingHours([start_hour, end_hour])

Determines if a datetime falls during working hours on a 24-hour clock.

IsYearEnd()

Determines if a date falls on the end of a year.

IsYearStart()

Determines if a date falls on the start of a year.

Minute()

Determines the minutes value of a datetime.

Month()

Determines the month value of a datetime.

PartOfDay()

Determines the part of day of a datetime.

Quarter()

Determines the quarter a datetime column falls into (1, 2, 3, 4)

Season()

Determines the season of a given datetime.

Second()

Determines the seconds value of a datetime.

Week()

Determines the week of the year from a datetime.

Weekday()

Determines the day of the week from a datetime.

Year()

Determines the year value of a datetime.

Email and URL Transform Primitives#

EmailAddressToDomain()

Determines the domain of an email

IsFreeEmailDomain()

Determines if an email address is from a free email domain.

URLToDomain()

Determines the domain of a url.

URLToProtocol()

Determines the protocol (http or https) of a url.

URLToTLD()

Determines the top level domain of a url.
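To make the three URL primitives concrete, the sketch below pulls the protocol, domain, and top-level domain out of a URL with `urllib.parse`. It is a simplified illustration, not the library's parsing logic: the function name `url_parts` is hypothetical, and taking the text after the last dot as the top-level domain is an assumption that ignores multi-part suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def url_parts(url):
    """Hypothetical helper: split a URL into protocol, domain, and a naive TLD."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Naive TLD: the label after the last dot, if there is one.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {"protocol": parsed.scheme, "domain": host, "tld": tld}


parts = url_parts("https://www.example.com/path?q=1")
print(parts)
```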

Exponential Transform Primitives#

ExponentialWeightedAverage([com, span, ...])

Computes the exponentially weighted moving average for a series of numbers

ExponentialWeightedSTD([com, span, ...])

Computes the exponentially weighted moving standard deviation for a series of numbers

ExponentialWeightedVariance([com, span, ...])

Computes the exponentially weighted moving variance for a series of numbers

General Transform Primitives#

AbsoluteDiff([method, limit])

Calculates the absolute difference from the previous element

Absolute()

Computes the absolute value of a number.

Cosine()

Computes the cosine of a number.

IsNull()

Determines if a value is null.

NaturalLogarithm()

Computes the natural logarithm of a number.

Negate()

Negates a numeric value.

Percentile()

Determines the percentile rank for each value in a list.

RateOfChange()

Computes the rate of change of a value per second.

SameAsPrevious([fill_method, limit])

Determines if a value is equal to the previous value in a list.

Sine()

Computes the sine of a number.

SquareRoot()

Computes the square root of a number.

Tangent()

Computes the tangent of a number.

Variance()

Calculates the variance of a list of numbers.

Location Transform Primitives#

CityblockDistance([unit])

Calculates the distance between points in a city road grid.

GeoMidpoint()

Determines the geographic center of two coordinates.

Haversine([unit])

Calculates the approximate haversine distance between two LatLong columns.
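The haversine formula gives the great-circle distance between two latitude/longitude points on a sphere. The following standard-library sketch shows the formula itself; the function name `haversine_km`, the (latitude, longitude) tuple order, and the mean Earth radius of 6371 km are assumptions for illustration, and the primitive's own default unit may differ.

```python
import math


def haversine_km(p1, p2, radius_km=6371.0):
    """Hypothetical helper: great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Two antipodal points on the equator are half a circumference apart.
half_circumference = haversine_km((0.0, 0.0), (0.0, 180.0))
print(round(half_circumference, 2))
```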

IsInGeoBox([point1, point2])

Determines if coordinates are inside a box defined by two corner coordinate points.

Latitude()

Returns the first tuple value in a list of LatLong tuples.

Longitude()

Returns the second tuple value in a list of LatLong tuples.

NaturalLanguage Transform Primitives#

CountString([string, ignore_case, ...])

Determines how many times a given string shows up in a text field.

MeanCharactersPerWord()

Determines the mean number of characters per word.

MedianWordLength([delimiters_regex])

Determines the median word length.

NumCharacters()

Calculates the number of characters in a given string, including whitespace and punctuation.

NumUniqueSeparators([separators])

Calculates the number of unique separators.

NumWords()

Determines the number of words in a string.

NumberOfCommonWords([word_set, delimiters_regex])

Determines the number of common words in a string.

NumberOfHashtags()

Determines the number of hashtags in a string.

NumberOfMentions()

Determines the number of mentions in a string.

NumberOfUniqueWords([case_insensitive])

Determines the number of unique words in a string.

NumberOfWordsInQuotes([quote_type])

Determines the number of words in quotes in a string.

PunctuationCount()

Determines number of punctuation characters in a string.

TitleWordCount()

Determines the number of title words in a string.

TotalWordLength([do_not_count])

Determines the total word length.

UpperCaseCount()

Calculates the number of upper case letters in text.

UpperCaseWordCount()

Determines the number of words in a string that are entirely capitalized.

WhitespaceCount()

Calculates number of whitespaces in a string.

Postal Code Primitives#

OneDigitPostalCode()

Returns the one digit prefix of a given postal code.

TwoDigitPostalCode()

Returns the two digit prefix of a given postal code.

Time Series Transform Primitives#

ExpandingCount([gap, min_periods])

Computes the expanding count of events over a given window.

ExpandingMax([gap, min_periods])

Computes the expanding maximum of events over a given window.

ExpandingMean([gap, min_periods])

Computes the expanding mean of events over a given window.

ExpandingMin([gap, min_periods])

Computes the expanding minimum of events over a given window.

ExpandingSTD([gap, min_periods])

Computes the expanding standard deviation for events over a given window.

ExpandingTrend([gap, min_periods])

Computes the expanding trend for events over a given window.

Lag([periods])

Shifts an array of values by a specified number of periods.

RollingCount([window_length, gap, min_periods])

Determines a rolling count of events over a given window.

RollingMax([window_length, gap, min_periods])

Determines the maximum of entries over a given window.

RollingMean([window_length, gap, min_periods])

Calculates the mean of entries over a given window.
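The interaction of `window_length`, `gap`, and `min_periods` is easier to see on a small list. The sketch below is one plausible row-based reading of those parameters, not the library's implementation: it assumes `gap` is the number of rows between the current row and the end of the window (so `gap=0` includes the current row), and the function name `rolling_mean` and its defaults are hypothetical.

```python
def rolling_mean(values, window_length=3, gap=1, min_periods=1):
    """Hypothetical helper: mean over a window that ends `gap` rows before each row."""
    out = []
    for i in range(len(values)):
        end = i - gap + 1                      # exclusive end of the window
        start = max(0, end - window_length)
        window = values[start:end] if end > 0 else []
        if window and len(window) >= min_periods:
            out.append(sum(window) / len(window))
        else:
            out.append(None)                   # not enough observations yet
    return out


series = [1, 2, 3, 4, 5]
print(rolling_mean(series, window_length=2, gap=1))
print(rolling_mean(series, window_length=2, gap=0))
```

A non-zero gap keeps the current row out of its own window, which is the usual way to avoid leaking the target value into a time series feature.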

RollingMin([window_length, gap, min_periods])

Determines the minimum of entries over a given window.

RollingOutlierCount([window_length, gap, ...])

Determines how many values are outliers over a given window.

RollingSTD([window_length, gap, min_periods])

Calculates the standard deviation of entries over a given window.

RollingTrend([window_length, gap, min_periods])

Calculates the trend of a given window of entries of a column over time.

Natural Language Processing Primitives#

Natural Language Processing primitives create features for textual data. For more information on how to use and install these primitives, see here.

Primitives in standard install#

DiversityScore()

Calculates the overall complexity of the text based on the total number of words used in the text.

LSA([random_seed, corpus, algorithm])

Calculates the Latent Semantic Analysis Values of NaturalLanguage Input

PartOfSpeechCount()

Calculates the occurrences of each different part of speech.

PolarityScore()

Calculates the polarity of a text on a scale from -1 (negative) to 1 (positive)

StopwordCount()

Determines number of stopwords in a string.

Primitives that require installing tensorflow#

Elmo()

Transforms a sentence or short paragraph using deep contextualized language representations.

UniversalSentenceEncoder()

Transforms a sentence or short paragraph to a vector using [tfhub model](https://tfhub.dev/google/universal-sentence-encoder/2)

Feature methods#

FeatureBase.rename(name)

Rename Feature, returns copy.

FeatureBase.get_depth([stop_at])

Returns depth of feature

Feature calculation#

calculate_feature_matrix(features[, ...])

Calculates a matrix for a given set of instance ids and calculation times.

Feature descriptions#

describe_feature(feature[, ...])

Generates an English language description of a feature.

Feature visualization#

graph_feature(feature[, to_file, description])

Generates a feature lineage graph for the given feature

Feature encoding#

encode_features(feature_matrix, features[, ...])

Encode categorical features

Feature Selection#

remove_low_information_features(feature_matrix)

Select features that have at least 2 unique values and that are not all null

remove_highly_correlated_features(feature_matrix)

Removes columns in feature matrix that are highly correlated with another column.

remove_highly_null_features(feature_matrix)

Removes columns from a feature matrix that have higher than a set threshold of null values.

remove_single_value_features(feature_matrix)

Removes columns in feature matrix where all the values are the same.
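The selection rules above can be pictured on a toy "feature matrix" stored as a dictionary of columns. The sketch below is an illustration in plain Python, not the library's functions: the helper names, the use of `None` for null values, and the threshold value are all assumptions.

```python
def drop_highly_null(columns, pct_null_threshold=0.95):
    """Hypothetical helper: drop columns whose null fraction exceeds the threshold."""
    kept = {}
    for name, values in columns.items():
        null_fraction = sum(v is None for v in values) / len(values)
        if null_fraction <= pct_null_threshold:
            kept[name] = values
    return kept


def drop_single_value(columns):
    """Hypothetical helper: drop columns in which every value is the same."""
    return {name: values for name, values in columns.items() if len(set(values)) > 1}


matrix = {
    "mostly_null": [1, None, None, None],
    "constant": [7, 7, 7, 7],
    "useful": [1, 2, None, 4],
}
after_null_filter = drop_highly_null(matrix, pct_null_threshold=0.5)
after_both = drop_single_value(after_null_filter)
print(sorted(after_both))
```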

Feature Matrix utils#

replace_inf_values(feature_matrix[, ...])

Replace all np.inf values in a feature matrix with the specified replacement value.
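As a small illustration of the replacement step, the sketch below swaps infinite floats for a chosen value in a list-of-rows matrix. It is not the library function: the name `replace_inf` is hypothetical, and treating both positive and negative infinity as replaceable is an assumption.

```python
import math


def replace_inf(rows, replacement=None):
    """Hypothetical helper: replace +inf and -inf floats with `replacement`."""
    return [
        [replacement if isinstance(v, float) and math.isinf(v) else v for v in row]
        for row in rows
    ]


cleaned = replace_inf([[1.0, float("inf")], [float("-inf"), 2.0]], replacement=0.0)
print(cleaned)
```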

Saving and Loading Features#

save_features(features[, location, profile_name])

Saves the features list as JSON to a specified filepath/S3 path, writes to an open file, or returns the serialized features as a JSON string.

load_features(features[, profile_name])

Loads the features from a filepath, S3 path, URL, an open file, or a JSON formatted string.

EntitySet, Relationship#

Constructors#

EntitySet([id, dataframes, relationships])

Stores all actual data and typing information for an entityset

Relationship(entityset, ...)

Class to represent a relationship between dataframes

EntitySet load and prepare data#

EntitySet.add_dataframe(dataframe[, ...])

Add a DataFrame to the EntitySet with Woodwork typing information.

EntitySet.add_interesting_values([...])

Find or set interesting values for categorical columns, to be used to generate "where" clauses

EntitySet.add_last_time_indexes([...])

Calculates the last time index values for each dataframe (the last time an instance or children of that instance were observed).

EntitySet.add_relationship([...])

Add a new relationship between dataframes in the entityset.

EntitySet.add_relationships(relationships)

Add multiple new relationships to an entityset.

EntitySet.concat(other[, inplace])

Combine entityset with another to create a new entityset with the combined data of both entitysets.

EntitySet.normalize_dataframe(...[, ...])

Create a new dataframe and relationship from unique values of an existing column.

EntitySet.set_secondary_time_index(...)

Set the secondary time index for a dataframe in the EntitySet using its dataframe name.

EntitySet.replace_dataframe(dataframe_name, df)

Replace the internal dataframe of an EntitySet table, keeping Woodwork typing information the same.

EntitySet serialization#

read_entityset(path[, profile_name])

Read entityset from disk, S3 path, or URL.

EntitySet.to_csv(path[, sep, encoding, ...])

Write entityset to disk in the csv format, location specified by path.

EntitySet.to_pickle(path[, compression, ...])

Write entityset in the pickle format, location specified by path.

EntitySet.to_parquet(path[, engine, ...])

Write entityset to disk in the parquet format, location specified by path.

EntitySet query methods#

EntitySet.__getitem__(dataframe_name)

Get dataframe instance from entityset

EntitySet.find_backward_paths(...)

Generator which yields all backward paths between a start and goal dataframe.

EntitySet.find_forward_paths(...)

Generator which yields all forward paths between a start and goal dataframe.

EntitySet.get_forward_dataframes(dataframe_name)

Get dataframes that are in a forward relationship with dataframe

EntitySet.get_backward_dataframes(dataframe_name)

Get dataframes that are in a backward relationship with dataframe

EntitySet.query_by_values(dataframe_name, ...)

Query instances that have column with given value

EntitySet visualization#

EntitySet.plot([to_file])

Create a UML diagram-ish graph of the EntitySet.

Relationship attributes#

Relationship.parent_column

Column in parent dataframe

Relationship.child_column

Column in child dataframe

Relationship.parent_dataframe

Parent dataframe object

Relationship.child_dataframe

Child dataframe object

Data Type Util Methods#

list_logical_types()

Returns a dataframe describing all of the available Logical Types.

list_semantic_tags()

Returns a dataframe describing all of the common semantic tags.

Primitive Util Methods#

get_recommended_primitives(entityset[, ...])

Get a list of recommended primitives given an entity set.

list_primitives()

Returns a DataFrame that lists and describes each built-in primitive.

summarize_primitives()

Returns a metrics summary DataFrame of all primitives found in list_primitives.