premium_primitives.LSA

class premium_primitives.LSA(random_seed=0, corpus=None, algorithm='randomized')

Calculates the Latent Semantic Analysis values of NaturalLanguage input.

Description:

Given a list of strings, transforms those strings using tf-idf and singular value decomposition to go from a sparse matrix to a compact matrix with two values for each string. These values represent the Latent Semantic Analysis of each string. By default, the values represent each string's context with respect to NLTK's Gutenberg corpus. Users can optionally supply a custom corpus when initializing the primitive by passing a list of strings to the corpus parameter.

If a string is missing, return NaN.

Note: If a small custom corpus is used, the output of the primitive may vary depending on the platform being used (Linux, macOS, Windows). This is especially true when using the default “randomized” algorithm for the TruncatedSVD component.
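The transformation described above can be approximated with scikit-learn. The following is a minimal sketch, not the primitive's actual implementation: the helper name lsa_sketch is hypothetical, and it assumes the primitive fits a TfidfVectorizer followed by a two-component TruncatedSVD on the corpus, then projects each input string, returning NaN for missing entries.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def lsa_sketch(strings, corpus, random_seed=0, algorithm="randomized"):
    """Hypothetical helper: project strings onto two LSA components.

    Assumption: tf-idf followed by TruncatedSVD(n_components=2), fitted
    on `corpus`. The real primitive may differ in its details.
    """
    pipeline = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(
            n_components=2, random_state=random_seed, algorithm=algorithm
        ),
    )
    pipeline.fit(corpus)

    # Only real strings are transformed; missing entries stay NaN.
    present = [i for i, s in enumerate(strings) if isinstance(s, str)]
    out = np.full((2, len(strings)), np.nan)
    if present:
        projected = pipeline.transform([strings[i] for i in present])
        out[:, present] = projected.T
    return out
```

Calling lsa_sketch(["The dogs ate food.", "", float("nan")], corpus, algorithm="arpack") with a small list of corpus strings returns an array of shape (2, 3): one row per component and one column per input string, with zeros for the empty string and NaN for the missing value.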

Parameters:

random_seed (int, optional) – The random seed value to use for the call to TruncatedSVD. Will default to 0 if not specified.
corpus (list[str], optional) – A list of strings to use as a custom corpus. Will default to the NLTK Gutenberg corpus if not specified.
algorithm (str, optional) – The algorithm to use for the call to TruncatedSVD. Should be either “randomized” or “arpack”. Will default to “randomized” if not specified.

Examples

>>> from premium_primitives import LSA
>>> lsa = LSA()
>>> x = ["he helped her walk,", "me me me eat food", "the sentence doth long"]
>>> res = lsa(x).tolist()
>>> for i in range(len(res)): res[i] = [abs(round(x, 2)) for x in res[i]]
>>> res
[[0.01, 0.01, 0.01], [0.0, 0.0, 0.01]]

If the input text is changed to something that better resembles the corpus, the output becomes more discerning. NaN values and strings without words are also handled.

>>> import numpy as np
>>> lsa = LSA()
>>> x = ["the earth is round", "", np.nan, ".,/"]
>>> res = lsa(x).tolist()
>>> for i in range(len(res)): res[i] = [abs(round(x, 2)) for x in res[i]]
>>> res
[[0.02, 0.0, nan, 0.0], [0.02, 0.0, nan, 0.0]]

Users can also pass in a custom corpus and specify the algorithm for the TruncatedSVD component used by the primitive.

>>> custom_corpus = ["dogs ate food", "she ate pineapple", "hello"]
>>> lsa = LSA(corpus=custom_corpus, algorithm="arpack")
>>> x = ["The dogs ate food.",
...      "She ate a pineapple",
...      "Consume Electrolytes, he told me.",
...      "Hello",]
>>> res = lsa(x).tolist()
>>> for i in range(len(res)): res[i] = [abs(round(x, 2)) for x in res[i]]
>>> res
[[0.68, 0.68, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

__init__(random_seed=0, corpus=None, algorithm='randomized')

Methods

`__init__`([random_seed, corpus, algorithm])
`flatten_nested_input_types`(input_types)	Flattens nested column schema inputs into a single list.
`generate_name`(base_feature_names)
`generate_names`(base_feature_names)
`get_args_string`()
`get_arguments`()
`get_description`(input_column_descriptions[, ...])
`get_filepath`(filename)
`get_function`()

Attributes

`base_of`
`base_of_exclude`
`commutative`
`compatibility`	Additional compatible libraries
`default_value`	Default value this feature returns if no data found.
`description_template`
`input_types`	woodwork.ColumnSchema types of inputs
`max_stack_depth`
`name`	Name of the primitive
`number_output_features`	Number of columns in feature matrix associated with this feature
`return_type`	ColumnSchema type of return
`series_library`
`stack_on`
`stack_on_exclude`
`stack_on_self`
`uses_calc_time`
`uses_full_dataframe`
