nlp_primitives.LSA

class nlp_primitives.LSA[source]

Calculates the Latent Semantic Analysis values of text input.

Description:

Given a list of strings, transforms those strings using tf-idf and singular value decomposition, going from a sparse matrix to a compact matrix with two values for each string. These values represent the Latent Semantic Analysis of each string and capture its context with respect to [nltk's Brown sentence corpus](https://www.nltk.org/book/ch02.html#brown-corpus).

If a string is missing, NaN is returned.
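The transformation described above can be pictured as a tf-idf vectorizer followed by a two-component truncated SVD, both fitted on the reference corpus. The following is a rough, illustrative sketch of that idea using scikit-learn and nltk; it is not the primitive's actual implementation, and details such as the corpus subset and vectorizer settings are assumptions.

# Illustrative sketch of the tf-idf + truncated-SVD idea described above;
# not the primitive's actual implementation.
import nltk
import numpy as np
from nltk.corpus import brown
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

nltk.download("brown", quiet=True)  # reference corpus, per the description above

# Fit tf-idf followed by a 2-component SVD on Brown corpus sentences.
corpus = [" ".join(sent) for sent in brown.sents()[:10000]]  # subset for speed (assumption)
pipeline = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
pipeline.fit(corpus)

def lsa_values(strings):
    """Return two LSA values per input string; NaN columns for missing strings."""
    out = np.full((2, len(strings)), np.nan)
    for i, text in enumerate(strings):
        if isinstance(text, str):
            out[:, i] = pipeline.transform([text])[0]
    return out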

Examples

>>> from nlp_primitives import LSA
>>> lsa = LSA()
>>> x = ["he helped her walk,", "me me me eat food", "the sentence doth long"]
>>> res = lsa(x).tolist()
>>> for i in range(len(res)): res[i] = [abs(round(x, 2)) for x in res[i]]
>>> res
[[0.0, 0.0, 0.01], [0.0, 0.0, 0.0]]

Now, if we change the underlying corpus to something that better resembles the given text, the same input text will produce different, more discerning output. NaN values are handled as well, as are strings without words.

>>> import numpy as np
>>> lsa = LSA()
>>> x = ["the earth is round", "", np.NaN, ".,/"]
>>> res = lsa(x).tolist()
>>> for i in range(len(res)): res[i] = [abs(round(x, 2)) for x in res[i]]
>>> res
[[0.01, 0.0, nan, 0.0], [0.0, 0.0, nan, 0.0]]
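Like the other primitives in nlp_primitives, LSA is intended to be handed to Featuretools as a transform primitive. The sketch below is a minimal usage outline under that assumption; the entityset construction, the Text variable type, and the target_entity argument name follow the older Featuretools API and may differ in your installed version.

# Minimal usage sketch; the entityset details and argument names are
# assumptions and vary across featuretools versions.
import featuretools as ft
import pandas as pd
from nlp_primitives import LSA

df = pd.DataFrame({
    "id": [0, 1, 2],
    "comment": ["he helped her walk,", "me me me eat food", "the sentence doth long"],
})
es = ft.EntitySet("comments")
es.entity_from_dataframe(entity_id="comments", dataframe=df, index="id",
                         variable_types={"comment": ft.variable_types.Text})

# LSA is a multi-output primitive: it adds two feature columns per text column.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="comments",
                                      trans_primitives=[LSA])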
__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__()

Initialize self.

generate_name(base_feature_names)

generate_names(base_feature_names)

get_args_string()

get_arguments()

get_filepath(filename)

get_function()

Attributes

base_of

base_of_exclude

commutative

dask_compatible

default_value

input_types

max_stack_depth

name

number_output_features

uses_calc_time

uses_full_entity
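As a quick check of two of these attributes (assuming the installed nlp_primitives release matches this page), the two rows seen in the example outputs above correspond to number_output_features:

>>> lsa = LSA()
>>> lsa.number_output_features
2
>>> lsa.name
'lsa'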