org.apache.spark.mllib.clustering
Alias for getDocConcentration
Alias for getTopicConcentration
Period (in iterations) between checkpoints.
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a symmetric Dirichlet distribution.
Number of topics to infer, i.e., the number of soft cluster centers.
Maximum number of iterations for learning.
Random seed
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
Java-friendly version of run()
Learn an LDA model using the given dataset.
RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
Inferred LDA model
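A minimal sketch of calling run() (assumes an existing SparkContext named sc; the document IDs, vocabulary size, and counts are purely illustrative):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Each document is (unique nonnegative ID, term-count vector over a
// fixed-size vocabulary); here the vocabulary has 3 terms.
val documents = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0))
))

val ldaModel = new LDA().setK(2).run(documents)
```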
Alias for setDocConcentration()
Alias for setTopicConcentration()
Period (in iterations) between checkpoints (default = 10). Checkpointing helps with recovery when nodes fail. It also helps eliminate temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in org.apache.spark.SparkContext, this setting is ignored.
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a symmetric Dirichlet distribution.
This value should be > 1.0, where larger values mean more smoothing (more regularization). If set to -1, then docConcentration is set automatically. (default = -1 = automatic)
Automatic setting of parameter: for EM, the default is (50 / k) + 1. The 50/k heuristic is common in LDA libraries; the +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions), but values in (0,1) are not yet supported.
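Assuming the EM default of (50 / k) + 1 (the common 50/k heuristic plus the +1 EM adjustment from Asuncion et al., 2009), the automatic value can be written out as plain arithmetic:

```scala
// Sketch of the assumed automatic docConcentration setting under EM:
// (50 / k) + 1, where 50/k is a common LDA heuristic and +1 is the
// Asuncion et al. (2009) adjustment for EM.
def autoDocConcentration(k: Int): Double = 50.0 / k + 1.0

println(autoDocConcentration(10)) // 6.0 for the default k = 10
```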
Number of topics to infer, i.e., the number of soft cluster centers. (default = 10)
Maximum number of iterations for learning. (default = 20)
Random seed
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
This value should be > 1.0, where larger values mean more smoothing (more regularization). If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
Automatic setting of parameter: for EM, the default is 0.1 + 1. The 0.1 gives a small amount of smoothing; the +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions), but values in (0,1) are not yet supported.
:: Experimental ::
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology: "word" = "term" (an element of the vocabulary); "token" (an instance of a term appearing in a document); "topic" (a multinomial distribution over terms representing some concept).
Currently, the underlying implementation uses Expectation-Maximization (EM), implemented according to the Asuncion et al. (2009) paper referenced below.
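A hedged end-to-end sketch of the EM-based workflow described above (assumes a live SparkContext named sc; the tiny corpus and parameter values are illustrative only):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Bag-of-words corpus: (unique nonnegative doc ID, term-count vector).
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(2.0, 1.0, 0.0, 0.0)),
  (1L, Vectors.dense(0.0, 0.0, 3.0, 1.0))
))

val lda = new LDA()
  .setK(2)                  // number of topics (soft cluster centers)
  .setMaxIterations(20)     // the stated default, shown explicitly
  .setDocConcentration(-1)  // -1 = set "alpha" automatically
  .setTopicConcentration(-1) // -1 = set "beta"/"eta" automatically
  .setSeed(42L)

val model = lda.run(corpus)
// topicsMatrix: vocabSize x k matrix of inferred topic-term weights
println(model.topicsMatrix)
```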
References:
Latent Dirichlet allocation (Wikipedia)