Alias for getDocConcentration
Alias for getAsymmetricDocConcentration
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution.
Alias for getTopicConcentration
Period (in iterations) between checkpoints.
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.
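As a hedged illustration of the difference between the symmetric and asymmetric getters, the sketch below sets a scalar prior and a per-topic prior and reads each back. It assumes Spark MLlib (`org.apache.spark.mllib.clustering.LDA`) is on the classpath; the numeric values and the choice of k = 3 are arbitrary.

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Symmetric prior: a single Double describes the whole Dirichlet.
val symmetric = new LDA().setK(3).setDocConcentration(1.1)
val alpha: Double = symmetric.getDocConcentration

// Asymmetric prior: one entry per topic, so read it back as a Vector.
val asymmetric = new LDA().setK(3).setDocConcentration(Vectors.dense(0.5, 1.0, 2.0))
val alphaVector = asymmetric.getAsymmetricDocConcentration

// Calling asymmetric.getDocConcentration is expected to fail here,
// because the entries of the prior are not all equal.
```

Setting k before the vector prior matters in this sketch, since the vector is expected to have either one entry or k entries.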
Number of topics to infer, i.e., the number of soft cluster centers.
Maximum number of iterations allowed.
:: DeveloperApi ::
LDAOptimizer used to perform the actual calculation
Random seed for cluster initialization.
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
Java-friendly version of run()
Learn an LDA model using the given dataset.
RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
Inferred LDA model
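A minimal sketch of a call to run() on a toy corpus follows. The corpus, the application name, the local master setting, and the parameter values are illustrative assumptions, not recommendations. Each document is a (unique non-negative ID, term count vector) pair, and all vectors share a vocabulary of size 4.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Reuse an existing context (e.g. in spark-shell) or create a local one.
val conf = new SparkConf().setAppName("LdaSketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Toy corpus: document ID paired with a bag-of-words count vector.
val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(2.0, 1.0, 0.0, 0.0)),
  (1L, Vectors.dense(3.0, 2.0, 0.0, 1.0)),
  (2L, Vectors.dense(0.0, 0.0, 4.0, 2.0)),
  (3L, Vectors.dense(0.0, 1.0, 3.0, 3.0))
))

// Learn a two-topic model with the default optimizer.
val model = new LDA()
  .setK(2)
  .setMaxIterations(10)
  .setSeed(42L)
  .run(corpus)

// topicsMatrix has one row per vocabulary term and one column per topic.
val topics = model.topicsMatrix
println(s"k = ${model.k}, vocabSize = ${model.vocabSize}")
```

The context is deliberately not stopped in this sketch so that it can be pasted into an interactive shell; a standalone application would normally call `sc.stop()` when finished.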
Alias for setDocConcentration()
Alias for setDocConcentration()
Alias for setTopicConcentration()
Sets the checkpoint interval (>= 1), or disables checkpointing (-1). For example, 10 means that the cache is checkpointed every 10 iterations. Checkpointing helps with recovery when nodes fail. It also helps eliminate temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in org.apache.spark.SparkContext, this setting is ignored. (default = 10)
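Because the interval is ignored when no checkpoint directory is set, the two settings are usually configured together. The helper below is a hypothetical convenience, not part of Spark; its name and parameters are assumptions for illustration only.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.LDA

// Hypothetical helper: sets the checkpoint directory on the context and
// returns an LDA instance with the matching checkpoint interval.
def withCheckpointing(sc: SparkContext, dir: String, interval: Int): LDA = {
  // Without a checkpoint directory, the interval setting has no effect.
  sc.setCheckpointDir(dir)
  // An interval of -1 disables checkpointing; otherwise it must be >= 1.
  new LDA().setCheckpointInterval(interval)
}
```

A call such as `withCheckpointing(sc, "/tmp/lda-checkpoints", 10)` would checkpoint every 10 iterations; the path is a placeholder and, on a cluster, should be a location every executor can reach.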
Replicates a Double docConcentration to create a symmetric prior.
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)
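The three cases above can be sketched in plain Scala. This is a hypothetical stand-in written for illustration, not Spark's implementation: the function name is invented, a `Seq[Double]` replaces the Spark Vector, and the automatic case returns `None` because the actual default depends on the optimizer.

```scala
// Hypothetical model of how a docConcentration value is interpreted for k topics.
def expandDocConcentration(alpha: Seq[Double], k: Int): Option[Seq[Double]] =
  alpha match {
    // Singleton -1: leave the choice to the optimizer (automatic).
    case Seq(-1.0) => None
    // Any other singleton t: replicate it into a symmetric prior of length k.
    case Seq(t) => Some(Seq.fill(k)(t))
    // A full-length vector is used as given.
    case v if v.length == k => Some(v)
    // Anything else has the wrong length.
    case v => throw new IllegalArgumentException(
      s"docConcentration must have length 1 or $k, got ${v.length}")
  }
```

For instance, `expandDocConcentration(Seq(0.5), 3)` yields a three-element prior with every entry equal to 0.5.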
Optimizer-specific parameter settings:
Set the number of topics to infer, i.e., the number of soft cluster centers. (default = 10)
Set the maximum number of iterations allowed. (default = 20)
Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em", "online" are supported.
:: DeveloperApi ::
LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
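The two ways of choosing an optimizer, by algorithm name or by instance, might look like the following sketch. It assumes Spark MLlib is on the classpath; the mini-batch fraction is an arbitrary illustrative value.

```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Select the optimizer by name ("em" or "online").
val byName = new LDA().setOptimizer("online")

// Or pass a configured optimizer instance (developer API).
// The value 0.05 is chosen only for illustration.
val onlineOptimizer = new OnlineLDAOptimizer().setMiniBatchFraction(0.05)
val byInstance = new LDA().setOptimizer(onlineOptimizer)
```

Passing an instance is the route to take when optimizer-specific settings need to differ from their defaults; the name-based form uses a default-configured optimizer.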
Set the random seed for cluster initialization.
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
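A short sketch of the topic concentration setter and its getters follows, assuming Spark MLlib is on the classpath. The value 1.1 is arbitrary, and the acceptable range may depend on the optimizer in use.

```scala
import org.apache.spark.mllib.clustering.LDA

val lda = new LDA()

// Before any value is set, the getter reports -1, meaning "automatic".
val automatic: Double = lda.getTopicConcentration

// Set an explicit symmetric prior on the topics' distributions over terms.
lda.setTopicConcentration(1.1)

// getBeta is an alias for getTopicConcentration.
val viaAlias: Double = lda.getBeta
```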
Optimizer-specific parameter settings:
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
References:
Latent Dirichlet allocation (Wikipedia)