org.apache.spark.mllib.clustering
Return the largest change in log-likelihood at which convergence is considered to have occurred.
Return the user-supplied initial GMM, if one was supplied.
Return the number of Gaussians in the mixture model
Return the maximum number of iterations to run
Return the random seed
Java-friendly version of run()
Perform expectation maximization
Set the largest change in log-likelihood at which convergence is considered to have occurred.
Set the initial GMM starting point, bypassing the random initialization. You must call setK() before calling this method, and the condition model.k == this.k must hold; otherwise an IllegalArgumentException is thrown.
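As a sketch of the required call order (assuming an existing GaussianMixtureModel named `initialModel` with `initialModel.k == 3`; the variable name is illustrative):

```scala
import org.apache.spark.mllib.clustering.GaussianMixture

// `initialModel` is an assumed pre-existing GaussianMixtureModel with k == 3.
val gm = new GaussianMixture()
  .setK(3)                        // must be called first, and must equal initialModel.k
  .setInitialModel(initialModel)  // otherwise an IllegalArgumentException is thrown
```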
Set the number of Gaussians in the mixture model. Default: 2
Set the maximum number of iterations to run. Default: 100
Set the random seed
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each distribution's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
Note: For high-dimensional data (with many features), this algorithm may perform poorly. This is due to high-dimensional data (a) making it difficult to cluster at all (based on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.
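A minimal end-to-end sketch of the workflow described above, assuming an existing SparkContext `sc` and the sample points shown (both are illustrative assumptions, not part of the API):

```scala
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Two well-separated clusters of 2-D sample points; `sc` is an assumed SparkContext.
val data = sc.parallelize(Seq(
  Vectors.dense(-5.0, -4.8), Vectors.dense(-4.9, -5.1),
  Vectors.dense(5.0, 5.2),   Vectors.dense(5.1, 4.9)
))

// Iterate EM until the log-likelihood changes by less than convergenceTol,
// or until maxIterations is reached, whichever comes first.
val gmm = new GaussianMixture()
  .setK(2)                  // number of Gaussians (default: 2)
  .setConvergenceTol(1e-3)  // largest log-likelihood change at convergence
  .setMaxIterations(100)    // default: 100
  .setSeed(42L)             // fix the random seed for reproducible initialization
  .run(data)

// Inspect the fitted mixture: one weight and one Gaussian per component.
gmm.weights.zip(gmm.gaussians).foreach { case (w, g) =>
  println(s"weight=$w mu=${g.mu} sigma=${g.sigma}")
}
```

Note that `run` returns a GaussianMixtureModel, whose `weights` and `gaussians` arrays expose the fitted mixing weights and per-component multivariate Gaussians.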