Class

org.apache.spark.mllib.clustering

StreamingKMeansModel

Related Doc: package clustering

class StreamingKMeansModel extends KMeansModel with Logging

StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.

The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:

$$ \begin{align} c_{t+1} &= \frac{c_t \, n_t \, a + x_t \, m_t}{n_t \, a + m_t} \\ n_{t+1} &= n_t \, a + m_t \end{align} $$

Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.

The decay factor a scales the contribution of the clusters as estimated thus far: it is applied as a discount on the existing estimate whenever a new batch of data is incorporated. If a = 1, all batches are weighted equally. If a = 0, new centroids are determined entirely by the most recent batch. Lower values correspond to more forgetting.
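To make the roles of a, n_t and m_t concrete, here is a minimal plain-Scala sketch of one update step for a single cluster. It does not use Spark; the names `Cluster` and `step` are hypothetical and exist only for illustration. As an assumption, the sketch normalizes the weighted average by the decayed weight n_t * a + m_t, chosen so that a = 0 yields exactly the batch centroid, as the text describes.

```scala
object DecayedUpdateSketch {
  /** State of one cluster: centroid c_t and accumulated weight n_t. */
  final case class Cluster(centroid: IndexedSeq[Double], weight: Double)

  /**
   * One update step for a single cluster.
   * batchCentroid is x_t, batchCount is m_t, a is the decay factor.
   * Assumption: the average is normalized by n_t * a + m_t.
   */
  def step(c: Cluster, batchCentroid: IndexedSeq[Double], batchCount: Double, a: Double): Cluster = {
    require(a >= 0.0 && a <= 1.0, "decay factor must be in [0, 1]")
    require(c.centroid.length == batchCentroid.length, "dimension mismatch")
    val discounted = c.weight * a            // n_t * a
    val newWeight = discounted + batchCount  // n_{t+1}
    if (newWeight == 0.0) {
      // Nothing to average: keep the old centroid, record the zero weight.
      c.copy(weight = newWeight)
    } else {
      val newCentroid = c.centroid.zip(batchCentroid).map { case (ci, xi) =>
        (ci * discounted + xi * batchCount) / newWeight
      }
      Cluster(newCentroid, newWeight)
    }
  }

  def main(args: Array[String]): Unit = {
    val old = Cluster(IndexedSeq(0.0, 0.0), 10.0)
    val x = IndexedSeq(1.0, 1.0)
    println(step(old, x, 10.0, 1.0)) // a = 1: old and new points count equally
    println(step(old, x, 10.0, 0.0)) // a = 0: the old estimate is forgotten
  }
}
```

With a = 1 the result is the plain running mean over all points seen so far; with a = 0 the previous centroid has no influence on the result.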

Decay can optionally be specified by a half life and an associated time unit. The time unit can be either a batch of data or a single data point. For data that arrived at time t, the half life h is defined such that at time t + h the discount applied to the data from t is 0.5. The definition remains the same whether the time unit is given as batches or points.
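The half-life definition pins down the decay factor: a discount of 0.5 after h time units means a^h = 0.5, so a = 0.5^(1/h). The following plain-Scala sketch works through that arithmetic. The names `decayFromHalfLife` and `discount` are hypothetical helpers, not part of this class.

```scala
object HalfLifeSketch {
  /** Decay factor a such that a^h = 0.5 for a half life of h time units. */
  def decayFromHalfLife(halfLife: Double): Double = {
    require(halfLife > 0.0, "half life must be positive")
    math.exp(math.log(0.5) / halfLife)
  }

  /** Discount applied to data that is `elapsed` time units old. */
  def discount(a: Double, elapsed: Double): Double = math.pow(a, elapsed)

  def main(args: Array[String]): Unit = {
    val a = decayFromHalfLife(5.0)
    println(s"decay factor for h = 5: $a")
    println(s"discount after 5 units: ${discount(a, 5.0)}")
    println(s"discount after 10 units: ${discount(a, 10.0)}")
  }
}
```

The time unit only changes what counts as one unit of elapsed time: one per batch when the unit is batches, or one per data point when the unit is points.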

Annotations
@Since( "1.2.0" )
Source
StreamingKMeans.scala
Linear Supertypes
Logging, KMeansModel, PMMLExportable, Serializable, Serializable, Saveable, AnyRef, Any

Instance Constructors

  1. new StreamingKMeansModel(clusterCenters: Array[Vector], clusterWeights: Array[Double])

    Annotations
    @Since( "1.2.0" )
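As a minimal sketch of the constructor, the snippet below builds a two-cluster model from hand-picked centers and weights and queries it for a single point. It assumes spark-mllib is on the classpath; the object name `ConstructModelSketch` and the numeric values are illustrative only.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeansModel
import org.apache.spark.mllib.linalg.Vectors

object ConstructModelSketch {
  /** Builds a two-cluster model with illustrative centers and weights. */
  def build(): StreamingKMeansModel =
    new StreamingKMeansModel(
      Array(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0)), // clusterCenters
      Array(1.0, 1.0))                                           // clusterWeights

  def main(args: Array[String]): Unit = {
    val model = build()
    println(s"k = ${model.k}")
    // predict(point: Vector) returns the index of the nearest center.
    println(s"cluster of (9, 9): ${model.predict(Vectors.dense(9.0, 9.0))}")
  }
}
```

The two arrays are expected to line up by index: `clusterWeights(i)` is the weight accumulated so far for `clusterCenters(i)`.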

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. val clusterCenters: Array[Vector]

    Definition Classes
    StreamingKMeansModel → KMeansModel
    Annotations
    @Since( "1.2.0" )
  7. val clusterWeights: Array[Double]

    Annotations
    @Since( "1.2.0" )
  8. def computeCost(data: RDD[Vector]): Double

    Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

    Definition Classes
    KMeansModel
    Annotations
    @Since( "0.8.0" )
  9. val distanceMeasure: String

    Definition Classes
    KMeansModel
    Annotations
    @Since( "2.4.0" )
  10. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  11. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  12. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  13. def formatVersion: String

    Current version of model save/load format.

    Attributes
    protected
    Definition Classes
    KMeansModel → Saveable
  14. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  15. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  16. def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean = false): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  17. def initializeLogIfNecessary(isInterpreter: Boolean): Unit

    Attributes
    protected
    Definition Classes
    Logging
  18. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  19. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  20. def k: Int

    Total number of clusters.

    Definition Classes
    KMeansModel
    Annotations
    @Since( "0.8.0" )
  21. def log: Logger

    Attributes
    protected
    Definition Classes
    Logging
  22. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  23. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  24. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  25. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  26. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  27. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  28. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  29. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  30. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  31. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  32. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  33. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  34. final def notify(): Unit

    Definition Classes
    AnyRef
  35. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  36. def predict(points: JavaRDD[Vector]): JavaRDD[Integer]

    Maps given points to their cluster indices.

    Definition Classes
    KMeansModel
    Annotations
    @Since( "1.0.0" )
  37. def predict(points: RDD[Vector]): RDD[Int]

    Maps given points to their cluster indices.

    Definition Classes
    KMeansModel
    Annotations
    @Since( "1.0.0" )
  38. def predict(point: Vector): Int

    Returns the cluster index that a given point belongs to.

    Definition Classes
    KMeansModel
    Annotations
    @Since( "0.8.0" )
  39. def save(sc: SparkContext, path: String): Unit

    Save this model to the given path.

    This saves:

    • human-readable (JSON) model metadata to path/metadata/
    • Parquet formatted data to path/data/

    The model may be loaded using Loader.load.

    sc

    Spark context used to save model data.

    path

    Path specifying the directory in which to save this model. If the directory already exists, this method throws an exception.

    Definition Classes
    KMeansModel → Saveable
    Annotations
    @Since( "1.4.0" )
  40. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  41. def toPMML(): String

    Export the model to a String in PMML format

    Definition Classes
    PMMLExportable
    Annotations
    @Since( "1.4.0" )
  42. def toPMML(outputStream: OutputStream): Unit

    Export the model to the OutputStream in PMML format

    Definition Classes
    PMMLExportable
    Annotations
    @Since( "1.4.0" )
  43. def toPMML(sc: SparkContext, path: String): Unit

    Export the model to a directory on a distributed file system in PMML format

    Definition Classes
    PMMLExportable
    Annotations
    @Since( "1.4.0" )
  44. def toPMML(localPath: String): Unit

    Export the model to a local file in PMML format

    Definition Classes
    PMMLExportable
    Annotations
    @Since( "1.4.0" )
  45. def toString(): String

    Definition Classes
    AnyRef → Any
  46. val trainingCost: Double

    Definition Classes
    KMeansModel
    Annotations
    @Since( "2.4.0" )
  47. def update(data: RDD[Vector], decayFactor: Double, timeUnit: String): StreamingKMeansModel

    Perform a k-means update on a batch of data.

    Annotations
    @Since( "1.2.0" )
  48. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  49. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  50. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
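The `update` member listed above can be exercised on a small in-memory batch. The sketch below is illustrative rather than definitive: it assumes spark-core and spark-mllib are on the classpath, runs against a local master, and passes the time unit as the string "batches", which is an assumption about the accepted values since this page does not list them. The object name `UpdateSketch` and the data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.StreamingKMeansModel
import org.apache.spark.mllib.linalg.Vectors

object UpdateSketch {
  /** Runs one update on a tiny in-memory batch and returns the resulting model. */
  def run(sc: SparkContext): StreamingKMeansModel = {
    val model = new StreamingKMeansModel(
      Array(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0)),
      Array(1.0, 1.0))
    // One point near each existing center.
    val batch = sc.parallelize(Seq(Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 9.0)))
    // decayFactor = 1.0 weights old and new data equally.
    // "batches" is the assumed time-unit string (not stated on this page).
    model.update(batch, 1.0, "batches")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UpdateSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val updated = run(sc)
      updated.clusterCenters.zip(updated.clusterWeights).foreach { case (c, w) =>
        println(s"center = $c, weight = $w")
      }
    } finally {
      sc.stop()
    }
  }
}
```

With a decay factor of 1.0, each cluster's weight grows by the number of batch points assigned to it, and each center moves toward the mean of those points in proportion to their share of the total weight.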
