KMeans (Spark 2.2.1 JavaDoc)

Object
- org.apache.spark.mllib.clustering.KMeans

All Implemented Interfaces:

java.io.Serializable, Logging
```
public class KMeans
extends Object
implements scala.Serializable, Logging
```
K-means clustering with a k-means++-like initialization mode (the k-means|| algorithm by Bahmani et al.).
This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.
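
The multiple passes mentioned above can be seen in a minimal single-machine sketch of one k-means iteration. This is plain Java written for illustration, not Spark's implementation; the class `LloydPassSketch` and its methods are hypothetical names. Every iteration reads every point again, which is why an uncached RDD would be recomputed on each pass.

```java
import java.util.Arrays;

// Hypothetical illustration only: a single-machine k-means iteration, not Spark's code.
class LloydPassSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the center closest to the given point.
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // One full pass over the data: assign every point, then average each cluster.
    static double[][] onePass(double[][] points, double[][] centers) {
        int dim = points[0].length;
        double[][] sums = new double[centers.length][dim];
        int[] counts = new int[centers.length];
        for (double[] p : points) {
            int c = nearest(p, centers);
            counts[c]++;
            for (int i = 0; i < dim; i++) {
                sums[c][i] += p[i];
            }
        }
        double[][] updated = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            if (counts[c] == 0) {
                updated[c] = centers[c].clone(); // an empty cluster keeps its old center
            } else {
                updated[c] = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] centers = {{1, 1}, {8, 1}};
        for (int iter = 0; iter < 3; iter++) {
            centers = onePass(points, centers); // the data is scanned again on every iteration
        }
        System.out.println(Arrays.deepToString(centers)); // prints [[0.0, 1.0], [10.0, 1.0]]
    }
}
```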

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description
`KMeans()` Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random}.

Method Summary

Modifier and Type	Method and Description
`double`	`getEpsilon()` The distance threshold within which we consider centers to have converged.
`String`	`getInitializationMode()` The initialization algorithm.
`int`	`getInitializationSteps()` Number of steps for the k-means|| initialization mode.
`int`	`getK()` Number of clusters to create (k).
`int`	`getMaxIterations()` Maximum number of iterations allowed.
`int`	`getRuns()` Deprecated. This has no effect and always returns 1. Since 2.1.0.
`long`	`getSeed()` The random seed for cluster initialization.
`static String`	`K_MEANS_PARALLEL()`
`static String`	`RANDOM()`
`KMeansModel`	`run(RDD<Vector> data)` Train a K-means model on the given set of points; `data` should be cached for high performance, because this is an iterative algorithm.
`KMeans`	`setEpsilon(double epsilon)` Set the distance threshold within which we consider centers to have converged.
`KMeans`	`setInitializationMode(String initializationMode)` Set the initialization algorithm.
`KMeans`	`setInitializationSteps(int initializationSteps)` Set the number of steps for the k-means|| initialization mode.
`KMeans`	`setInitialModel(KMeansModel model)` Set the initial starting point, bypassing the random initialization or k-means||. The condition model.k == this.k must be met; otherwise an IllegalArgumentException is thrown.
`KMeans`	`setK(int k)` Set the number of clusters to create (k).
`KMeans`	`setMaxIterations(int maxIterations)` Set maximum number of iterations allowed.
`KMeans`	`setRuns(int runs)` Deprecated. This has no effect. Since 2.1.0.
`KMeans`	`setSeed(long seed)` Set the random seed for cluster initialization.
`static KMeansModel`	`train(RDD<Vector> data, int k, int maxIterations)` Trains a k-means model using specified parameters and the default values for unspecified.
`static KMeansModel`	`train(RDD<Vector> data, int k, int maxIterations, int runs)` Deprecated. Use train method without 'runs'. Since 2.1.0.
`static KMeansModel`	`train(RDD<Vector> data, int k, int maxIterations, int runs, String initializationMode)` Deprecated. Use train method without 'runs'. Since 2.1.0.
`static KMeansModel`	`train(RDD<Vector> data, int k, int maxIterations, int runs, String initializationMode, long seed)` Deprecated. Use train method without 'runs'. Since 2.1.0.
`static KMeansModel`	`train(RDD<Vector> data, int k, int maxIterations, String initializationMode)` Trains a k-means model using the given set of parameters.
`static KMeansModel`	`train(RDD<Vector> data, int k, int maxIterations, String initializationMode, long seed)` Trains a k-means model using the given set of parameters.

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeLogging, initializeLogIfNecessary, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning

- Constructor Detail
  - KMeans
```
public KMeans()
```
    Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random}.
- Method Detail
  - RANDOM
```
public static String RANDOM()
```
  - K_MEANS_PARALLEL
```
public static String K_MEANS_PARALLEL()
```
  - train
```
public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations,
                                String initializationMode,
                                long seed)
```
    Trains a k-means model using the given set of parameters.
    
    Parameters:
    
    data - Training points as an RDD of Vector types.
    
    k - Number of clusters to create.
    
    maxIterations - Maximum number of iterations allowed.
    
    initializationMode - The initialization algorithm. This can either be "random" or "k-means||". (default: "k-means||")
    
    seed - Random seed for cluster initialization. Default is to generate seed based on system time.
    
    Returns:
    
    (undocumented)
  - train
```
public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations,
                                String initializationMode)
```
    Trains a k-means model using the given set of parameters.
    
    Parameters:
    
    data - Training points as an RDD of Vector types.
    
    k - Number of clusters to create.
    
    maxIterations - Maximum number of iterations allowed.
    
    initializationMode - The initialization algorithm. This can either be "random" or "k-means||". (default: "k-means||")
    
    Returns:
    
    (undocumented)
  - train
```
public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations,
                                int runs,
                                String initializationMode,
                                long seed)
```
    Deprecated. Use train method without 'runs'. Since 2.1.0.
    
    Trains a k-means model using the given set of parameters.
    
    Parameters:
    
    data - Training points as an RDD of Vector types.
    
    k - Number of clusters to create.
    
    maxIterations - Maximum number of iterations allowed.
    
    runs - This param has no effect since Spark 2.0.0.
    
    initializationMode - The initialization algorithm. This can either be "random" or "k-means||". (default: "k-means||")
    
    seed - Random seed for cluster initialization. Default is to generate seed based on system time.
    
    Returns:
    
    (undocumented)
  - train
```
public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations,
                                int runs,
                                String initializationMode)
```
    Deprecated. Use train method without 'runs'. Since 2.1.0.
    
    Trains a k-means model using the given set of parameters.
    
    Parameters:
    
    data - Training points as an RDD of Vector types.
    
    k - Number of clusters to create.
    
    maxIterations - Maximum number of iterations allowed.
    
    runs - This param has no effect since Spark 2.0.0.
    
    initializationMode - The initialization algorithm. This can either be "random" or "k-means||". (default: "k-means||")
    
    Returns:
    
    (undocumented)
  - train
```
public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations)
```
    Trains a k-means model using specified parameters and the default values for unspecified.
    
    Parameters:
    
    data - Training points as an RDD of Vector types.
    
    k - Number of clusters to create.
    
    maxIterations - Maximum number of iterations allowed.
    
    Returns:
    
    (undocumented)
  - train
```
public static KMeansModel train(RDD<Vector> data,
                                int k,
                                int maxIterations,
                                int runs)
```
    Deprecated. Use train method without 'runs'. Since 2.1.0.
    
    Trains a k-means model using specified parameters and the default values for unspecified.
    
    Parameters:
    
    data - Training points as an RDD of Vector types.
    
    k - Number of clusters to create.
    
    maxIterations - Maximum number of iterations allowed.
    
    runs - This param has no effect since Spark 2.0.0.
    
    Returns:
    
    (undocumented)
  - getK
```
public int getK()
```
    Number of clusters to create (k).
    
    Returns:
    
    (undocumented)
    
    Note:
    
    It is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.
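    
    The note above follows from a simple bound: distinct cluster centers cannot outnumber distinct input points. A minimal plain-Java sketch of that bound is given here; `DistinctPointsSketch` and `maxDistinctCenters` are hypothetical names, and the code is an illustration rather than anything Spark runs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only, not part of the Spark API.
class DistinctPointsSketch {
    // Upper bound on the number of distinct centers: min(k, number of distinct points).
    static int maxDistinctCenters(double[][] points, int k) {
        Set<List<Double>> distinct = new HashSet<>();
        for (double[] p : points) {
            List<Double> key = new ArrayList<>();
            for (double v : p) {
                key.add(v); // boxed coordinates give value-based equals/hashCode
            }
            distinct.add(key);
        }
        return Math.min(k, distinct.size());
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 1}, {2, 2}};
        // Three clusters requested, but only two distinct points exist.
        System.out.println(maxDistinctCenters(points, 3)); // prints 2
    }
}
```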
  - setK
```
public KMeans setK(int k)
```
    Set the number of clusters to create (k).
    
    Parameters:
    
    k - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Note:
    
    It is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster. Default: 2.
  - getMaxIterations
```
public int getMaxIterations()
```
    Maximum number of iterations allowed.
    
    Returns:
    
    (undocumented)
  - setMaxIterations
```
public KMeans setMaxIterations(int maxIterations)
```
    Set maximum number of iterations allowed. Default: 20.
    
    Parameters:
    
    maxIterations - (undocumented)
    
    Returns:
    
    (undocumented)
  - getInitializationMode
```
public String getInitializationMode()
```
    The initialization algorithm. This can be either "random" or "k-means||".
    
    Returns:
    
    (undocumented)
  - setInitializationMode
```
public KMeans setInitializationMode(String initializationMode)
```
    Set the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
    
    Parameters:
    
    initializationMode - (undocumented)
    
    Returns:
    
    (undocumented)
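    
    To make the difference between the two modes concrete, here is a minimal sketch of sequential k-means++ seeding in plain Java. It is an assumption-laden illustration: k-means|| is a parallel variant of this idea, and the sketch does not reproduce it or Spark's code. `SeedingSketch` and `kMeansPlusPlus` are hypothetical names. The first center is drawn uniformly, as "random" mode would draw all of them; each later center is drawn with probability proportional to its squared distance from the nearest center already chosen.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of sequential k-means++ seeding, not Spark's k-means||.
class SeedingSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static double[][] kMeansPlusPlus(double[][] points, int k, long seed) {
        Random rng = new Random(seed);
        double[][] centers = new double[k][];
        centers[0] = points[rng.nextInt(points.length)]; // first center: uniform draw

        // minDist[i] = squared distance from point i to its nearest chosen center.
        double[] minDist = new double[points.length];
        for (int i = 0; i < points.length; i++) {
            minDist[i] = squaredDistance(points[i], centers[0]);
        }

        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (double d : minDist) {
                total += d;
            }
            double target = rng.nextDouble() * total;
            int chosen = points.length - 1; // fallback for rounding or all-zero distances
            double cumulative = 0.0;
            for (int i = 0; i < points.length; i++) {
                cumulative += minDist[i];
                if (cumulative > target) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = points[chosen];
            for (int i = 0; i < points.length; i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(points[i], centers[c]));
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 0}, {100, 100}, {100, 100}};
        // The second center always lands on the location not yet covered.
        System.out.println(Arrays.deepToString(kMeansPlusPlus(points, 2, 7L)));
    }
}
```

    With purely uniform draws, both initial centers could fall on the same location in this data set; the distance-weighted draw rules that out, because points at zero distance from a chosen center carry zero weight.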
  - getRuns
```
public int getRuns()
```
    Deprecated. This has no effect and always returns 1. Since 2.1.0.
    
    This function has no effect since Spark 2.0.0.
    
    Returns:
    
    (undocumented)
  - setRuns
```
public KMeans setRuns(int runs)
```
    Deprecated. This has no effect. Since 2.1.0.
    
    This function has no effect since Spark 2.0.0.
    
    Parameters:
    
    runs - (undocumented)
    
    Returns:
    
    (undocumented)
  - getInitializationSteps
```
public int getInitializationSteps()
```
    Number of steps for the k-means|| initialization mode.
    
    Returns:
    
    (undocumented)
  - setInitializationSteps
```
public KMeans setInitializationSteps(int initializationSteps)
```
    Set the number of steps for the k-means|| initialization mode. This is an advanced setting; the default of 2 is almost always enough.
    
    Parameters:
    
    initializationSteps - (undocumented)
    
    Returns:
    
    (undocumented)
  - getEpsilon
```
public double getEpsilon()
```
    The distance threshold within which we consider centers to have converged.
    
    Returns:
    
    (undocumented)
  - setEpsilon
```
public KMeans setEpsilon(double epsilon)
```
    Set the distance threshold within which we consider centers to have converged. If all centers move less than this Euclidean distance, the algorithm stops iterating.
    
    Parameters:
    
    epsilon - (undocumented)
    
    Returns:
    
    (undocumented)
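    
    The convergence rule described for `epsilon` can be sketched as a check over the old and new center positions. This is a plain-Java illustration under the stated rule, not Spark's code; `ConvergenceSketch` and `converged` are hypothetical names.

```java
// Hypothetical illustration of the epsilon convergence rule, not Spark's code.
class ConvergenceSketch {
    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // True only if every center moved less than epsilon between two iterations.
    static boolean converged(double[][] oldCenters, double[][] newCenters, double epsilon) {
        for (int c = 0; c < oldCenters.length; c++) {
            if (distance(oldCenters[c], newCenters[c]) >= epsilon) {
                return false; // one center still moving is enough to keep iterating
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] before = {{0.0, 0.0}, {5.0, 5.0}};
        double[][] smallMove = {{0.0, 0.00005}, {5.0, 5.0}};
        double[][] largeMove = {{0.0, 0.01}, {5.0, 5.0}};
        System.out.println(converged(before, smallMove, 1e-4)); // prints true
        System.out.println(converged(before, largeMove, 1e-4)); // prints false
    }
}
```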
  - getSeed
```
public long getSeed()
```
    The random seed for cluster initialization.
    
    Returns:
    
    (undocumented)
  - setSeed
```
public KMeans setSeed(long seed)
```
    Set the random seed for cluster initialization.
    
    Parameters:
    
    seed - (undocumented)
    
    Returns:
    
    (undocumented)
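    
    Why a fixed seed matters can be shown with `java.util.Random` alone: the same seed always produces the same sequence of draws, so any initialization that depends on those draws is repeatable. The sketch below is plain Java, not Spark's sampling code; `SeedSketch` and `pickIndices` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of seeded, repeatable draws; not Spark's code.
class SeedSketch {
    // Draw k candidate indices into a data set of size n using a fixed seed.
    static int[] pickIndices(int n, int k, long seed) {
        Random rng = new Random(seed);
        int[] picks = new int[k];
        for (int i = 0; i < k; i++) {
            picks[i] = rng.nextInt(n);
        }
        return picks;
    }

    public static void main(String[] args) {
        int[] first = pickIndices(1000, 3, 42L);
        int[] second = pickIndices(1000, 3, 42L);
        System.out.println(Arrays.equals(first, second)); // prints true
    }
}
```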
  - setInitialModel
```
public KMeans setInitialModel(KMeansModel model)
```
    Set the initial starting point, bypassing the random initialization or k-means||. The condition model.k == this.k must be met; otherwise an IllegalArgumentException is thrown.
    
    Parameters:
    
    model - (undocumented)
    
    Returns:
    
    (undocumented)
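    
    The precondition on `k` amounts to a guard of the following shape. This is a plain-Java sketch of the documented condition only; `InitialModelCheckSketch` and `requireMatchingK` are hypothetical names, and the exception message is invented for the example.

```java
// Hypothetical illustration of the model.k == this.k precondition; not Spark's code.
class InitialModelCheckSketch {
    // initialCenters stands in for the centers of an initial model, so its length plays the role of model.k.
    static void requireMatchingK(double[][] initialCenters, int configuredK) {
        if (initialCenters.length != configuredK) {
            throw new IllegalArgumentException(
                "initial model has " + initialCenters.length
                    + " centers but k is " + configuredK);
        }
    }

    public static void main(String[] args) {
        double[][] twoCenters = {{0.0, 0.0}, {1.0, 1.0}};
        requireMatchingK(twoCenters, 2); // passes silently
        try {
            requireMatchingK(twoCenters, 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```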
  - run
```
public KMeansModel run(RDD<Vector> data)
```
    Train a K-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm.
    
    Parameters:
    
    data - (undocumented)
    
    Returns:
    
    (undocumented)
