public class Statistics
extends java.lang.Object
Constructor and Description |
---|
Statistics() |
Modifier and Type | Method and Description |
---|---|
static ChiSqTestResult[] | chiSqTest(JavaRDD<LabeledPoint> data): Java-friendly version of chiSqTest(). |
static ChiSqTestResult | chiSqTest(Matrix observed): Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0. |
static ChiSqTestResult[] | chiSqTest(RDD<LabeledPoint> data): Conduct Pearson's independence test for every feature against the label across the input RDD. |
static ChiSqTestResult | chiSqTest(Vector observed): Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size. |
static ChiSqTestResult | chiSqTest(Vector observed, Vector expected): Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution. |
static MultivariateStatisticalSummary | colStats(RDD<Vector> X): Computes column-wise summary statistics for the input RDD[Vector]. |
static double | corr(JavaRDD<java.lang.Double> x, JavaRDD<java.lang.Double> y): Java-friendly version of corr(). |
static double | corr(JavaRDD<java.lang.Double> x, JavaRDD<java.lang.Double> y, java.lang.String method): Java-friendly version of corr(). |
static double | corr(RDD<java.lang.Object> x, RDD<java.lang.Object> y): Compute the Pearson correlation for the input RDDs. |
static double | corr(RDD<java.lang.Object> x, RDD<java.lang.Object> y, java.lang.String method): Compute the correlation for the input RDDs using the specified method. |
static Matrix | corr(RDD<Vector> X): Compute the Pearson correlation matrix for the input RDD of Vectors. |
static Matrix | corr(RDD<Vector> X, java.lang.String method): Compute the correlation matrix for the input RDD of Vectors using the specified method. |
static KolmogorovSmirnovTestResult | kolmogorovSmirnovTest(JavaDoubleRDD data, java.lang.String distName, double... params): Java-friendly version of kolmogorovSmirnovTest(). |
static KolmogorovSmirnovTestResult | kolmogorovSmirnovTest(JavaDoubleRDD data, java.lang.String distName, scala.collection.Seq<java.lang.Object> params): Java-friendly version of kolmogorovSmirnovTest(). |
static KolmogorovSmirnovTestResult | kolmogorovSmirnovTest(RDD<java.lang.Object> data, scala.Function1<java.lang.Object,java.lang.Object> cdf): Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. |
static KolmogorovSmirnovTestResult | kolmogorovSmirnovTest(RDD<java.lang.Object> data, java.lang.String distName, double... params): Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. |
static KolmogorovSmirnovTestResult | kolmogorovSmirnovTest(RDD<java.lang.Object> data, java.lang.String distName, scala.collection.Seq<java.lang.Object> params): Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. |
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<java.lang.Object> data, java.lang.String distName, double... params)
Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality.
Parameters:
data - an RDD[Double] containing the sample of data to test
distName - a String name for a theoretical distribution
params - Double* specifying the parameters to be used for the theoretical distribution
Returns:
KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(JavaDoubleRDD data, java.lang.String distName, double... params)
Java-friendly version of kolmogorovSmirnovTest()
public static MultivariateStatisticalSummary colStats(RDD<Vector> X)
Computes column-wise summary statistics for the input RDD[Vector].
Parameters:
X - an RDD[Vector] for which column-wise summary statistics are to be computed.
Returns:
MultivariateStatisticalSummary object containing column-wise summary statistics.
public static Matrix corr(RDD<Vector> X)
Compute the Pearson correlation matrix for the input RDD of Vectors.
Parameters:
X - an RDD[Vector] for which the correlation matrix is to be computed.
public static Matrix corr(RDD<Vector> X, java.lang.String method)
Compute the correlation matrix for the input RDD of Vectors using the specified method. Methods currently supported: pearson (default), spearman.
Note that for Spearman, a rank correlation, we need to create an RDD[Double] for each column and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to avoid recomputing the common lineage.
Parameters:
X - an RDD[Vector] for which the correlation matrix is to be computed.
method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman
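To make the Spearman note above concrete, here is a minimal plain-Java sketch of why ranks are needed: the Spearman correlation is the Pearson correlation of the rank-transformed values. It is not Spark's implementation. The class SpearmanSketch and its methods are hypothetical names, it works on in-memory arrays rather than RDDs, and giving tied values their average rank is an assumption about tie handling.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of Spark: Spearman computed as Pearson on ranks.
final class SpearmanSketch {

    // Pearson correlation of two non-empty arrays of equal length.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("inputs must be non-empty and of equal length");
        }
        double meanX = Arrays.stream(x).average().getAsDouble();
        double meanY = Arrays.stream(y).average().getAsDouble();
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    // 1-based ranks; tied values share the average of their positions (assumption).
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sorting is the costly step that the note above refers to.
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));
        double[] rank = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double average = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                rank[order[k]] = average;
            }
            start = end + 1;
        }
        return rank;
    }

    // Spearman correlation: rank both inputs, then apply Pearson.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0};
        double[] y = {1.0, 8.0, 27.0, 64.0, 125.0}; // monotone but non-linear in x
        System.out.println("pearson  = " + pearson(x, y));
        System.out.println("spearman = " + spearman(x, y));
    }
}
```

Because y increases whenever x does, the ranks of the two arrays coincide and the Spearman value is 1, while the Pearson value on the raw data is lower.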
public static double corr(RDD<java.lang.Object> x, RDD<java.lang.Object> y)
Compute the Pearson correlation for the input RDDs.
Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
Parameters:
x - RDD[Double] of the same cardinality as y.
y - RDD[Double] of the same cardinality as x.
public static double corr(JavaRDD<java.lang.Double> x, JavaRDD<java.lang.Double> y)
Java-friendly version of corr()
Parameters:
x - (undocumented)
y - (undocumented)
public static double corr(RDD<java.lang.Object> x, RDD<java.lang.Object> y, java.lang.String method)
Compute the correlation for the input RDDs using the specified method. Methods currently supported: pearson (default), spearman.
Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
Parameters:
x - RDD[Double] of the same cardinality as y.
y - RDD[Double] of the same cardinality as x.
method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman
public static double corr(JavaRDD<java.lang.Double> x, JavaRDD<java.lang.Double> y, java.lang.String method)
Java-friendly version of corr()
Parameters:
x - (undocumented)
y - (undocumented)
method - (undocumented)
public static ChiSqTestResult chiSqTest(Vector observed, Vector expected)
Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution.
Note: the two input Vectors need to have the same size. observed cannot contain negative values. expected cannot contain nonpositive values.
Parameters:
observed - Vector containing the observed categorical counts/relative frequencies.
expected - Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.
public static ChiSqTestResult chiSqTest(Vector observed)
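Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.
Note: observed cannot contain negative values.
Parameters:
observed - Vector containing the observed categorical counts/relative frequencies.
public static ChiSqTestResult chiSqTest(Matrix observed)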
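Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0.
Parameters:
observed - The contingency matrix (containing either counts or relative frequencies).
public static ChiSqTestResult[] chiSqTest(RDD<LabeledPoint> data)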
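Conduct Pearson's independence test for every feature against the label across the input RDD.
Parameters:
data - an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value.
public static ChiSqTestResult[] chiSqTest(JavaRDD<LabeledPoint> data)
Java-friendly version of chiSqTest()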
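The chiSqTest variants above can be illustrated with a minimal plain-Java sketch of the Pearson chi-squared statistic. It is not Spark's implementation. The class ChiSqSketch and its method names are hypothetical, plain arrays stand in for Vector and Matrix, and only the statistic is computed (no p-value or degrees of freedom). The input checks and the rescaling of expected mirror the constraints documented above.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: computes Pearson chi-squared statistics only.
final class ChiSqSketch {

    // Goodness of fit of observed against expected; expected is rescaled to the observed sum.
    static double goodnessOfFit(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("observed and expected must have the same size");
        }
        double observedSum = 0.0;
        double expectedSum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] < 0.0) {
                throw new IllegalArgumentException("observed cannot contain negative values");
            }
            if (expected[i] <= 0.0) {
                throw new IllegalArgumentException("expected cannot contain nonpositive values");
            }
            observedSum += observed[i];
            expectedSum += expected[i];
        }
        double scale = observedSum / expectedSum;
        double statistic = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double e = expected[i] * scale;
            double diff = observed[i] - e;
            statistic += diff * diff / e;
        }
        return statistic;
    }

    // Goodness of fit against the uniform distribution (1 / observed.length per category).
    static double goodnessOfFit(double[] observed) {
        double[] uniform = new double[observed.length];
        Arrays.fill(uniform, 1.0 / observed.length);
        return goodnessOfFit(observed, uniform);
    }

    // Independence test statistic for a rectangular contingency matrix.
    static double independence(double[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (observed[i][j] < 0.0) {
                    throw new IllegalArgumentException("matrix cannot contain negative entries");
                }
                rowSums[i] += observed[i][j];
                colSums[j] += observed[i][j];
                total += observed[i][j];
            }
        }
        double statistic = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (rowSums[i] == 0.0 || colSums[j] == 0.0) {
                    throw new IllegalArgumentException("rows and columns cannot sum to 0");
                }
                // Expected count under independence of the row and column variables.
                double e = rowSums[i] * colSums[j] / total;
                double diff = observed[i][j] - e;
                statistic += diff * diff / e;
            }
        }
        return statistic;
    }

    public static void main(String[] args) {
        System.out.println(goodnessOfFit(new double[]{10.0, 20.0, 30.0}));
        System.out.println(independence(new double[][]{{10.0, 20.0}, {30.0, 40.0}}));
    }
}
```

Passing expected = {1, 1, 1} gives the same statistic as the uniform variant, because rescaling makes only the proportions of expected matter.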
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<java.lang.Object> data, scala.Function1<java.lang.Object,java.lang.Object> cdf)
Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution.
Parameters:
data - an RDD[Double] containing the sample of data to test
cdf - a Double => Double function to calculate the theoretical CDF at a given value
Returns:
KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<java.lang.Object> data, java.lang.String distName, scala.collection.Seq<java.lang.Object> params)
Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality.
Parameters:
data - an RDD[Double] containing the sample of data to test
distName - a String name for a theoretical distribution
params - Double* specifying the parameters to be used for the theoretical distribution
Returns:
KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(JavaDoubleRDD data, java.lang.String distName, scala.collection.Seq<java.lang.Object> params)
Java-friendly version of kolmogorovSmirnovTest()
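As an illustration of the test statistic behind the kolmogorovSmirnovTest variants that take a cdf, the following plain-Java sketch computes the one-sample, two-sided KS statistic: the largest gap between the empirical CDF of the sample and a theoretical CDF. It is not Spark's implementation. The class KsSketch and its statistic method are hypothetical, the data is an in-memory array rather than an RDD, and no p-value is computed.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper, not part of Spark: one-sample, two-sided KS statistic only.
final class KsSketch {

    // Largest absolute difference between the empirical CDF and the theoretical cdf.
    static double statistic(double[] sample, DoubleUnaryOperator cdf) {
        if (sample.length == 0) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            // The empirical CDF jumps from i/n to (i+1)/n at sorted[i]; check both sides.
            double above = (i + 1.0) / n - f;
            double below = f - (double) i / n;
            d = Math.max(d, Math.max(above, below));
        }
        return d;
    }

    public static void main(String[] args) {
        // Theoretical CDF of the uniform distribution on [0, 1].
        DoubleUnaryOperator uniformCdf = v -> Math.max(0.0, Math.min(1.0, v));
        double[] sample = {0.1, 0.4, 0.7};
        System.out.println("D = " + statistic(sample, uniformCdf));
    }
}
```

The cdf argument plays the same role as the Double => Double function described above: any function that returns the theoretical cumulative probability at a given value can be supplied.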