public class Dataset<T>
extends Object
implements scala.Serializable
DataFrame, which is a Dataset of Row.
 
 Operations available on Datasets are divided into transformations and actions. Transformations
 are the ones that produce new Datasets, and actions are the ones that trigger computation and
 return results. Example transformations include map, filter, select, and aggregate (groupBy).
 Example actions count, show, or writing data out to file systems.
 
 Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally,
 a Dataset represents a logical plan that describes the computation required to produce the data.
 When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a
 physical plan for efficient execution in a parallel and distributed manner. To explore the
 logical plan as well as optimized physical plan, use the explain function.
 
 To efficiently support domain-specific objects, an Encoder is required. The encoder maps
 the domain specific type T to Spark's internal type system. For example, given a class Person
 with two fields, name (string) and age (int), an encoder is used to tell Spark to generate
 code at runtime to serialize the Person object into a binary structure. This binary structure
 often has much lower memory footprint as well as are optimized for efficiency in data processing
 (e.g. in a columnar format). To understand the internal binary representation for data, use the
 schema function.
 
 There are typically two ways to create a Dataset. The most common way is by pointing Spark
 to some files on storage systems, using the read function available on a SparkSession.
 
   val people = spark.read.parquet("...").as[Person]  // Scala
   Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
 
 Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
   val names = people.map(_.name)  // in Scala; names is a Dataset[String]
   Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING));
 
 
 Dataset operations can also be untyped, through various domain-specific-language (DSL)
 functions defined in: Dataset (this class), Column, and functions. These operations
 are very similar to the operations available in the data frame abstraction in R or Python.
 
 To select a column from the Dataset, use apply method in Scala and col in Java.
 
   val ageCol = people("age")  // in Scala
   Column ageCol = people.col("age"); // in Java
 
 
 Note that the Column type can also be manipulated through its various functions.
 
   // The following creates a new column that increases everybody's age by 10.
   people("age") + 10  // in Scala
   people.col("age").plus(10);  // in Java
 
 A more concrete example in Scala:
   // To create Dataset[Row] using SparkSession
   val people = spark.read.parquet("...")
   val department = spark.read.parquet("...")
   people.filter("age > 30")
     .join(department, people("deptId") === department("id"))
     .groupBy(department("name"), people("gender"))
     .agg(avg(people("salary")), max(people("age")))
 
 and in Java:
   // To create Dataset<Row> using SparkSession
   Dataset<Row> people = spark.read().parquet("...");
   Dataset<Row> department = spark.read().parquet("...");
   people.filter(people.col("age").gt(30))
     .join(department, people.col("deptId").equalTo(department.col("id")))
     .groupBy(department.col("name"), people.col("gender"))
     .agg(avg(people.col("salary")), max(people.col("age")));
 
 | Constructor and Description | 
|---|
Dataset(SparkSession sparkSession,
       org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan,
       Encoder<T> encoder)  | 
Dataset(SQLContext sqlContext,
       org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan,
       Encoder<T> encoder)  | 
| Modifier and Type | Method and Description | 
|---|---|
Dataset<Row> | 
agg(Column expr,
   Column... exprs)
Aggregates on the entire Dataset without groups. 
 | 
Dataset<Row> | 
agg(Column expr,
   scala.collection.Seq<Column> exprs)
Aggregates on the entire Dataset without groups. 
 | 
Dataset<Row> | 
agg(scala.collection.immutable.Map<String,String> exprs)
(Scala-specific) Aggregates on the entire Dataset without groups. 
 | 
Dataset<Row> | 
agg(java.util.Map<String,String> exprs)
(Java-specific) Aggregates on the entire Dataset without groups. 
 | 
Dataset<Row> | 
agg(scala.Tuple2<String,String> aggExpr,
   scala.collection.Seq<scala.Tuple2<String,String>> aggExprs)
(Scala-specific) Aggregates on the entire Dataset without groups. 
 | 
Dataset<T> | 
alias(String alias)
Returns a new Dataset with an alias set. 
 | 
Dataset<T> | 
alias(scala.Symbol alias)
(Scala-specific) Returns a new Dataset with an alias set. 
 | 
Column | 
apply(String colName)
Selects column based on the column name and returns it as a  
Column. | 
<U> Dataset<U> | 
as(Encoder<U> evidence$2)
:: Experimental ::
 Returns a new Dataset where each record has been mapped on to the specified type. 
 | 
Dataset<T> | 
as(String alias)
Returns a new Dataset with an alias set. 
 | 
Dataset<T> | 
as(scala.Symbol alias)
(Scala-specific) Returns a new Dataset with an alias set. 
 | 
Dataset<T> | 
cache()
Persist this Dataset with the default storage level ( 
MEMORY_AND_DISK). | 
Dataset<T> | 
checkpoint()
Eagerly checkpoint a Dataset and return the new Dataset. 
 | 
Dataset<T> | 
checkpoint(boolean eager)
Returns a checkpointed version of this Dataset. 
 | 
scala.reflect.ClassTag<T> | 
classTag()  | 
Dataset<T> | 
coalesce(int numPartitions)
Returns a new Dataset that has exactly  
numPartitions partitions, when the fewer partitions
 are requested. | 
Column | 
col(String colName)
Selects column based on the column name and returns it as a  
Column. | 
Object | 
collect()
Returns an array that contains all rows in this Dataset. 
 | 
java.util.List<T> | 
collectAsList()
Returns a Java list that contains all rows in this Dataset. 
 | 
Column | 
colRegex(String colName)
Selects column based on the column name specified as a regex and returns it as  
Column. | 
String[] | 
columns()
Returns all column names as an array. 
 | 
long | 
count()
Returns the number of rows in the Dataset. 
 | 
void | 
createGlobalTempView(String viewName)
Creates a global temporary view using the given name. 
 | 
void | 
createOrReplaceGlobalTempView(String viewName)
Creates or replaces a global temporary view using the given name. 
 | 
void | 
createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name. 
 | 
void | 
createTempView(String viewName)
Creates a local temporary view using the given name. 
 | 
Dataset<Row> | 
crossJoin(Dataset<?> right)
Explicit cartesian join with another  
DataFrame. | 
RelationalGroupedDataset | 
cube(Column... cols)
Create a multi-dimensional cube for the current Dataset using the specified columns,
 so we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
cube(scala.collection.Seq<Column> cols)
Create a multi-dimensional cube for the current Dataset using the specified columns,
 so we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
cube(String col1,
    scala.collection.Seq<String> cols)
Create a multi-dimensional cube for the current Dataset using the specified columns,
 so we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
cube(String col1,
    String... cols)
Create a multi-dimensional cube for the current Dataset using the specified columns,
 so we can run aggregation on them. 
 | 
Dataset<Row> | 
describe(scala.collection.Seq<String> cols)
Computes basic statistics for numeric and string columns, including count, mean, stddev, min,
 and max. 
 | 
Dataset<Row> | 
describe(String... cols)
Computes basic statistics for numeric and string columns, including count, mean, stddev, min,
 and max. 
 | 
Dataset<T> | 
distinct()
Returns a new Dataset that contains only the unique rows from this Dataset. 
 | 
Dataset<Row> | 
drop(Column col)
Returns a new Dataset with a column dropped. 
 | 
Dataset<Row> | 
drop(scala.collection.Seq<String> colNames)
Returns a new Dataset with columns dropped. 
 | 
Dataset<Row> | 
drop(String... colNames)
Returns a new Dataset with columns dropped. 
 | 
Dataset<Row> | 
drop(String colName)
Returns a new Dataset with a column dropped. 
 | 
Dataset<T> | 
dropDuplicates()
Returns a new Dataset that contains only the unique rows from this Dataset. 
 | 
Dataset<T> | 
dropDuplicates(scala.collection.Seq<String> colNames)
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only
 the subset of columns. 
 | 
Dataset<T> | 
dropDuplicates(String[] colNames)
Returns a new Dataset with duplicate rows removed, considering only
 the subset of columns. 
 | 
Dataset<T> | 
dropDuplicates(String col1,
              scala.collection.Seq<String> cols)
Returns a new  
Dataset with duplicate rows removed, considering only
 the subset of columns. | 
Dataset<T> | 
dropDuplicates(String col1,
              String... cols)
Returns a new  
Dataset with duplicate rows removed, considering only
 the subset of columns. | 
scala.Tuple2<String,String>[] | 
dtypes()
Returns all column names and their data types as an array. 
 | 
Dataset<T> | 
except(Dataset<T> other)
Returns a new Dataset containing rows in this Dataset but not in another Dataset. 
 | 
Dataset<T> | 
exceptAll(Dataset<T> other)
Returns a new Dataset containing rows in this Dataset but not in another Dataset while
 preserving the duplicates. 
 | 
void | 
explain()
Prints the physical plan to the console for debugging purposes. 
 | 
void | 
explain(boolean extended)
Prints the plans (logical and physical) to the console for debugging purposes. 
 | 
<A extends scala.Product> | 
explode(scala.collection.Seq<Column> input,
       scala.Function1<Row,scala.collection.TraversableOnce<A>> f,
       scala.reflect.api.TypeTags.TypeTag<A> evidence$4)
Deprecated. 
 
use flatMap() or select() with functions.explode() instead. Since 2.0.0. 
 | 
<A,B> Dataset<Row> | 
explode(String inputColumn,
       String outputColumn,
       scala.Function1<A,scala.collection.TraversableOnce<B>> f,
       scala.reflect.api.TypeTags.TypeTag<B> evidence$5)
Deprecated. 
 
use flatMap() or select() with functions.explode() instead. Since 2.0.0. 
 | 
Dataset<T> | 
filter(Column condition)
Filters rows using the given condition. 
 | 
Dataset<T> | 
filter(FilterFunction<T> func)
:: Experimental ::
 (Java-specific)
 Returns a new Dataset that only contains elements where  
func returns true. | 
Dataset<T> | 
filter(scala.Function1<T,Object> func)
:: Experimental ::
 (Scala-specific)
 Returns a new Dataset that only contains elements where  
func returns true. | 
Dataset<T> | 
filter(String conditionExpr)
Filters rows using the given SQL expression. 
 | 
T | 
first()
Returns the first row. 
 | 
<U> Dataset<U> | 
flatMap(FlatMapFunction<T,U> f,
       Encoder<U> encoder)
:: Experimental ::
 (Java-specific)
 Returns a new Dataset by first applying a function to all elements of this Dataset,
 and then flattening the results. 
 | 
<U> Dataset<U> | 
flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func,
       Encoder<U> evidence$8)
:: Experimental ::
 (Scala-specific)
 Returns a new Dataset by first applying a function to all elements of this Dataset,
 and then flattening the results. 
 | 
void | 
foreach(ForeachFunction<T> func)
(Java-specific)
 Runs  
func on each element of this Dataset. | 
void | 
foreach(scala.Function1<T,scala.runtime.BoxedUnit> f)
Applies a function  
f to all rows. | 
void | 
foreachPartition(ForeachPartitionFunction<T> func)
(Java-specific)
 Runs  
func on each partition of this Dataset. | 
void | 
foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)
Applies a function  
f to each partition of this Dataset. | 
RelationalGroupedDataset | 
groupBy(Column... cols)
Groups the Dataset using the specified columns, so we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
groupBy(scala.collection.Seq<Column> cols)
Groups the Dataset using the specified columns, so we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
groupBy(String col1,
       scala.collection.Seq<String> cols)
Groups the Dataset using the specified columns, so that we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
groupBy(String col1,
       String... cols)
Groups the Dataset using the specified columns, so that we can run aggregation on them. 
 | 
<K> KeyValueGroupedDataset<K,T> | 
groupByKey(scala.Function1<T,K> func,
          Encoder<K> evidence$3)
:: Experimental ::
 (Scala-specific)
 Returns a  
KeyValueGroupedDataset where the data is grouped by the given key func. | 
<K> KeyValueGroupedDataset<K,T> | 
groupByKey(MapFunction<T,K> func,
          Encoder<K> encoder)
:: Experimental ::
 (Java-specific)
 Returns a  
KeyValueGroupedDataset where the data is grouped by the given key func. | 
T | 
head()
Returns the first row. 
 | 
Object | 
head(int n)
Returns the first  
n rows. | 
Dataset<T> | 
hint(String name,
    Object... parameters)
Specifies some hint on the current Dataset. 
 | 
Dataset<T> | 
hint(String name,
    scala.collection.Seq<Object> parameters)
Specifies some hint on the current Dataset. 
 | 
String[] | 
inputFiles()
Returns a best-effort snapshot of the files that compose this Dataset. 
 | 
Dataset<T> | 
intersect(Dataset<T> other)
Returns a new Dataset containing rows only in both this Dataset and another Dataset. 
 | 
Dataset<T> | 
intersectAll(Dataset<T> other)
Returns a new Dataset containing rows only in both this Dataset and another Dataset while
 preserving the duplicates. 
 | 
boolean | 
isEmpty()
Returns true if the  
Dataset is empty. | 
boolean | 
isLocal()
Returns true if the  
collect and take methods can be run locally
 (without any Spark executors). | 
boolean | 
isStreaming()
Returns true if this Dataset contains one or more sources that continuously
 return data as it arrives. 
 | 
JavaRDD<T> | 
javaRDD()
Returns the content of the Dataset as a  
JavaRDD of Ts. | 
Dataset<Row> | 
join(Dataset<?> right)
Join with another  
DataFrame. | 
Dataset<Row> | 
join(Dataset<?> right,
    Column joinExprs)
Inner join with another  
DataFrame, using the given join expression. | 
Dataset<Row> | 
join(Dataset<?> right,
    Column joinExprs,
    String joinType)
Join with another  
DataFrame, using the given join expression. | 
Dataset<Row> | 
join(Dataset<?> right,
    scala.collection.Seq<String> usingColumns)
Inner equi-join with another  
DataFrame using the given columns. | 
Dataset<Row> | 
join(Dataset<?> right,
    scala.collection.Seq<String> usingColumns,
    String joinType)
Equi-join with another  
DataFrame using the given columns. | 
Dataset<Row> | 
join(Dataset<?> right,
    String usingColumn)
Inner equi-join with another  
DataFrame using the given column. | 
<U> Dataset<scala.Tuple2<T,U>> | 
joinWith(Dataset<U> other,
        Column condition)
:: Experimental ::
 Using inner equi-join to join this Dataset returning a  
Tuple2 for each pair
 where condition evaluates to true. | 
<U> Dataset<scala.Tuple2<T,U>> | 
joinWith(Dataset<U> other,
        Column condition,
        String joinType)
:: Experimental ::
 Joins this Dataset returning a  
Tuple2 for each pair where condition evaluates to
 true. | 
Dataset<T> | 
limit(int n)
Returns a new Dataset by taking the first  
n rows. | 
Dataset<T> | 
localCheckpoint()
Eagerly locally checkpoints a Dataset and return the new Dataset. 
 | 
Dataset<T> | 
localCheckpoint(boolean eager)
Locally checkpoints a Dataset and return the new Dataset. 
 | 
<U> Dataset<U> | 
map(scala.Function1<T,U> func,
   Encoder<U> evidence$6)
:: Experimental ::
 (Scala-specific)
 Returns a new Dataset that contains the result of applying  
func to each element. | 
<U> Dataset<U> | 
map(MapFunction<T,U> func,
   Encoder<U> encoder)
:: Experimental ::
 (Java-specific)
 Returns a new Dataset that contains the result of applying  
func to each element. | 
<U> Dataset<U> | 
mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func,
             Encoder<U> evidence$7)
:: Experimental ::
 (Scala-specific)
 Returns a new Dataset that contains the result of applying  
func to each partition. | 
<U> Dataset<U> | 
mapPartitions(MapPartitionsFunction<T,U> f,
             Encoder<U> encoder)
:: Experimental ::
 (Java-specific)
 Returns a new Dataset that contains the result of applying  
f to each partition. | 
DataFrameNaFunctions | 
na()
Returns a  
DataFrameNaFunctions for working with missing data. | 
static Dataset<Row> | 
ofRows(SparkSession sparkSession,
      org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)  | 
Dataset<T> | 
orderBy(Column... sortExprs)
Returns a new Dataset sorted by the given expressions. 
 | 
Dataset<T> | 
orderBy(scala.collection.Seq<Column> sortExprs)
Returns a new Dataset sorted by the given expressions. 
 | 
Dataset<T> | 
orderBy(String sortCol,
       scala.collection.Seq<String> sortCols)
Returns a new Dataset sorted by the given expressions. 
 | 
Dataset<T> | 
orderBy(String sortCol,
       String... sortCols)
Returns a new Dataset sorted by the given expressions. 
 | 
Dataset<T> | 
persist()
Persist this Dataset with the default storage level ( 
MEMORY_AND_DISK). | 
Dataset<T> | 
persist(StorageLevel newLevel)
Persist this Dataset with the given storage level. 
 | 
void | 
printSchema()
Prints the schema to the console in a nice tree format. 
 | 
org.apache.spark.sql.execution.QueryExecution | 
queryExecution()  | 
Dataset<T>[] | 
randomSplit(double[] weights)
Randomly splits this Dataset with the provided weights. 
 | 
Dataset<T>[] | 
randomSplit(double[] weights,
           long seed)
Randomly splits this Dataset with the provided weights. 
 | 
java.util.List<Dataset<T>> | 
randomSplitAsList(double[] weights,
                 long seed)
Returns a Java list that contains randomly split Dataset with the provided weights. 
 | 
RDD<T> | 
rdd()
Represents the content of the Dataset as an  
RDD of T. | 
T | 
reduce(scala.Function2<T,T,T> func)
:: Experimental ::
 (Scala-specific)
 Reduces the elements of this Dataset using the specified binary function. 
 | 
T | 
reduce(ReduceFunction<T> func)
:: Experimental ::
 (Java-specific)
 Reduces the elements of this Dataset using the specified binary function. 
 | 
void | 
registerTempTable(String tableName)
Deprecated. 
 
Use createOrReplaceTempView(viewName) instead. Since 2.0.0. 
 | 
Dataset<T> | 
repartition(Column... partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions, using
  
spark.sql.shuffle.partitions as number of partitions. | 
Dataset<T> | 
repartition(int numPartitions)
Returns a new Dataset that has exactly  
numPartitions partitions. | 
Dataset<T> | 
repartition(int numPartitions,
           Column... partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into
  
numPartitions. | 
Dataset<T> | 
repartition(int numPartitions,
           scala.collection.Seq<Column> partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into
  
numPartitions. | 
Dataset<T> | 
repartition(scala.collection.Seq<Column> partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions, using
  
spark.sql.shuffle.partitions as number of partitions. | 
Dataset<T> | 
repartitionByRange(Column... partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions, using
  
spark.sql.shuffle.partitions as number of partitions. | 
Dataset<T> | 
repartitionByRange(int numPartitions,
                  Column... partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into
  
numPartitions. | 
Dataset<T> | 
repartitionByRange(int numPartitions,
                  scala.collection.Seq<Column> partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into
  
numPartitions. | 
Dataset<T> | 
repartitionByRange(scala.collection.Seq<Column> partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions, using
  
spark.sql.shuffle.partitions as number of partitions. | 
RelationalGroupedDataset | 
rollup(Column... cols)
Create a multi-dimensional rollup for the current Dataset using the specified columns,
 so we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
rollup(scala.collection.Seq<Column> cols)
Create a multi-dimensional rollup for the current Dataset using the specified columns,
 so we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
rollup(String col1,
      scala.collection.Seq<String> cols)
Create a multi-dimensional rollup for the current Dataset using the specified columns,
 so we can run aggregation on them. 
 | 
RelationalGroupedDataset | 
rollup(String col1,
      String... cols)
Create a multi-dimensional rollup for the current Dataset using the specified columns,
 so we can run aggregation on them. 
 | 
Dataset<T> | 
sample(boolean withReplacement,
      double fraction)
Returns a new  
Dataset by sampling a fraction of rows, using a random seed. | 
Dataset<T> | 
sample(boolean withReplacement,
      double fraction,
      long seed)
Returns a new  
Dataset by sampling a fraction of rows, using a user-supplied seed. | 
Dataset<T> | 
sample(double fraction)
Returns a new  
Dataset by sampling a fraction of rows (without replacement),
 using a random seed. | 
Dataset<T> | 
sample(double fraction,
      long seed)
Returns a new  
Dataset by sampling a fraction of rows (without replacement),
 using a user-supplied seed. | 
StructType | 
schema()
Returns the schema of this Dataset. 
 | 
Dataset<Row> | 
select(Column... cols)
Selects a set of column based expressions. 
 | 
Dataset<Row> | 
select(scala.collection.Seq<Column> cols)
Selects a set of column based expressions. 
 | 
Dataset<Row> | 
select(String col,
      scala.collection.Seq<String> cols)
Selects a set of columns. 
 | 
Dataset<Row> | 
select(String col,
      String... cols)
Selects a set of columns. 
 | 
<U1> Dataset<U1> | 
select(TypedColumn<T,U1> c1)
:: Experimental ::
 Returns a new Dataset by computing the given  
Column expression for each element. | 
<U1,U2> Dataset<scala.Tuple2<U1,U2>> | 
select(TypedColumn<T,U1> c1,
      TypedColumn<T,U2> c2)
:: Experimental ::
 Returns a new Dataset by computing the given  
Column expressions for each element. | 
<U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> | 
select(TypedColumn<T,U1> c1,
      TypedColumn<T,U2> c2,
      TypedColumn<T,U3> c3)
:: Experimental ::
 Returns a new Dataset by computing the given  
Column expressions for each element. | 
<U1,U2,U3,U4> | 
select(TypedColumn<T,U1> c1,
      TypedColumn<T,U2> c2,
      TypedColumn<T,U3> c3,
      TypedColumn<T,U4> c4)
:: Experimental ::
 Returns a new Dataset by computing the given  
Column expressions for each element. | 
<U1,U2,U3,U4,U5> | 
select(TypedColumn<T,U1> c1,
      TypedColumn<T,U2> c2,
      TypedColumn<T,U3> c3,
      TypedColumn<T,U4> c4,
      TypedColumn<T,U5> c5)
:: Experimental ::
 Returns a new Dataset by computing the given  
Column expressions for each element. | 
Dataset<Row> | 
selectExpr(scala.collection.Seq<String> exprs)
Selects a set of SQL expressions. 
 | 
Dataset<Row> | 
selectExpr(String... exprs)
Selects a set of SQL expressions. 
 | 
void | 
show()
Displays the top 20 rows of Dataset in a tabular form. 
 | 
void | 
show(boolean truncate)
Displays the top 20 rows of Dataset in a tabular form. 
 | 
void | 
show(int numRows)
Displays the Dataset in a tabular form. 
 | 
void | 
show(int numRows,
    boolean truncate)
Displays the Dataset in a tabular form. 
 | 
void | 
show(int numRows,
    int truncate)
Displays the Dataset in a tabular form. 
 | 
void | 
show(int numRows,
    int truncate,
    boolean vertical)
Displays the Dataset in a tabular form. 
 | 
Dataset<T> | 
sort(Column... sortExprs)
Returns a new Dataset sorted by the given expressions. 
 | 
Dataset<T> | 
sort(scala.collection.Seq<Column> sortExprs)
Returns a new Dataset sorted by the given expressions. 
 | 
Dataset<T> | 
sort(String sortCol,
    scala.collection.Seq<String> sortCols)
Returns a new Dataset sorted by the specified column, all in ascending order. 
 | 
Dataset<T> | 
sort(String sortCol,
    String... sortCols)
Returns a new Dataset sorted by the specified column, all in ascending order. 
 | 
Dataset<T> | 
sortWithinPartitions(Column... sortExprs)
Returns a new Dataset with each partition sorted by the given expressions. 
 | 
Dataset<T> | 
sortWithinPartitions(scala.collection.Seq<Column> sortExprs)
Returns a new Dataset with each partition sorted by the given expressions. 
 | 
Dataset<T> | 
sortWithinPartitions(String sortCol,
                    scala.collection.Seq<String> sortCols)
Returns a new Dataset with each partition sorted by the given expressions. 
 | 
Dataset<T> | 
sortWithinPartitions(String sortCol,
                    String... sortCols)
Returns a new Dataset with each partition sorted by the given expressions. 
 | 
SparkSession | 
sparkSession()  | 
SQLContext | 
sqlContext()  | 
DataFrameStatFunctions | 
stat()
Returns a  
DataFrameStatFunctions for working statistic functions support. | 
StorageLevel | 
storageLevel()
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted. 
 | 
Dataset<Row> | 
summary(scala.collection.Seq<String> statistics)
Computes specified statistics for numeric and string columns. 
 | 
Dataset<Row> | 
summary(String... statistics)
Computes specified statistics for numeric and string columns. 
 | 
Object | 
take(int n)
Returns the first  
n rows in the Dataset. | 
java.util.List<T> | 
takeAsList(int n)
Returns the first  
n rows in the Dataset as a list. | 
Dataset<Row> | 
toDF()
Converts this strongly typed collection of data to generic Dataframe. 
 | 
Dataset<Row> | 
toDF(scala.collection.Seq<String> colNames)
Converts this strongly typed collection of data to generic  
DataFrame with columns renamed. | 
Dataset<Row> | 
toDF(String... colNames)
Converts this strongly typed collection of data to generic  
DataFrame with columns renamed. | 
JavaRDD<T> | 
toJavaRDD()
Returns the content of the Dataset as a  
JavaRDD of Ts. | 
Dataset<String> | 
toJSON()
Returns the content of the Dataset as a Dataset of JSON strings. 
 | 
java.util.Iterator<T> | 
toLocalIterator()
Returns an iterator that contains all rows in this Dataset. 
 | 
String | 
toString()  | 
<U> Dataset<U> | 
transform(scala.Function1<Dataset<T>,Dataset<U>> t)
Concise syntax for chaining custom transformations. 
 | 
Dataset<T> | 
union(Dataset<T> other)
Returns a new Dataset containing union of rows in this Dataset and another Dataset. 
 | 
Dataset<T> | 
unionAll(Dataset<T> other)
Deprecated. 
 
use union(). Since 2.0.0. 
 | 
Dataset<T> | 
unionByName(Dataset<T> other)
Returns a new Dataset containing union of rows in this Dataset and another Dataset. 
 | 
Dataset<T> | 
unpersist()
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. 
 | 
Dataset<T> | 
unpersist(boolean blocking)
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. 
 | 
Dataset<T> | 
where(Column condition)
Filters rows using the given condition. 
 | 
Dataset<T> | 
where(String conditionExpr)
Filters rows using the given SQL expression. 
 | 
Dataset<Row> | 
withColumn(String colName,
          Column col)
Returns a new Dataset by adding a column or replacing the existing column that has
 the same name. 
 | 
Dataset<Row> | 
withColumnRenamed(String existingName,
                 String newName)
Returns a new Dataset with a column renamed. 
 | 
Dataset<T> | 
withWatermark(String eventTime,
             String delayThreshold)
Defines an event time watermark for this  
Dataset. | 
DataFrameWriter<T> | 
write()
Interface for saving the content of the non-streaming Dataset out into external storage. 
 | 
DataStreamWriter<T> | 
writeStream()
Interface for saving the content of the streaming Dataset out into external storage. 
 | 
public Dataset(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)
public Dataset(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)
public static Dataset<Row> ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)
public Dataset<Row> toDF(String... colNames)
DataFrame with columns renamed.
 This can be quite convenient in conversion from an RDD of tuples into a DataFrame with
 meaningful names. For example:
 
   val rdd: RDD[(Int, String)] = ...
   rdd.toDF()  // this implicit conversion creates a DataFrame with column name `_1` and `_2`
   rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
 
 colNames - (undocumented)public Dataset<T> sortWithinPartitions(String sortCol, String... sortCols)
This is the same operation as "SORT BY" in SQL (Hive QL).
sortCol - (undocumented)sortCols - (undocumented)public Dataset<T> sortWithinPartitions(Column... sortExprs)
This is the same operation as "SORT BY" in SQL (Hive QL).
sortExprs - (undocumented)public Dataset<T> sort(String sortCol, String... sortCols)
   // The following 3 are equivalent
   ds.sort("sortcol")
   ds.sort($"sortcol")
   ds.sort($"sortcol".asc)
 
 sortCol - (undocumented)sortCols - (undocumented)public Dataset<T> sort(Column... sortExprs)
   ds.sort($"col1", $"col2".desc)
 
 sortExprs - (undocumented)public Dataset<T> orderBy(String sortCol, String... sortCols)
sort function.
 sortCol - (undocumented)sortCols - (undocumented)public Dataset<T> orderBy(Column... sortExprs)
sort function.
 sortExprs - (undocumented)public Dataset<T> hint(String name, Object... parameters)
   df1.join(df2.hint("broadcast"))
 
 name - (undocumented)parameters - (undocumented)public Dataset<Row> select(Column... cols)
   ds.select($"colA", $"colB" + 1)
 
 cols - (undocumented)public Dataset<Row> select(String col, String... cols)
select that can only select
 existing columns using column names (i.e. cannot construct expressions).
 
   // The following two are equivalent:
   ds.select("colA", "colB")
   ds.select($"colA", $"colB")
 
 col - (undocumented)cols - (undocumented)public Dataset<Row> selectExpr(String... exprs)
select that accepts
 SQL expressions.
 
   // The following are equivalent:
   ds.selectExpr("colA", "colB as newName", "abs(colC)")
   ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
 
 exprs - (undocumented)public RelationalGroupedDataset groupBy(Column... cols)
RelationalGroupedDataset for all the available aggregate functions.
 
   // Compute the average for all numeric columns grouped by department.
   ds.groupBy($"department").avg()
   // Compute the max age and average salary, grouped by department and gender.
   ds.groupBy($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 
 cols - (undocumented)public RelationalGroupedDataset rollup(Column... cols)
RelationalGroupedDataset for all the available aggregate functions.
 
   // Compute the average for all numeric columns rolluped by department and group.
   ds.rollup($"department", $"group").avg()
   // Compute the max age and average salary, rolluped by department and gender.
   ds.rollup($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 
 cols - (undocumented)public RelationalGroupedDataset cube(Column... cols)
RelationalGroupedDataset for all the available aggregate functions.
 
   // Compute the average for all numeric columns cubed by department and group.
   ds.cube($"department", $"group").avg()
   // Compute the max age and average salary, cubed by department and gender.
   ds.cube($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 
 cols - (undocumented)public RelationalGroupedDataset groupBy(String col1, String... cols)
RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
   // Compute the average for all numeric columns grouped by department.
   ds.groupBy("department").avg()
   // Compute the max age and average salary, grouped by department and gender.
   ds.groupBy($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 col1 - (undocumented)cols - (undocumented)public RelationalGroupedDataset rollup(String col1, String... cols)
RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
   // Compute the average for all numeric columns rolluped by department and group.
   ds.rollup("department", "group").avg()
   // Compute the max age and average salary, rolluped by department and gender.
   ds.rollup($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 
 col1 - (undocumented)cols - (undocumented)public RelationalGroupedDataset cube(String col1, String... cols)
RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
   // Compute the average for all numeric columns cubed by department and group.
   ds.cube("department", "group").avg()
   // Compute the max age and average salary, cubed by department and gender.
   ds.cube($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 col1 - (undocumented)cols - (undocumented)public Dataset<Row> agg(Column expr, Column... exprs)
   // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
   ds.agg(max($"age"), avg($"salary"))
   ds.groupBy().agg(max($"age"), avg($"salary"))
 
 expr - (undocumented)exprs - (undocumented)public Dataset<Row> drop(String... colNames)
This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.
colNames - (undocumented)public Dataset<T> dropDuplicates(String col1, String... cols)
Dataset with duplicate rows removed, considering only
 the subset of columns.
 
 For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it
 will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 withWatermark to limit how late the duplicate data can be and system will accordingly limit
 the state. In addition, too late data older than watermark will be dropped to avoid any
 possibility of duplicates.
 
col1 - (undocumented)cols - (undocumented)public Dataset<Row> describe(String... cols)
 This function is meant for exploratory data analysis, as we make no guarantee about the
 backward compatibility of the schema of the resulting Dataset. If you want to
 programmatically compute summary statistics, use the agg function instead.
 
   ds.describe("age", "height").show()
   // output:
   // summary age   height
   // count   10.0  10.0
   // mean    53.3  178.05
   // stddev  11.6  15.7
   // min     18.0  163.0
   // max     92.0  192.0
 
 
 Use summary for expanded statistics and control over which statistics to compute.
 
cols - Columns to compute statistics on.
 public Dataset<Row> summary(String... statistics)
- count - mean - stddev - min - max - arbitrary approximate percentiles specified as a percentage (eg, 75%)
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
 This function is meant for exploratory data analysis, as we make no guarantee about the
 backward compatibility of the schema of the resulting Dataset. If you want to
 programmatically compute summary statistics, use the agg function instead.
 
   ds.summary().show()
   // output:
   // summary age   height
   // count   10.0  10.0
   // mean    53.3  178.05
   // stddev  11.6  15.7
   // min     18.0  163.0
   // 25%     24.0  176.0
   // 50%     24.0  176.0
   // 75%     32.0  180.0
   // max     92.0  192.0
 
 
   ds.summary("count", "min", "25%", "75%", "max").show()
   // output:
   // summary age   height
   // count   10.0  10.0
   // min     18.0  163.0
   // 25%     24.0  176.0
   // 75%     32.0  180.0
   // max     92.0  192.0
 
 To do a summary for specific columns first select them:
   ds.select("age", "height").summary().show()
 
 
 See also describe for basic statistics.
 
statistics - Statistics from above list to be computed.
 public Dataset<T> repartition(int numPartitions, Column... partitionExprs)
numPartitions. The resulting Dataset is hash partitioned.
 This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
numPartitions - (undocumented)partitionExprs - (undocumented)public Dataset<T> repartition(Column... partitionExprs)
spark.sql.shuffle.partitions as number of partitions.
 The resulting Dataset is hash partitioned.
 This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
partitionExprs - (undocumented)public Dataset<T> repartitionByRange(int numPartitions, Column... partitionExprs)
numPartitions. The resulting Dataset is range partitioned.
 At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
numPartitions - (undocumented)partitionExprs - (undocumented)public Dataset<T> repartitionByRange(Column... partitionExprs)
spark.sql.shuffle.partitions as number of partitions.
 The resulting Dataset is range partitioned.
 At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
partitionExprs - (undocumented)public SparkSession sparkSession()
public org.apache.spark.sql.execution.QueryExecution queryExecution()
public scala.reflect.ClassTag<T> classTag()
public SQLContext sqlContext()
public String toString()
toString in class Objectpublic Dataset<Row> toDF()
Row
 objects that allow fields to be accessed by ordinal or name.
 public <U> Dataset<U> as(Encoder<U> evidence$2)
U:
  - When U is a class, fields for the class will be mapped to columns of the same name
    (case sensitivity is determined by spark.sql.caseSensitive).
  - When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will
    be assigned to _1).
  - When U is a primitive type (i.e. String, Int, etc), then the first column of the
    DataFrame will be used.
 
 If the schema of the Dataset does not match the desired U type, you can use select
 along with alias or as to rearrange or rename as required.
 
 Note that as[] only changes the view of the data that is passed into typed operations,
 such as map(), and does not eagerly project away any columns that are not present in
 the specified class.
 
evidence$2 - (undocumented)public Dataset<Row> toDF(scala.collection.Seq<String> colNames)
DataFrame with columns renamed.
 This can be quite convenient in conversion from an RDD of tuples into a DataFrame with
 meaningful names. For example:
 
   val rdd: RDD[(Int, String)] = ...
   rdd.toDF()  // this implicit conversion creates a DataFrame with column name `_1` and `_2`
   rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
 
 colNames - (undocumented)public StructType schema()
public void printSchema()
public void explain(boolean extended)
extended - (undocumented)public void explain()
public scala.Tuple2<String,String>[] dtypes()
public String[] columns()
public boolean isLocal()
collect and take methods can be run locally
 (without any Spark executors).
 public boolean isEmpty()
Dataset is empty.
 public boolean isStreaming()
StreamingQuery using the start() method in
 DataStreamWriter. Methods that return a single answer, e.g. count() or
 collect(), will throw an AnalysisException when there is a streaming
 source present.
 public Dataset<T> checkpoint()
SparkContext#setCheckpointDir.
 public Dataset<T> checkpoint(boolean eager)
SparkContext#setCheckpointDir.
 eager - (undocumented)public Dataset<T> localCheckpoint()
public Dataset<T> localCheckpoint(boolean eager)
eager - (undocumented)public Dataset<T> withWatermark(String eventTime, String delayThreshold)
Dataset. A watermark tracks a point in time
 before which we assume no more late data is going to arrive.
 
 Spark will use this watermark for several purposes:
  - To know when a given time window aggregation can be finalized and thus can be emitted when
    using output modes that do not allow updates.
  - To minimize the amount of state that we need to keep for on-going aggregations,
    mapGroupsWithState and dropDuplicates operators.
 
  The current watermark is computed by looking at the MAX(eventTime) seen across
  all of the partitions in the query minus a user specified delayThreshold.  Due to the cost
  of coordinating this value across partitions, the actual watermark used is only guaranteed
  to be at least delayThreshold behind the actual event time.  In some cases we may still
  process records that arrive more than delayThreshold late.
 
eventTime - the name of the column that contains the event time of the row.delayThreshold - the minimum delay to wait to data to arrive late, relative to the latest
                       record that has been processed in the form of an interval
                       (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.
 public void show(int numRows)
   year  month AVG('Adj Close) MAX('Adj Close)
   1980  12    0.503218        0.595103
   1981  01    0.523289        0.570307
   1982  02    0.436504        0.475256
   1983  03    0.410516        0.442194
   1984  04    0.450090        0.483521
 
 numRows - Number of rows to show
 public void show()
public void show(boolean truncate)
truncate - Whether truncate long strings. If true, strings more than 20 characters will
                 be truncated and all cells will be aligned right
 public void show(int numRows,
                 boolean truncate)
   year  month AVG('Adj Close) MAX('Adj Close)
   1980  12    0.503218        0.595103
   1981  01    0.523289        0.570307
   1982  02    0.436504        0.475256
   1983  03    0.410516        0.442194
   1984  04    0.450090        0.483521
 numRows - Number of rows to showtruncate - Whether truncate long strings. If true, strings more than 20 characters will
              be truncated and all cells will be aligned right
 public void show(int numRows,
                 int truncate)
   year  month AVG('Adj Close) MAX('Adj Close)
   1980  12    0.503218        0.595103
   1981  01    0.523289        0.570307
   1982  02    0.436504        0.475256
   1983  03    0.410516        0.442194
   1984  04    0.450090        0.483521
 
 numRows - Number of rows to showtruncate - If set to more than 0, truncates strings to truncate characters and
                    all cells will be aligned right.public void show(int numRows,
                 int truncate,
                 boolean vertical)
   year  month AVG('Adj Close) MAX('Adj Close)
   1980  12    0.503218        0.595103
   1981  01    0.523289        0.570307
   1982  02    0.436504        0.475256
   1983  03    0.410516        0.442194
   1984  04    0.450090        0.483521
 
 
 If vertical enabled, this command prints output rows vertically (one line per column value)?
 
 -RECORD 0-------------------
  year            | 1980
  month           | 12
  AVG('Adj Close) | 0.503218
  AVG('Adj Close) | 0.595103
 -RECORD 1-------------------
  year            | 1981
  month           | 01
  AVG('Adj Close) | 0.523289
  AVG('Adj Close) | 0.570307
 -RECORD 2-------------------
  year            | 1982
  month           | 02
  AVG('Adj Close) | 0.436504
  AVG('Adj Close) | 0.475256
 -RECORD 3-------------------
  year            | 1983
  month           | 03
  AVG('Adj Close) | 0.410516
  AVG('Adj Close) | 0.442194
 -RECORD 4-------------------
  year            | 1984
  month           | 04
  AVG('Adj Close) | 0.450090
  AVG('Adj Close) | 0.483521
 
 numRows - Number of rows to showtruncate - If set to more than 0, truncates strings to truncate characters and
                    all cells will be aligned right.vertical - If set to true, prints output rows vertically (one line per column value).public DataFrameNaFunctions na()
DataFrameNaFunctions for working with missing data.
 
   // Dropping rows containing any null values.
   ds.na.drop()
 
 public DataFrameStatFunctions stat()
DataFrameStatFunctions for working statistic functions support.
 
   // Finding frequent items in column with name 'a'.
   ds.stat.freqItems(Seq("a"))
 
 public Dataset<Row> join(Dataset<?> right)
DataFrame.
 Behaves as an INNER JOIN and requires a subsequent join predicate.
right - Right side of the join operation.
 public Dataset<Row> join(Dataset<?> right, String usingColumn)
DataFrame using the given column.
 
 Different from other join functions, the join column will only appear once in the output,
 i.e. similar to SQL's JOIN USING syntax.
 
   // Joining df1 and df2 using the column "user_id"
   df1.join(df2, "user_id")
 
 right - Right side of the join operation.usingColumn - Name of the column to join on. This column must exist on both sides.
 DataFrames, you will NOT be able to reference any columns after the join, since
 there is no way to disambiguate which side of the join you would like to reference.
 public Dataset<Row> join(Dataset<?> right, scala.collection.Seq<String> usingColumns)
DataFrame using the given columns.
 
 Different from other join functions, the join columns will only appear once in the output,
 i.e. similar to SQL's JOIN USING syntax.
 
   // Joining df1 and df2 using the columns "user_id" and "user_name"
   df1.join(df2, Seq("user_id", "user_name"))
 
 right - Right side of the join operation.usingColumns - Names of the columns to join on. This columns must exist on both sides.
 DataFrames, you will NOT be able to reference any columns after the join, since
 there is no way to disambiguate which side of the join you would like to reference.
 public Dataset<Row> join(Dataset<?> right, scala.collection.Seq<String> usingColumns, String joinType)
DataFrame using the given columns. A cross join with a predicate
 is specified as an inner join. If you would explicitly like to perform a cross join use the
 crossJoin method.
 
 Different from other join functions, the join columns will only appear once in the output,
 i.e. similar to SQL's JOIN USING syntax.
 
right - Right side of the join operation.usingColumns - Names of the columns to join on. This columns must exist on both sides.joinType - Type of join to perform. Default inner. Must be one of:
                 inner, cross, outer, full, full_outer, left, left_outer,
                 right, right_outer, left_semi, left_anti.
 DataFrames, you will NOT be able to reference any columns after the join, since
 there is no way to disambiguate which side of the join you would like to reference.
 public Dataset<Row> join(Dataset<?> right, Column joinExprs)
DataFrame, using the given join expression.
 
   // The following two are equivalent:
   df1.join(df2, $"df1Key" === $"df2Key")
   df1.join(df2).where($"df1Key" === $"df2Key")
 
 right - (undocumented)joinExprs - (undocumented)public Dataset<Row> join(Dataset<?> right, Column joinExprs, String joinType)
DataFrame, using the given join expression. The following performs
 a full outer join between df1 and df2.
 
   // Scala:
   import org.apache.spark.sql.functions._
   df1.join(df2, $"df1Key" === $"df2Key", "outer")
   // Java:
   import static org.apache.spark.sql.functions.*;
   df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
 
 right - Right side of the join.joinExprs - Join expression.joinType - Type of join to perform. Default inner. Must be one of:
                 inner, cross, outer, full, full_outer, left, left_outer,
                 right, right_outer, left_semi, left_anti.
 public Dataset<Row> crossJoin(Dataset<?> right)
DataFrame.
 right - Right side of the join operation.
 public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition, String joinType)
Tuple2 for each pair where condition evaluates to
 true.
 
 This is similar to the relation join function with one important difference in the
 result schema. Since joinWith preserves objects present on either side of the join, the
 result schema is similarly nested into a tuple under the column names _1 and _2.
 
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
other - Right side of the join.condition - Join expression.joinType - Type of join to perform. Default inner. Must be one of:
                 inner, cross, outer, full, full_outer, left, left_outer,
                 right, right_outer.
 public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition)
Tuple2 for each pair
 where condition evaluates to true.
 other - Right side of the join.condition - Join expression.
 public Dataset<T> sortWithinPartitions(String sortCol, scala.collection.Seq<String> sortCols)
This is the same operation as "SORT BY" in SQL (Hive QL).
sortCol - (undocumented)sortCols - (undocumented)public Dataset<T> sortWithinPartitions(scala.collection.Seq<Column> sortExprs)
This is the same operation as "SORT BY" in SQL (Hive QL).
sortExprs - (undocumented)public Dataset<T> sort(String sortCol, scala.collection.Seq<String> sortCols)
   // The following 3 are equivalent
   ds.sort("sortcol")
   ds.sort($"sortcol")
   ds.sort($"sortcol".asc)
 
 sortCol - (undocumented)sortCols - (undocumented)public Dataset<T> sort(scala.collection.Seq<Column> sortExprs)
   ds.sort($"col1", $"col2".desc)
 
 sortExprs - (undocumented)public Dataset<T> orderBy(String sortCol, scala.collection.Seq<String> sortCols)
sort function.
 sortCol - (undocumented)sortCols - (undocumented)public Dataset<T> orderBy(scala.collection.Seq<Column> sortExprs)
sort function.
 sortExprs - (undocumented)public Column apply(String colName)
Column.
 colName - (undocumented)a.b.
 public Dataset<T> hint(String name, scala.collection.Seq<Object> parameters)
   df1.join(df2.hint("broadcast"))
 
 name - (undocumented)parameters - (undocumented)public Column col(String colName)
Column.
 colName - (undocumented)a.b.
 public Column colRegex(String colName)
Column.colName - (undocumented)public Dataset<T> as(String alias)
alias - (undocumented)public Dataset<T> as(scala.Symbol alias)
alias - (undocumented)public Dataset<T> alias(String alias)
as.
 alias - (undocumented)public Dataset<T> alias(scala.Symbol alias)
as.
 alias - (undocumented)public Dataset<Row> select(scala.collection.Seq<Column> cols)
   ds.select($"colA", $"colB" + 1)
 
 cols - (undocumented)public Dataset<Row> select(String col, scala.collection.Seq<String> cols)
select that can only select
 existing columns using column names (i.e. cannot construct expressions).
 
   // The following two are equivalent:
   ds.select("colA", "colB")
   ds.select($"colA", $"colB")
 
 col - (undocumented)cols - (undocumented)public Dataset<Row> selectExpr(scala.collection.Seq<String> exprs)
select that accepts
 SQL expressions.
 
   // The following are equivalent:
   ds.selectExpr("colA", "colB as newName", "abs(colC)")
   ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
 
 exprs - (undocumented)public <U1> Dataset<U1> select(TypedColumn<T,U1> c1)
Column expression for each element.
 
   val ds = Seq(1, 2, 3).toDS()
   val newDS = ds.select(expr("value + 1").as[Int])
 
 c1 - (undocumented)public <U1,U2> Dataset<scala.Tuple2<U1,U2>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2)
Column expressions for each element.
 c1 - (undocumented)c2 - (undocumented)public <U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3)
Column expressions for each element.
 c1 - (undocumented)c2 - (undocumented)c3 - (undocumented)public <U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4)
Column expressions for each element.
 c1 - (undocumented)c2 - (undocumented)c3 - (undocumented)c4 - (undocumented)public <U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5)
Column expressions for each element.
 c1 - (undocumented)c2 - (undocumented)c3 - (undocumented)c4 - (undocumented)c5 - (undocumented)public Dataset<T> filter(Column condition)
   // The following are equivalent:
   peopleDs.filter($"age" > 15)
   peopleDs.where($"age" > 15)
 
 condition - (undocumented)public Dataset<T> filter(String conditionExpr)
   peopleDs.filter("age > 15")
 
 conditionExpr - (undocumented)public Dataset<T> where(Column condition)
filter.
 
   // The following are equivalent:
   peopleDs.filter($"age" > 15)
   peopleDs.where($"age" > 15)
 
 condition - (undocumented)public Dataset<T> where(String conditionExpr)
   peopleDs.where("age > 15")
 
 conditionExpr - (undocumented)public RelationalGroupedDataset groupBy(scala.collection.Seq<Column> cols)
RelationalGroupedDataset for all the available aggregate functions.
 
   // Compute the average for all numeric columns grouped by department.
   ds.groupBy($"department").avg()
   // Compute the max age and average salary, grouped by department and gender.
   ds.groupBy($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 
 cols - (undocumented)public RelationalGroupedDataset rollup(scala.collection.Seq<Column> cols)
RelationalGroupedDataset for all the available aggregate functions.
 
   // Compute the average for all numeric columns rolluped by department and group.
   ds.rollup($"department", $"group").avg()
   // Compute the max age and average salary, rolluped by department and gender.
   ds.rollup($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 
 cols - (undocumented)public RelationalGroupedDataset cube(scala.collection.Seq<Column> cols)
RelationalGroupedDataset for all the available aggregate functions.
 
   // Compute the average for all numeric columns cubed by department and group.
   ds.cube($"department", $"group").avg()
   // Compute the max age and average salary, cubed by department and gender.
   ds.cube($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 
 cols - (undocumented)public RelationalGroupedDataset groupBy(String col1, scala.collection.Seq<String> cols)
RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
   // Compute the average for all numeric columns grouped by department.
   ds.groupBy("department").avg()
   // Compute the max age and average salary, grouped by department and gender.
   ds.groupBy($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 col1 - (undocumented)cols - (undocumented)public T reduce(scala.Function2<T,T,T> func)
func
 must be commutative and associative or the result may be non-deterministic.
 func - (undocumented)public T reduce(ReduceFunction<T> func)
func
 must be commutative and associative or the result may be non-deterministic.
 func - (undocumented)public <K> KeyValueGroupedDataset<K,T> groupByKey(scala.Function1<T,K> func, Encoder<K> evidence$3)
KeyValueGroupedDataset where the data is grouped by the given key func.
 func - (undocumented)evidence$3 - (undocumented)public <K> KeyValueGroupedDataset<K,T> groupByKey(MapFunction<T,K> func, Encoder<K> encoder)
KeyValueGroupedDataset where the data is grouped by the given key func.
 func - (undocumented)encoder - (undocumented)public RelationalGroupedDataset rollup(String col1, scala.collection.Seq<String> cols)
RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
   // Compute the average for all numeric columns rolluped by department and group.
   ds.rollup("department", "group").avg()
   // Compute the max age and average salary, rolluped by department and gender.
   ds.rollup($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 
 col1 - (undocumented)cols - (undocumented)public RelationalGroupedDataset cube(String col1, scala.collection.Seq<String> cols)
RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
   // Compute the average for all numeric columns cubed by department and group.
   ds.cube("department", "group").avg()
   // Compute the max age and average salary, cubed by department and gender.
   ds.cube($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 col1 - (undocumented)cols - (undocumented)public Dataset<Row> agg(scala.Tuple2<String,String> aggExpr, scala.collection.Seq<scala.Tuple2<String,String>> aggExprs)
   // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
   ds.agg("age" -> "max", "salary" -> "avg")
   ds.groupBy().agg("age" -> "max", "salary" -> "avg")
 
 aggExpr - (undocumented)aggExprs - (undocumented)public Dataset<Row> agg(scala.collection.immutable.Map<String,String> exprs)
   // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
   ds.agg(Map("age" -> "max", "salary" -> "avg"))
   ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
 
 exprs - (undocumented)public Dataset<Row> agg(java.util.Map<String,String> exprs)
   // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
   ds.agg(Map("age" -> "max", "salary" -> "avg"))
   ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
 
 exprs - (undocumented)public Dataset<Row> agg(Column expr, scala.collection.Seq<Column> exprs)
   // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
   ds.agg(max($"age"), avg($"salary"))
   ds.groupBy().agg(max($"age"), avg($"salary"))
 
 expr - (undocumented)exprs - (undocumented)public Dataset<T> limit(int n)
n rows. The difference between this function
 and head is that head is an action and returns an array (by triggering query execution)
 while limit returns a new Dataset.
 n - (undocumented)public Dataset<T> unionAll(Dataset<T> other)
 This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does
 deduplication of elements), use this function followed by a distinct.
 
Also as standard in SQL, this function resolves columns by position (not by name).
other - (undocumented)public Dataset<T> union(Dataset<T> other)
 This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does
 deduplication of elements), use this function followed by a distinct.
 
Also as standard in SQL, this function resolves columns by position (not by name):
   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
   val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
   df1.union(df2).show
   // output:
   // +----+----+----+
   // |col0|col1|col2|
   // +----+----+----+
   // |   1|   2|   3|
   // |   4|   5|   6|
   // +----+----+----+
 
 
 Notice that the column positions in the schema aren't necessarily matched with the
 fields in the strongly typed objects in a Dataset. This function resolves columns
 by their positions in the schema, not the fields in the strongly typed objects. Use
 unionByName to resolve columns by field name in the typed objects.
 
other - (undocumented)public Dataset<T> unionByName(Dataset<T> other)
 This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set
 union (that does deduplication of elements), use this function followed by a distinct.
 
 The difference between this function and union is that this function
 resolves columns by name (not by position):
 
   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
   val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
   df1.unionByName(df2).show
   // output:
   // +----+----+----+
   // |col0|col1|col2|
   // +----+----+----+
   // |   1|   2|   3|
   // |   6|   4|   5|
   // +----+----+----+
 
 other - (undocumented)public Dataset<T> intersect(Dataset<T> other)
INTERSECT in SQL.
 other - (undocumented)equals function defined on T.
 public Dataset<T> intersectAll(Dataset<T> other)
INTERSECT ALL in SQL.
 other - (undocumented)equals function defined on T. Also as standard
 in SQL, this function resolves columns by position (not by name).
 public Dataset<T> except(Dataset<T> other)
EXCEPT DISTINCT in SQL.
 other - (undocumented)equals function defined on T.
 public Dataset<T> exceptAll(Dataset<T> other)
EXCEPT ALL in SQL.
 other - (undocumented)equals function defined on T. Also as standard in
 SQL, this function resolves columns by position (not by name).
 public Dataset<T> sample(double fraction, long seed)
Dataset by sampling a fraction of rows (without replacement),
 using a user-supplied seed.
 fraction - Fraction of rows to generate, range [0.0, 1.0].seed - Seed for sampling.
 Dataset.
 public Dataset<T> sample(double fraction)
Dataset by sampling a fraction of rows (without replacement),
 using a random seed.
 fraction - Fraction of rows to generate, range [0.0, 1.0].
 Dataset.
 public Dataset<T> sample(boolean withReplacement, double fraction, long seed)
Dataset by sampling a fraction of rows, using a user-supplied seed.
 withReplacement - Sample with replacement or not.fraction - Fraction of rows to generate, range [0.0, 1.0].seed - Seed for sampling.
 Dataset.
 public Dataset<T> sample(boolean withReplacement, double fraction)
Dataset by sampling a fraction of rows, using a random seed.
 withReplacement - Sample with replacement or not.fraction - Fraction of rows to generate, range [0.0, 1.0].
 Dataset.
 public Dataset<T>[] randomSplit(double[] weights, long seed)
weights - weights for splits, will be normalized if they don't sum to 1.seed - Seed for sampling.
 
 For Java API, use randomSplitAsList.
 
public java.util.List<Dataset<T>> randomSplitAsList(double[] weights, long seed)
weights - weights for splits, will be normalized if they don't sum to 1.seed - Seed for sampling.
 public Dataset<T>[] randomSplit(double[] weights)
weights - weights for splits, will be normalized if they don't sum to 1.public <A extends scala.Product> Dataset<Row> explode(scala.collection.Seq<Column> input, scala.Function1<Row,scala.collection.TraversableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$4)
LATERAL VIEW in HiveQL. The columns of
 the input row are implicitly joined with each row that is output by the function.
 
 Given that this is deprecated, as an alternative, you can explode columns either using
 functions.explode() or flatMap(). The following example uses these alternatives to count
 the number of books that contain a given word:
 
   case class Book(title: String, words: String)
   val ds: Dataset[Book]
   val allWords = ds.select('title, explode(split('words, " ")).as("word"))
   val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
 
 
 Using flatMap() this can similarly be exploded as:
 
   ds.flatMap(_.words.split(" "))
 
 input - (undocumented)f - (undocumented)evidence$4 - (undocumented)public <A,B> Dataset<Row> explode(String inputColumn, String outputColumn, scala.Function1<A,scala.collection.TraversableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$5)
LATERAL VIEW in HiveQL. All
 columns of the input row are implicitly joined with each value that is output by the function.
 
 Given that this is deprecated, as an alternative, you can explode columns either using
 functions.explode():
 
   ds.select(explode(split('words, " ")).as("word"))
 
 
 or flatMap():
 
   ds.flatMap(_.words.split(" "))
 
 inputColumn - (undocumented)outputColumn - (undocumented)f - (undocumented)evidence$5 - (undocumented)public Dataset<Row> withColumn(String colName, Column col)
 column's expression must only refer to attributes supplied by this Dataset. It is an
 error to add a column that refers to some other Dataset.
 
colName - (undocumented)col - (undocumented)public Dataset<Row> withColumnRenamed(String existingName, String newName)
existingName - (undocumented)newName - (undocumented)public Dataset<Row> drop(String colName)
This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.
colName - (undocumented)public Dataset<Row> drop(scala.collection.Seq<String> colNames)
This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.
colNames - (undocumented)public Dataset<Row> drop(Column col)
Column rather than a name.
 This is a no-op if the Dataset doesn't have a column
 with an equivalent expression.
 col - (undocumented)public Dataset<T> dropDuplicates()
distinct.
 
 For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it
 will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 withWatermark to limit how late the duplicate data can be and system will accordingly limit
 the state. In addition, too late data older than watermark will be dropped to avoid any
 possibility of duplicates.
 
public Dataset<T> dropDuplicates(scala.collection.Seq<String> colNames)
 For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it
 will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 withWatermark to limit how late the duplicate data can be and system will accordingly limit
 the state. In addition, too late data older than watermark will be dropped to avoid any
 possibility of duplicates.
 
colNames - (undocumented)public Dataset<T> dropDuplicates(String[] colNames)
 For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it
 will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 withWatermark to limit how late the duplicate data can be and system will accordingly limit
 the state. In addition, too late data older than watermark will be dropped to avoid any
 possibility of duplicates.
 
colNames - (undocumented)public Dataset<T> dropDuplicates(String col1, scala.collection.Seq<String> cols)
Dataset with duplicate rows removed, considering only
 the subset of columns.
 
 For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it
 will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 withWatermark to limit how late the duplicate data can be and system will accordingly limit
 the state. In addition, too late data older than watermark will be dropped to avoid any
 possibility of duplicates.
 
col1 - (undocumented)cols - (undocumented)public Dataset<Row> describe(scala.collection.Seq<String> cols)
 This function is meant for exploratory data analysis, as we make no guarantee about the
 backward compatibility of the schema of the resulting Dataset. If you want to
 programmatically compute summary statistics, use the agg function instead.
 
   ds.describe("age", "height").show()
   // output:
   // summary age   height
   // count   10.0  10.0
   // mean    53.3  178.05
   // stddev  11.6  15.7
   // min     18.0  163.0
   // max     92.0  192.0
 
 
 Use summary for expanded statistics and control over which statistics to compute.
 
cols - Columns to compute statistics on.
 public Dataset<Row> summary(scala.collection.Seq<String> statistics)
- count - mean - stddev - min - max - arbitrary approximate percentiles specified as a percentage (eg, 75%)
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
 This function is meant for exploratory data analysis, as we make no guarantee about the
 backward compatibility of the schema of the resulting Dataset. If you want to
 programmatically compute summary statistics, use the agg function instead.
 
   ds.summary().show()
   // output:
   // summary age   height
   // count   10.0  10.0
   // mean    53.3  178.05
   // stddev  11.6  15.7
   // min     18.0  163.0
   // 25%     24.0  176.0
   // 50%     24.0  176.0
   // 75%     32.0  180.0
   // max     92.0  192.0
 
 
   ds.summary("count", "min", "25%", "75%", "max").show()
   // output:
   // summary age   height
   // count   10.0  10.0
   // min     18.0  163.0
   // 25%     24.0  176.0
   // 75%     32.0  180.0
   // max     92.0  192.0
 
 To do a summary for specific columns first select them:
   ds.select("age", "height").summary().show()
 
 
 See also describe for basic statistics.
 
statistics - Statistics from above list to be computed.
 public Object head(int n)
n rows.
 n - (undocumented)public T head()
public T first()
public <U> Dataset<U> transform(scala.Function1<Dataset<T>,Dataset<U>> t)
   def featurize(ds: Dataset[T]): Dataset[U] = ...
   ds
     .transform(featurize)
     .transform(...)
 
 t - (undocumented)public Dataset<T> filter(scala.Function1<T,Object> func)
func returns true.
 func - (undocumented)public Dataset<T> filter(FilterFunction<T> func)
func returns true.
 func - (undocumented)public <U> Dataset<U> map(scala.Function1<T,U> func, Encoder<U> evidence$6)
func to each element.
 func - (undocumented)evidence$6 - (undocumented)public <U> Dataset<U> map(MapFunction<T,U> func, Encoder<U> encoder)
func to each element.
 func - (undocumented)encoder - (undocumented)public <U> Dataset<U> mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$7)
func to each partition.
 func - (undocumented)evidence$7 - (undocumented)public <U> Dataset<U> mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder)
f to each partition.
 f - (undocumented)encoder - (undocumented)public <U> Dataset<U> flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func, Encoder<U> evidence$8)
func - (undocumented)evidence$8 - (undocumented)public <U> Dataset<U> flatMap(FlatMapFunction<T,U> f, Encoder<U> encoder)
f - (undocumented)encoder - (undocumented)public void foreach(scala.Function1<T,scala.runtime.BoxedUnit> f)
f to all rows.
 f - (undocumented)public void foreach(ForeachFunction<T> func)
func on each element of this Dataset.
 func - (undocumented)public void foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)
f to each partition of this Dataset.
 f - (undocumented)public void foreachPartition(ForeachPartitionFunction<T> func)
func on each partition of this Dataset.
 func - (undocumented)public Object take(int n)
n rows in the Dataset.
 
 Running take requires moving data into the application's driver process, and doing so with
 a very large n can crash the driver process with OutOfMemoryError.
 
n - (undocumented)public java.util.List<T> takeAsList(int n)
n rows in the Dataset as a list.
 
 Running take requires moving data into the application's driver process, and doing so with
 a very large n can crash the driver process with OutOfMemoryError.
 
n - (undocumented)public Object collect()
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
 For Java API, use collectAsList.
 
public java.util.List<T> collectAsList()
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
public java.util.Iterator<T> toLocalIterator()
The iterator will consume as much memory as the largest partition in this Dataset.
public long count()
public Dataset<T> repartition(int numPartitions)
numPartitions partitions.
 numPartitions - (undocumented)public Dataset<T> repartition(int numPartitions, scala.collection.Seq<Column> partitionExprs)
numPartitions. The resulting Dataset is hash partitioned.
 This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
numPartitions - (undocumented)partitionExprs - (undocumented)public Dataset<T> repartition(scala.collection.Seq<Column> partitionExprs)
spark.sql.shuffle.partitions as number of partitions.
 The resulting Dataset is hash partitioned.
 This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
partitionExprs - (undocumented)public Dataset<T> repartitionByRange(int numPartitions, scala.collection.Seq<Column> partitionExprs)
numPartitions. The resulting Dataset is range partitioned.
 At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
numPartitions - (undocumented)partitionExprs - (undocumented)public Dataset<T> repartitionByRange(scala.collection.Seq<Column> partitionExprs)
spark.sql.shuffle.partitions as number of partitions.
 The resulting Dataset is range partitioned.
 At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
partitionExprs - (undocumented)public Dataset<T> coalesce(int numPartitions)
numPartitions partitions, when the fewer partitions
 are requested. If a larger number of partitions is requested, it will stay at the current
 number of partitions. Similar to coalesce defined on an RDD, this operation results in
 a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not
 be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
 However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
numPartitions - (undocumented)public Dataset<T> distinct()
dropDuplicates.
 equals function defined on T.
 public Dataset<T> persist()
MEMORY_AND_DISK).
 public Dataset<T> cache()
MEMORY_AND_DISK).
 public Dataset<T> persist(StorageLevel newLevel)
newLevel - One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER,
                 MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2,
                 MEMORY_AND_DISK_2, etc.
 public StorageLevel storageLevel()
public Dataset<T> unpersist(boolean blocking)
blocking - Whether to block until all blocks are deleted.
 public Dataset<T> unpersist()
public RDD<T> rdd()
RDD of T.
 public JavaRDD<T> toJavaRDD()
JavaRDD of Ts.public JavaRDD<T> javaRDD()
JavaRDD of Ts.public void registerTempTable(String tableName)
SparkSession that was used to create this Dataset.
 tableName - (undocumented)public void createTempView(String viewName)
                    throws AnalysisException
SparkSession that was used to create this Dataset.
 
 Local temporary view is session-scoped. Its lifetime is the lifetime of the session that
 created it, i.e. it will be automatically dropped when the session terminates. It's not
 tied to any databases, i.e. we can't use db1.view1 to reference a local temporary view.
 
viewName - (undocumented)AnalysisException - if the view name is invalid or already exists
 public void createOrReplaceTempView(String viewName)
SparkSession that was used to create this Dataset.
 viewName - (undocumented)public void createGlobalTempView(String viewName)
                          throws AnalysisException
 Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application,
 i.e. it will be automatically dropped when the application terminates. It's tied to a system
 preserved database global_temp, and we must use the qualified name to refer a global temp
 view, e.g. SELECT * FROM global_temp.view1.
 
viewName - (undocumented)AnalysisException - if the view name is invalid or already exists
 public void createOrReplaceGlobalTempView(String viewName)
 Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application,
 i.e. it will be automatically dropped when the application terminates. It's tied to a system
 preserved database global_temp, and we must use the qualified name to refer a global temp
 view, e.g. SELECT * FROM global_temp.view1.
 
viewName - (undocumented)public DataFrameWriter<T> write()
public DataStreamWriter<T> writeStream()
public Dataset<String> toJSON()
public String[] inputFiles()