Dataset

java.lang.Object
- org.apache.spark.sql.Dataset<T>

All Implemented Interfaces:

java.io.Serializable
```
public class Dataset<T>
extends java.lang.Object
implements scala.Serializable
```
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations produce new Datasets, and actions trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, and writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the explain function.
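The laziness described above can be seen in a small Scala sketch. It assumes a local `SparkSession` (the `local[*]` master and the object name `LazyDatasetSketch` are illustrative choices, not part of this API page) and uses `spark.range` only as a convenient data source.

```scala
import org.apache.spark.sql.SparkSession

object LazyDatasetSketch {
  def run(): Long = {
    // Assumption: a local session is sufficient for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-dataset-sketch")
      .getOrCreate()

    // range and filter are transformations: they only extend the logical plan.
    val evens = spark.range(0, 1000).filter("id % 2 = 0")

    // Prints the logical plans and the physical plan without running a job.
    evens.explain(true)

    // count is an action, so this is where the computation is triggered.
    evens.count()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Under these assumptions, nothing is computed until `count` is called; `explain(true)` only prints the plans.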
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficient data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
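As a rough illustration of the encoder and schema relationship, the following Scala sketch builds a Dataset from a hypothetical `Person` case class with an explicitly created encoder. The object name, the sample values, and the local master are assumptions made for the example.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object EncoderSketch {
  // Hypothetical domain class mirroring the Person example in the text.
  case class Person(name: String, age: Int)

  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("encoder-sketch")
      .getOrCreate()

    // The encoder describes how Person maps to Spark's internal types.
    val personEncoder: Encoder[Person] = Encoders.product[Person]

    val people = spark.createDataset(
      Seq(Person("Ann", 31), Person("Bob", 25)))(personEncoder)

    // Prints the column names, types, and nullability derived by the encoder.
    people.printSchema()

    people.schema.fieldNames.toSeq
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In Scala code the encoder is usually supplied implicitly through `import spark.implicits._`; it is passed explicitly here only to make its role visible.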
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
```
 val people = spark.read.parquet("...").as[Person] // Scala
 Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
 
```
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a map to the existing one:
```
 val names = people.map(_.name) // in Scala; names is a Dataset[String]
 Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING()); // in Java 8
 
```
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and the col method in Java.
```
 val ageCol = people("age") // in Scala
 Column ageCol = people.col("age") // in Java
 
```
Note that the Column type can also be manipulated through its various functions.
```
 // The following creates a new column that increases everybody's age by 10.
 people("age") + 10 // in Scala
 people.col("age").plus(10); // in Java
 
```
A more concrete example in Scala:
```
 // To create Dataset[Row] using SparkSession
 val people = spark.read.parquet("...")
 val department = spark.read.parquet("...")

 people.filter("age > 30")
 .join(department, people("deptId") === department("id"))
 .groupBy(department("name"), people("gender"))
 .agg(avg(people("salary")), max(people("age")))
 
```
and in Java:
```
 // To create Dataset<Row> using SparkSession
 Dataset<Row> people = spark.read().parquet("...");
 Dataset<Row> department = spark.read().parquet("...");

 people.filter(people.col("age").gt(30))
 .join(department, people.col("deptId").equalTo(department.col("id")))
 .groupBy(department.col("name"), people.col("gender"))
 .agg(avg(people.col("salary")), max(people.col("age")));
 
```
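For readers who want to try the query without Parquet files, here is a self-contained Scala sketch of the same filter, join, group, and aggregate pipeline. The in-memory rows, column values, object name, and local master are all hypothetical stand-ins for real data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, max}

object JoinAggSketch {
  def run(): DataFrame = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-agg-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows standing in for the Parquet inputs.
    val people = Seq(
      ("Ann", 35, "F", 1, 5000.0),
      ("Bob", 42, "M", 1, 4000.0),
      ("Cid", 28, "M", 2, 3000.0)
    ).toDF("name", "age", "gender", "deptId", "salary")
    val department = Seq((1, "Engineering"), (2, "Sales")).toDF("id", "name")

    people.filter("age > 30")
      .join(department, people("deptId") === department("id"))
      // Both datasets have a "name" column, so the Column is taken
      // from the dataset it belongs to.
      .groupBy(department("name"), people("gender"))
      .agg(avg(people("salary")), max(people("age")))
  }

  def main(args: Array[String]): Unit = run().show()
}
```

With this sample data the filter drops the only Sales employee, so the result contains one row per gender in the Engineering department.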
Since:

1.6.0


Constructor Summary

Constructors
Constructor and Description
`Dataset(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)`
`Dataset(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)`

Method Summary

Methods
Modifier and Type	Method and Description
`Dataset<Row>`	`agg(Column expr, Column... exprs)` Aggregates on the entire `Dataset` without groups.
`Dataset<Row>`	`agg(Column expr, scala.collection.Seq<Column> exprs)` Aggregates on the entire `Dataset` without groups.
`Dataset<Row>`	`agg(scala.collection.immutable.Map<java.lang.String,java.lang.String> exprs)` (Scala-specific) Aggregates on the entire `Dataset` without groups.
`Dataset<Row>`	`agg(java.util.Map<java.lang.String,java.lang.String> exprs)` (Java-specific) Aggregates on the entire `Dataset` without groups.
`Dataset<Row>`	`agg(scala.Tuple2<java.lang.String,java.lang.String> aggExpr, scala.collection.Seq<scala.Tuple2<java.lang.String,java.lang.String>> aggExprs)` (Scala-specific) Aggregates on the entire `Dataset` without groups.
`Dataset<T>`	`alias(java.lang.String alias)` Returns a new `Dataset` with an alias set.
`Dataset<T>`	`alias(scala.Symbol alias)` (Scala-specific) Returns a new `Dataset` with an alias set.
`Column`	`apply(java.lang.String colName)` Selects column based on the column name and return it as a `Column`.
`<U> Dataset<U>`	`as(Encoder<U> evidence$2)` :: Experimental :: Returns a new `Dataset` where each record has been mapped on to the specified type.
`Dataset<T>`	`as(java.lang.String alias)` Returns a new `Dataset` with an alias set.
`Dataset<T>`	`as(scala.Symbol alias)` (Scala-specific) Returns a new `Dataset` with an alias set.
`Dataset<T>`	`cache()` Persist this `Dataset` with the default storage level (`MEMORY_AND_DISK`).
`Dataset<T>`	`coalesce(int numPartitions)` Returns a new `Dataset` that has exactly `numPartitions` partitions.
`Column`	`col(java.lang.String colName)` Selects column based on the column name and return it as a `Column`.
`java.lang.Object`	`collect()` Returns an array that contains all rows in this `Dataset`.
`java.util.List<T>`	`collectAsList()` Returns a Java list that contains all rows in this `Dataset`.
`protected int`	`collectToPython()`
`java.lang.String[]`	`columns()` Returns all column names as an array.
`long`	`count()` Returns the number of rows in the `Dataset`.
`void`	`createOrReplaceTempView(java.lang.String viewName)` Creates a temporary view using the given name.
`void`	`createTempView(java.lang.String viewName)` Creates a temporary view using the given name.
`RelationalGroupedDataset`	`cube(Column... cols)` Create a multi-dimensional cube for the current `Dataset` using the specified columns, so we can run aggregation on them.
`RelationalGroupedDataset`	`cube(scala.collection.Seq<Column> cols)` Create a multi-dimensional cube for the current `Dataset` using the specified columns, so we can run aggregation on them.
`RelationalGroupedDataset`	`cube(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)` Create a multi-dimensional cube for the current `Dataset` using the specified columns, so we can run aggregation on them.
`RelationalGroupedDataset`	`cube(java.lang.String col1, java.lang.String... cols)` Create a multi-dimensional cube for the current `Dataset` using the specified columns, so we can run aggregation on them.
`Dataset<Row>`	`describe(scala.collection.Seq<java.lang.String> cols)` Computes statistics for numeric columns, including count, mean, stddev, min, and max.
`Dataset<Row>`	`describe(java.lang.String... cols)` Computes statistics for numeric columns, including count, mean, stddev, min, and max.
`Dataset<T>`	`distinct()` Returns a new `Dataset` that contains only the unique rows from this `Dataset`.
`Dataset<Row>`	`drop(Column col)` Returns a new `Dataset` with a column dropped.
`Dataset<Row>`	`drop(scala.collection.Seq<java.lang.String> colNames)` Returns a new `Dataset` with columns dropped.
`Dataset<Row>`	`drop(java.lang.String... colNames)` Returns a new `Dataset` with columns dropped.
`Dataset<Row>`	`drop(java.lang.String colName)` Returns a new `Dataset` with a column dropped.
`Dataset<T>`	`dropDuplicates()` Returns a new `Dataset` that contains only the unique rows from this `Dataset`.
`Dataset<T>`	`dropDuplicates(scala.collection.Seq<java.lang.String> colNames)` (Scala-specific) Returns a new `Dataset` with duplicate rows removed, considering only the subset of columns.
`Dataset<T>`	`dropDuplicates(java.lang.String[] colNames)` Returns a new `Dataset` with duplicate rows removed, considering only the subset of columns.
`scala.Tuple2<java.lang.String,java.lang.String>[]`	`dtypes()` Returns all column names and their data types as an array.
`Dataset<T>`	`except(Dataset<T> other)` Returns a new `Dataset` containing rows in this Dataset but not in another Dataset.
`void`	`explain()` Prints the physical plan to the console for debugging purposes.
`void`	`explain(boolean extended)` Prints the plans (logical and physical) to the console for debugging purposes.
`<A extends scala.Product> Dataset<Row>`	`explode(scala.collection.Seq<Column> input, scala.Function1<Row,scala.collection.TraversableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$5)` :: Experimental :: (Scala-specific) Returns a new `Dataset` where each row has been expanded to zero or more rows by the provided function.
`<A,B> Dataset<Row>`	`explode(java.lang.String inputColumn, java.lang.String outputColumn, scala.Function1<A,scala.collection.TraversableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$6)` :: Experimental :: (Scala-specific) Returns a new `Dataset` where a single column has been expanded to zero or more rows by the provided function.
`Dataset<T>`	`filter(Column condition)` Filters rows using the given condition.
`Dataset<T>`	`filter(FilterFunction<T> func)` :: Experimental :: (Java-specific) Returns a new `Dataset` that only contains elements where `func` returns `true`.
`Dataset<T>`	`filter(scala.Function1<T,java.lang.Object> func)` :: Experimental :: (Scala-specific) Returns a new `Dataset` that only contains elements where `func` returns `true`.
`Dataset<T>`	`filter(java.lang.String conditionExpr)` Filters rows using the given SQL expression.
`T`	`first()` Returns the first row.
`<U> Dataset<U>`	`flatMap(FlatMapFunction<T,U> f, Encoder<U> encoder)` :: Experimental :: (Java-specific) Returns a new `Dataset` by first applying a function to all elements of this `Dataset`, and then flattening the results.
`<U> Dataset<U>`	`flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func, Encoder<U> evidence$9)` :: Experimental :: (Scala-specific) Returns a new `Dataset` by first applying a function to all elements of this `Dataset`, and then flattening the results.
`void`	`foreach(ForeachFunction<T> func)` (Java-specific) Runs `func` on each element of this `Dataset`.
`void`	`foreach(scala.Function1<T,scala.runtime.BoxedUnit> f)` Applies a function `f` to all rows.
`void`	`foreachPartition(ForeachPartitionFunction<T> func)` (Java-specific) Runs `func` on each partition of this `Dataset`.
`void`	`foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)` Applies a function `f` to each partition of this `Dataset`.
`RelationalGroupedDataset`	`groupBy(Column... cols)` Groups the `Dataset` using the specified columns, so we can run aggregation on them.
`RelationalGroupedDataset`	`groupBy(scala.collection.Seq<Column> cols)` Groups the `Dataset` using the specified columns, so we can run aggregation on them.
`RelationalGroupedDataset`	`groupBy(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)` Groups the `Dataset` using the specified columns, so that we can run aggregation on them.
`RelationalGroupedDataset`	`groupBy(java.lang.String col1, java.lang.String... cols)` Groups the `Dataset` using the specified columns, so that we can run aggregation on them.
`<K> KeyValueGroupedDataset<K,T>`	`groupByKey(scala.Function1<T,K> func, Encoder<K> evidence$4)` :: Experimental :: (Scala-specific) Returns a `KeyValueGroupedDataset` where the data is grouped by the given key `func`.
`<K> KeyValueGroupedDataset<K,T>`	`groupByKey(MapFunction<T,K> func, Encoder<K> encoder)` :: Experimental :: (Java-specific) Returns a `KeyValueGroupedDataset` where the data is grouped by the given key `func`.
`T`	`head()` Returns the first row.
`java.lang.Object`	`head(int n)` Returns the first `n` rows.
`java.lang.String[]`	`inputFiles()` Returns a best-effort snapshot of the files that compose this Dataset.
`Dataset<T>`	`intersect(Dataset<T> other)` Returns a new `Dataset` containing rows only in both this Dataset and another Dataset.
`boolean`	`isLocal()` Returns true if the `collect` and `take` methods can be run locally (without any Spark executors).
`boolean`	`isStreaming()` Returns true if this `Dataset` contains one or more sources that continuously return data as it arrives.
`JavaRDD<T>`	`javaRDD()` Returns the content of the `Dataset` as a `JavaRDD` of `T`s.
`protected JavaRDD<byte[]>`	`javaToPython()` Converts a JavaRDD to a PythonRDD.
`Dataset<Row>`	`join(Dataset<?> right)` Cartesian join with another `DataFrame`.
`Dataset<Row>`	`join(Dataset<?> right, Column joinExprs)` Inner join with another `DataFrame`, using the given join expression.
`Dataset<Row>`	`join(Dataset<?> right, Column joinExprs, java.lang.String joinType)` Join with another `DataFrame`, using the given join expression.
`Dataset<Row>`	`join(Dataset<?> right, scala.collection.Seq<java.lang.String> usingColumns)` Inner equi-join with another `DataFrame` using the given columns.
`Dataset<Row>`	`join(Dataset<?> right, scala.collection.Seq<java.lang.String> usingColumns, java.lang.String joinType)` Equi-join with another `DataFrame` using the given columns.
`Dataset<Row>`	`join(Dataset<?> right, java.lang.String usingColumn)` Inner equi-join with another `DataFrame` using the given column.
`<U> Dataset<scala.Tuple2<T,U>>`	`joinWith(Dataset<U> other, Column condition)` :: Experimental :: Joins this `Dataset` using an inner equi-join, returning a `Tuple2` for each pair where `condition` evaluates to true.
`<U> Dataset<scala.Tuple2<T,U>>`	`joinWith(Dataset<U> other, Column condition, java.lang.String joinType)` :: Experimental :: Joins this `Dataset` returning a `Tuple2` for each pair where `condition` evaluates to true.
`Dataset<T>`	`limit(int n)` Returns a new `Dataset` by taking the first `n` rows.
`protected org.apache.spark.sql.catalyst.plans.logical.LogicalPlan`	`logicalPlan()`
`<U> Dataset<U>`	`map(scala.Function1<T,U> func, Encoder<U> evidence$7)` :: Experimental :: (Scala-specific) Returns a new `Dataset` that contains the result of applying `func` to each element.
`<U> Dataset<U>`	`map(MapFunction<T,U> func, Encoder<U> encoder)` :: Experimental :: (Java-specific) Returns a new `Dataset` that contains the result of applying `func` to each element.
`<U> Dataset<U>`	`mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$8)` :: Experimental :: (Scala-specific) Returns a new `Dataset` that contains the result of applying `func` to each partition.
`<U> Dataset<U>`	`mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder)` :: Experimental :: (Java-specific) Returns a new `Dataset` that contains the result of applying `f` to each partition.
`DataFrameNaFunctions`	`na()` Returns a `DataFrameNaFunctions` for working with missing data.
`protected scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression>`	`numericColumns()`
`static Dataset<Row>`	`ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)`
`Dataset<T>`	`orderBy(Column... sortExprs)` Returns a new `Dataset` sorted by the given expressions.
`Dataset<T>`	`orderBy(scala.collection.Seq<Column> sortExprs)` Returns a new `Dataset` sorted by the given expressions.
`Dataset<T>`	`orderBy(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)` Returns a new `Dataset` sorted by the given expressions.
`Dataset<T>`	`orderBy(java.lang.String sortCol, java.lang.String... sortCols)` Returns a new `Dataset` sorted by the given expressions.
`Dataset<T>`	`persist()` Persist this `Dataset` with the default storage level (`MEMORY_AND_DISK`).
`Dataset<T>`	`persist(StorageLevel newLevel)` Persist this `Dataset` with the given storage level.
`void`	`printSchema()` Prints the schema to the console in a nice tree format.
`org.apache.spark.sql.execution.QueryExecution`	`queryExecution()`
`Dataset<T>[]`	`randomSplit(double[] weights)` Randomly splits this `Dataset` with the provided weights.
`Dataset<T>[]`	`randomSplit(double[] weights, long seed)` Randomly splits this `Dataset` with the provided weights.
`java.util.List<Dataset<T>>`	`randomSplitAsList(double[] weights, long seed)` Returns a Java list that contains randomly split `Dataset` with the provided weights.
`RDD<T>`	`rdd()` Represents the content of the `Dataset` as an `RDD` of `T`.
`T`	`reduce(scala.Function2<T,T,T> func)` :: Experimental :: (Scala-specific) Reduces the elements of this `Dataset` using the specified binary function.
`T`	`reduce(ReduceFunction<T> func)` :: Experimental :: (Java-specific) Reduces the elements of this Dataset using the specified binary function.
`void`	`registerTempTable(java.lang.String tableName)` Deprecated. Use createOrReplaceTempView(viewName) instead. Since 2.0.0.
`Dataset<T>`	`repartition(Column... partitionExprs)` Returns a new `Dataset` partitioned by the given partitioning expressions, using `spark.sql.shuffle.partitions` as number of partitions.
`Dataset<T>`	`repartition(int numPartitions)` Returns a new `Dataset` that has exactly `numPartitions` partitions.
`Dataset<T>`	`repartition(int numPartitions, Column... partitionExprs)` Returns a new `Dataset` partitioned by the given partitioning expressions into `numPartitions`.
`Dataset<T>`	`repartition(int numPartitions, scala.collection.Seq<Column> partitionExprs)` Returns a new `Dataset` partitioned by the given partitioning expressions into `numPartitions`.
`Dataset<T>`	`repartition(scala.collection.Seq<Column> partitionExprs)` Returns a new `Dataset` partitioned by the given partitioning expressions, using `spark.sql.shuffle.partitions` as number of partitions.
`protected org.apache.spark.sql.catalyst.expressions.NamedExpression`	`resolve(java.lang.String colName)`
`RelationalGroupedDataset`	`rollup(Column... cols)` Create a multi-dimensional rollup for the current `Dataset` using the specified columns, so we can run aggregation on them.
`RelationalGroupedDataset`	`rollup(scala.collection.Seq<Column> cols)` Create a multi-dimensional rollup for the current `Dataset` using the specified columns, so we can run aggregation on them.
`RelationalGroupedDataset`	`rollup(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)` Create a multi-dimensional rollup for the current `Dataset` using the specified columns, so we can run aggregation on them.
`RelationalGroupedDataset`	`rollup(java.lang.String col1, java.lang.String... cols)` Create a multi-dimensional rollup for the current `Dataset` using the specified columns, so we can run aggregation on them.
`Dataset<T>`	`sample(boolean withReplacement, double fraction)` Returns a new `Dataset` by sampling a fraction of rows, using a random seed.
`Dataset<T>`	`sample(boolean withReplacement, double fraction, long seed)` Returns a new `Dataset` by sampling a fraction of rows.
`StructType`	`schema()` Returns the schema of this `Dataset`.
`Dataset<Row>`	`select(Column... cols)` Selects a set of column based expressions.
`Dataset<Row>`	`select(scala.collection.Seq<Column> cols)` Selects a set of column based expressions.
`Dataset<Row>`	`select(java.lang.String col, scala.collection.Seq<java.lang.String> cols)` Selects a set of columns.
`Dataset<Row>`	`select(java.lang.String col, java.lang.String... cols)` Selects a set of columns.
`<U1> Dataset<U1>`	`select(TypedColumn<T,U1> c1, Encoder<U1> evidence$3)` :: Experimental :: Returns a new `Dataset` by computing the given `Column` expression for each element.
`<U1,U2> Dataset<scala.Tuple2<U1,U2>>`	`select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2)` :: Experimental :: Returns a new `Dataset` by computing the given `Column` expressions for each element.
`<U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>>`	`select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3)` :: Experimental :: Returns a new `Dataset` by computing the given `Column` expressions for each element.
`<U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>>`	`select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4)` :: Experimental :: Returns a new `Dataset` by computing the given `Column` expressions for each element.
`<U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>>`	`select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5)` :: Experimental :: Returns a new `Dataset` by computing the given `Column` expressions for each element.
`Dataset<Row>`	`selectExpr(scala.collection.Seq<java.lang.String> exprs)` Selects a set of SQL expressions.
`Dataset<Row>`	`selectExpr(java.lang.String... exprs)` Selects a set of SQL expressions.
`protected Dataset<?>`	`selectUntyped(scala.collection.Seq<TypedColumn<?,?>> columns)` Internal helper function for building typed selects that return tuples.
`void`	`show()` Displays the top 20 rows of `Dataset` in a tabular form.
`void`	`show(boolean truncate)` Displays the top 20 rows of `Dataset` in a tabular form.
`void`	`show(int numRows)` Displays the `Dataset` in a tabular form.
`void`	`show(int numRows, boolean truncate)` Displays the `Dataset` in a tabular form.
`Dataset<T>`	`sort(Column... sortExprs)` Returns a new `Dataset` sorted by the given expressions.
`Dataset<T>`	`sort(scala.collection.Seq<Column> sortExprs)` Returns a new `Dataset` sorted by the given expressions.
`Dataset<T>`	`sort(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)` Returns a new `Dataset` sorted by the specified column, all in ascending order.
`Dataset<T>`	`sort(java.lang.String sortCol, java.lang.String... sortCols)` Returns a new `Dataset` sorted by the specified column, all in ascending order.
`Dataset<T>`	`sortWithinPartitions(Column... sortExprs)` Returns a new `Dataset` with each partition sorted by the given expressions.
`Dataset<T>`	`sortWithinPartitions(scala.collection.Seq<Column> sortExprs)` Returns a new `Dataset` with each partition sorted by the given expressions.
`Dataset<T>`	`sortWithinPartitions(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)` Returns a new `Dataset` with each partition sorted by the given expressions.
`Dataset<T>`	`sortWithinPartitions(java.lang.String sortCol, java.lang.String... sortCols)` Returns a new `Dataset` with each partition sorted by the given expressions.
`SparkSession`	`sparkSession()`
`SQLContext`	`sqlContext()`
`DataFrameStatFunctions`	`stat()` Returns a `DataFrameStatFunctions` for working with statistic functions.
`java.lang.Object`	`take(int n)` Returns the first `n` rows in the `Dataset`.
`java.util.List<T>`	`takeAsList(int n)` Returns the first `n` rows in the `Dataset` as a list.
`Dataset<Row>`	`toDF()` Converts this strongly typed collection of data to a generic `DataFrame`.
`Dataset<Row>`	`toDF(scala.collection.Seq<java.lang.String> colNames)` Converts this strongly typed collection of data to generic `DataFrame` with columns renamed.
`Dataset<Row>`	`toDF(java.lang.String... colNames)` Converts this strongly typed collection of data to generic `DataFrame` with columns renamed.
`JavaRDD<T>`	`toJavaRDD()` Returns the content of the `Dataset` as a `JavaRDD` of `T`s.
`Dataset<java.lang.String>`	`toJSON()` Returns the content of the `Dataset` as a Dataset of JSON strings.
`java.util.Iterator<T>`	`toLocalIterator()` Returns an iterator that contains all rows in this `Dataset`.
`protected int`	`toPythonIterator()`
`java.lang.String`	`toString()`
`<U> Dataset<U>`	`transform(scala.Function1<Dataset<T>,Dataset<U>> t)` Concise syntax for chaining custom transformations.
`Dataset<T>`	`union(Dataset<T> other)` Returns a new `Dataset` containing union of rows in this Dataset and another Dataset.
`Dataset<T>`	`unionAll(Dataset<T> other)` Deprecated. Use union() instead. Since 2.0.0.
`Dataset<T>`	`unpersist()` Mark the `Dataset` as non-persistent, and remove all blocks for it from memory and disk.
`Dataset<T>`	`unpersist(boolean blocking)` Mark the `Dataset` as non-persistent, and remove all blocks for it from memory and disk.
`Dataset<T>`	`where(Column condition)` Filters rows using the given condition.
`Dataset<T>`	`where(java.lang.String conditionExpr)` Filters rows using the given SQL expression.
`Dataset<Row>`	`withColumn(java.lang.String colName, Column col)` Returns a new `Dataset` by adding a column or replacing the existing column that has the same name.
`Dataset<Row>`	`withColumnRenamed(java.lang.String existingName, java.lang.String newName)` Returns a new `Dataset` with a column renamed.
`DataFrameWriter`	`write()` :: Experimental :: Interface for saving the content of the `Dataset` out into external storage or streams.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Constructor Detail
 - Dataset
```
public Dataset(SparkSession sparkSession,
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan,
 Encoder<T> encoder)
```
 - Dataset
```
public Dataset(SQLContext sqlContext,
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan,
 Encoder<T> encoder)
```
- Method Detail
 - ofRows
```
public static Dataset<Row> ofRows(SparkSession sparkSession,
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)
```
 - toDF
```
public Dataset<Row> toDF(java.lang.String... colNames)
```
 Converts this strongly typed collection of data to a generic DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:
```
 val rdd: RDD[(Int, String)] = ...
 rdd.toDF() // this implicit conversion creates a DataFrame with column names `_1` and `_2`
 rdd.toDF("id", "name") // this creates a DataFrame with column names "id" and "name"
 
```
 Parameters:
 colNames - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - sortWithinPartitions
```
public Dataset<T> sortWithinPartitions(java.lang.String sortCol,
 java.lang.String... sortCols)
```
 Returns a new Dataset with each partition sorted by the given expressions.
 This is the same operation as "SORT BY" in SQL (Hive QL).
 
 Parameters:
 sortCol - (undocumented)
 sortCols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - sortWithinPartitions
```
public Dataset<T> sortWithinPartitions(Column... sortExprs)
```
 Returns a new Dataset with each partition sorted by the given expressions.
 This is the same operation as "SORT BY" in SQL (Hive QL).
 
 Parameters:
 sortExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
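 To make the per-partition behaviour concrete, the Scala sketch below sorts a small Dataset within each of two partitions and returns the contents of every partition. The sample values, the partition count, the object name, and the local master are assumptions for illustration; `value` is the default column name Spark gives a Dataset of primitives.

```scala
import org.apache.spark.sql.SparkSession

object SortWithinPartitionsSketch {
  def run(): Seq[Seq[Int]] = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sort-within-partitions-sketch")
      .getOrCreate()
    import spark.implicits._

    // Spread six values over two partitions.
    val ds = Seq(5, 3, 8, 1, 9, 2).toDS().repartition(2)

    // Each partition is sorted on its own; no global order is implied.
    val sorted = ds.sortWithinPartitions("value")

    // glom() gathers each partition into an array so the layout is visible.
    sorted.rdd.glom().collect().map(_.toSeq).toSeq
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

 Each inner sequence is in ascending order, but concatenating them does not necessarily give a globally sorted result; use `sort` or `orderBy` when a total order is needed.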
 - sort
```
public Dataset<T> sort(java.lang.String sortCol,
 java.lang.String... sortCols)
```
 Returns a new Dataset sorted by the specified column, all in ascending order.
```
 // The following 3 are equivalent
 ds.sort("sortcol")
 ds.sort($"sortcol")
 ds.sort($"sortcol".asc)
 
```
 Parameters:
 sortCol - (undocumented)
 sortCols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - sort
```
public Dataset<T> sort(Column... sortExprs)
```
 Returns a new Dataset sorted by the given expressions. For example:
```
 ds.sort($"col1", $"col2".desc)
 
```
 Parameters:
 sortExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - orderBy
```
public Dataset<T> orderBy(java.lang.String sortCol,
 java.lang.String... sortCols)
```
 Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
 
 Parameters:
 sortCol - (undocumented)
 sortCols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - orderBy
```
public Dataset<T> orderBy(Column... sortExprs)
```
 Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
 
 Parameters:
 sortExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - select
```
public Dataset<Row> select(Column... cols)
```
 Selects a set of column based expressions.
```
 ds.select($"colA", $"colB" + 1)
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - select
```
public Dataset<Row> select(java.lang.String col,
 java.lang.String... cols)
```
 Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).
```
 // The following two are equivalent:
 ds.select("colA", "colB")
 ds.select($"colA", $"colB")
 
```
 Parameters:
 col - (undocumented)
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - selectExpr
```
public Dataset<Row> selectExpr(java.lang.String... exprs)
```
 Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
```
 // The following are equivalent:
 ds.selectExpr("colA", "colB as newName", "abs(colC)")
 ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
 
```
 Parameters:
 exprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
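 The relationship between `selectExpr` and `select` can be checked with a small Scala sketch. The column names follow the examples above; the sample rows, the object name, and the local master are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col}

object SelectSketch {
  def run(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical three-column input.
    val ds = Seq((1, 10, -3), (2, 20, 4)).toDF("colA", "colB", "colC")

    // SQL expression strings, parsed by Spark.
    val viaStrings = ds.selectExpr("colA", "colB as newName", "abs(colC)")

    // The same projection built from Column expressions.
    val viaColumns = ds.select(col("colA"), col("colB").as("newName"), abs(col("colC")))

    // Compare the collected rows value by value.
    viaStrings.collect().toSeq == viaColumns.collect().toSeq
  }

  def main(args: Array[String]): Unit = println(run())
}
```

 The string form is convenient for ad hoc queries, while the `Column` form catches misspelled function names at compile time.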
 - groupBy
```
public RelationalGroupedDataset groupBy(Column... cols)
```
 Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
 // Compute the average for all numeric columns grouped by department.
 ds.groupBy($"department").avg()

 // Compute the max age and average salary, grouped by department and gender.
 ds.groupBy($"department", $"gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - rollup
```
public RelationalGroupedDataset rollup(Column... cols)
```
 Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
 // Compute the average for all numeric columns rolled up by department and group.
 ds.rollup($"department", $"group").avg()

 // Compute the max age and average salary, rolled up by department and gender.
 ds.rollup($"department", $"gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - cube
```
public RelationalGroupedDataset cube(Column... cols)
```
 Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
 // Compute the average for all numeric columns cubed by department and group.
 ds.cube($"department", $"group").avg()

 // Compute the max age and average salary, cubed by department and gender.
 ds.cube($"department", $"gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
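 The difference between `groupBy`, `rollup`, and `cube` shows up in the number of grouping combinations each produces. The Scala sketch below counts the result rows for a tiny, hypothetical dataset; the object name and the local master are likewise assumptions.

```scala
import org.apache.spark.sql.SparkSession

object GroupingSketch {
  def run(): (Long, Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouping-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: three distinct (department, gender) pairs.
    val ds = Seq(
      ("eng", "F", 100.0),
      ("eng", "M", 80.0),
      ("ops", "F", 90.0)
    ).toDF("department", "gender", "salary")

    // One row per distinct (department, gender) pair.
    val grouped = ds.groupBy($"department", $"gender").avg("salary").count()

    // Adds per-department subtotals and a grand total.
    val rolledUp = ds.rollup($"department", $"gender").avg("salary").count()

    // Adds per-gender subtotals on top of the rollup rows.
    val cubed = ds.cube($"department", $"gender").avg("salary").count()

    (grouped, rolledUp, cubed)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

 For this data the counts are 3, 6, and 8: `rollup` adds two department subtotals and one grand total, and `cube` further adds two gender subtotals. Subtotal rows carry null in the columns that were aggregated away.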
 - groupBy
```
public RelationalGroupedDataset groupBy(java.lang.String col1,
 java.lang.String... cols)
```
 Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
```
 // Compute the average for all numeric columns grouped by department.
 ds.groupBy("department").avg()

 // Compute the max age and average salary, grouped by department and gender.
 ds.groupBy($"department", $"gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 col1 - (undocumented)
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - rollup
```
public RelationalGroupedDataset rollup(java.lang.String col1,
 java.lang.String... cols)
```
 Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
```
 // Compute the average for all numeric columns rolled up by department and group.
 ds.rollup("department", "group").avg()

 // Compute the max age and average salary, rolled up by department and gender.
 ds.rollup("department", "gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 col1 - (undocumented)
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - cube
```
public RelationalGroupedDataset cube(java.lang.String col1,
 java.lang.String... cols)
```
 Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
```
 // Compute the average for all numeric columns cubed by department and group.
 ds.cube("department", "group").avg()

 // Compute the max age and average salary, cubed by department and gender.
 ds.cube("department", "gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 col1 - (undocumented)
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - agg
```
public Dataset<Row> agg(Column expr,
 Column... exprs)
```
 Aggregates on the entire Dataset without groups.
```
 // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
 ds.agg(max($"age"), avg($"salary"))
 ds.groupBy().agg(max($"age"), avg($"salary"))
 
```
 Parameters:
 expr - (undocumented)
 exprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - drop
```
public Dataset<Row> drop(java.lang.String... colNames)
```
 Returns a new Dataset with columns dropped. This is a no-op if the schema doesn't contain the given column name(s).
 
 Parameters:
 colNames - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - describe
```
public Dataset<Row> describe(java.lang.String... cols)
```
 Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
 This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
```
 ds.describe("age", "height").show()

 // output:
 // summary age   height
 // count   10.0  10.0
 // mean    53.3  178.05
 // stddev  11.6  15.7
 // min     18.0  163.0
 // max     92.0  192.0
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - repartition
```
public Dataset<T> repartition(int numPartitions,
 Column... partitionExprs)
```
 Returns a new Dataset partitioned by the given partitioning expressions into numPartitions partitions. The resulting Dataset is hash partitioned.
 This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
 
 Parameters:
 numPartitions - (undocumented)
 partitionExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - repartition
```
public Dataset<T> repartition(Column... partitionExprs)
```
 Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is hash partitioned.
 This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
 
 Parameters:
 partitionExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - sparkSession
```
public SparkSession sparkSession()
```
 - queryExecution
```
public org.apache.spark.sql.execution.QueryExecution queryExecution()
```
 - logicalPlan
```
protected org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan()
```
 - sqlContext
```
public SQLContext sqlContext()
```
 - resolve
```
protected org.apache.spark.sql.catalyst.expressions.NamedExpression resolve(java.lang.String colName)
```
 - numericColumns
```
protected scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> numericColumns()
```
 - toString
```
public java.lang.String toString()
```
 Overrides:
 
 toString in class java.lang.Object
 - toDF
```
public Dataset<Row> toDF()
```
 Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - as
```
public <U> Dataset<U> as(Encoder<U> evidence$2)
```
 :: Experimental :: Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:
   - When U is a class, fields of the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
   - When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
   - When U is a primitive type (e.g. String, Int), the first column of the DataFrame will be used.
 If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
 
 Parameters:
 evidence$2 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - toDF
```
public Dataset<Row> toDF(scala.collection.Seq<java.lang.String> colNames)
```
 Converts this strongly typed collection of data to a generic DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:
```
 val rdd: RDD[(Int, String)] = ...
 rdd.toDF() // this implicit conversion creates a DataFrame with column names `_1` and `_2`
 rdd.toDF("id", "name") // this creates a DataFrame with column names "id" and "name"
 
```
 Parameters:
 colNames - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - schema
```
public StructType schema()
```
 Returns the schema of this Dataset.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - printSchema
```
public void printSchema()
```
 Prints the schema to the console in a nice tree format.
 
 Since:
 
 1.6.0
 - explain
```
public void explain(boolean extended)
```
 Prints the plans (logical and physical) to the console for debugging purposes.
 
 Parameters:
 extended - (undocumented)
 Since:
 
 1.6.0
 - explain
```
public void explain()
```
 Prints the physical plan to the console for debugging purposes.
 
 Since:
 
 1.6.0
 - dtypes
```
public scala.Tuple2<java.lang.String,java.lang.String>[] dtypes()
```
 Returns all column names and their data types as an array.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - columns
```
public java.lang.String[] columns()
```
 Returns all column names as an array.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - isLocal
```
public boolean isLocal()
```
 Returns true if the collect and take methods can be run locally (without any Spark executors).
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - isStreaming
```
public boolean isStreaming()
```
 Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a ContinuousQuery using the startStream() method in DataFrameWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - show
```
public void show(int numRows)
```
 Displays the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right. For example:
```
 year  month AVG('Adj Close) MAX('Adj Close)
 1980  12    0.503218        0.595103
 1981  01    0.523289        0.570307
 1982  02    0.436504        0.475256
 1983  03    0.410516        0.442194
 1984  04    0.450090        0.483521
 
```
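 The truncation and right-alignment rule described above can be sketched in plain Java, without Spark. The helper class CellFormat is hypothetical; in particular, replacing the tail of a long string with "..." so that the cell is exactly 20 characters wide is an assumption about the rendering, not something this page states.

```java
// Hypothetical helper, not Spark code: formats one table cell the way the
// description above reads (truncate long strings, right-align the result).
class CellFormat {

    static final int MAX_WIDTH = 20;

    // Assumption: a truncated string keeps its first 17 characters plus "...".
    static String truncate(String s) {
        return s.length() > MAX_WIDTH ? s.substring(0, MAX_WIDTH - 3) + "..." : s;
    }

    // Pads on the left with spaces so that the text is aligned to the right.
    static String alignRight(String s, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append(' ');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        String[] cells = {"1980", "0.503218", "a string that is longer than twenty characters"};
        for (String cell : cells) {
            System.out.println("|" + alignRight(truncate(cell), MAX_WIDTH) + "|");
        }
    }
}
```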
 Parameters:
 numRows - Number of rows to show
 Since:
 
 1.6.0
 - show
```
public void show()
```
 Displays the top 20 rows of the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right.
 
 Since:
 
 1.6.0
 - show
```
public void show(boolean truncate)
```
 Displays the top 20 rows of the Dataset in a tabular form.
 
 Parameters:
 truncate - Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.
 Since:
 
 1.6.0
 - show
```
public void show(int numRows,
 boolean truncate)
```
 Displays the Dataset in a tabular form. For example:
```
 year  month AVG('Adj Close) MAX('Adj Close)
 1980  12    0.503218        0.595103
 1981  01    0.523289        0.570307
 1982  02    0.436504        0.475256
 1983  03    0.410516        0.442194
 1984  04    0.450090        0.483521
 
```
 Parameters:
 numRows - Number of rows to show
 truncate - Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.
 Since:
 
 1.6.0
 - na
```
public DataFrameNaFunctions na()
```
 Returns a DataFrameNaFunctions for working with missing data.
```
 // Dropping rows containing any null values.
 ds.na.drop()
 
```
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - stat
```
public DataFrameStatFunctions stat()
```
 Returns a DataFrameStatFunctions for working with statistical functions.
```
 // Finding frequent items in column with name 'a'.
 ds.stat.freqItems(Seq("a"))
 
```
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - join
```
public Dataset<Row> join(Dataset<?> right)
```
 Cartesian join with another DataFrame.
 Note that cartesian joins are very expensive without an extra filter that can be pushed down.
 
 Parameters:
 right - Right side of the join operation.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - join
```
public Dataset<Row> join(Dataset<?> right,
 java.lang.String usingColumn)
```
 Inner equi-join with another DataFrame using the given column.
 Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
```
 // Joining df1 and df2 using the column "user_id"
 df1.join(df2, "user_id")
 
```
 Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
 Parameters:
 right - Right side of the join operation.
 usingColumn - Name of the column to join on. This column must exist on both sides.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - join
```
public Dataset<Row> join(Dataset<?> right,
 scala.collection.Seq<java.lang.String> usingColumns)
```
 Inner equi-join with another DataFrame using the given columns.
 Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
```
 // Joining df1 and df2 using the columns "user_id" and "user_name"
 df1.join(df2, Seq("user_id", "user_name"))
 
```
 Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
 Parameters:
 right - Right side of the join operation.
 usingColumns - Names of the columns to join on. These columns must exist on both sides.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - join
```
public Dataset<Row> join(Dataset<?> right,
 scala.collection.Seq<java.lang.String> usingColumns,
 java.lang.String joinType)
```
 Equi-join with another DataFrame using the given columns.
 Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
 Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
 
 Parameters:
 right - Right side of the join operation.
 usingColumns - Names of the columns to join on. These columns must exist on both sides.
 joinType - One of: inner, outer, left_outer, right_outer, leftsemi.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - join
```
public Dataset<Row> join(Dataset<?> right,
 Column joinExprs)
```
 Inner join with another DataFrame, using the given join expression.
```
 // The following two are equivalent:
 df1.join(df2, $"df1Key" === $"df2Key")
 df1.join(df2).where($"df1Key" === $"df2Key")
 
```
 Parameters:
 right - (undocumented)
 joinExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - join
```
public Dataset<Row> join(Dataset<?> right,
 Column joinExprs,
 java.lang.String joinType)
```
 Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.
```
 // Scala:
 import org.apache.spark.sql.functions._
 df1.join(df2, $"df1Key" === $"df2Key", "outer")

 // Java:
 import static org.apache.spark.sql.functions.*;
 df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
 
```
 Parameters:
 right - Right side of the join.
 joinExprs - Join expression.
 joinType - One of: inner, outer, left_outer, right_outer, leftsemi.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - joinWith
```
public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other,
 Column condition,
 java.lang.String joinType)
```
 :: Experimental :: Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.
 This is similar to the relational join function, with one important difference in the result schema. Since joinWith preserves the objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.
 This type of join can be useful both for preserving type-safety with the original object types and for working with relational data where either side of the join has column names in common.
 
 Parameters:
 other - Right side of the join.
 condition - Join expression.
 joinType - One of: inner, outer, left_outer, right_outer, leftsemi.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - joinWith
```
public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other,
 Column condition)
```
 :: Experimental :: Joins this Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.
 
 Parameters:
 other - Right side of the join.
 condition - Join expression.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - sortWithinPartitions
```
public Dataset<T> sortWithinPartitions(java.lang.String sortCol,
 scala.collection.Seq<java.lang.String> sortCols)
```
 Returns a new Dataset with each partition sorted by the given expressions.
 This is the same operation as "SORT BY" in SQL (Hive QL).
 
 Parameters:
 sortCol - (undocumented)
 sortCols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - sortWithinPartitions
```
public Dataset<T> sortWithinPartitions(scala.collection.Seq<Column> sortExprs)
```
 Returns a new Dataset with each partition sorted by the given expressions.
 This is the same operation as "SORT BY" in SQL (Hive QL).
 
 Parameters:
 sortExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - sort
```
public Dataset<T> sort(java.lang.String sortCol,
 scala.collection.Seq<java.lang.String> sortCols)
```
 Returns a new Dataset sorted by the specified columns, all in ascending order.
```
 // The following 3 are equivalent
 ds.sort("sortcol")
 ds.sort($"sortcol")
 ds.sort($"sortcol".asc)
 
```
 Parameters:
 sortCol - (undocumented)
 sortCols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - sort
```
public Dataset<T> sort(scala.collection.Seq<Column> sortExprs)
```
 Returns a new Dataset sorted by the given expressions. For example:
```
 ds.sort($"col1", $"col2".desc)
 
```
 Parameters:
 sortExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - orderBy
```
public Dataset<T> orderBy(java.lang.String sortCol,
 scala.collection.Seq<java.lang.String> sortCols)
```
 Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
 
 Parameters:
 sortCol - (undocumented)
 sortCols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - orderBy
```
public Dataset<T> orderBy(scala.collection.Seq<Column> sortExprs)
```
 Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
 
 Parameters:
 sortExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - apply
```
public Column apply(java.lang.String colName)
```
 Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column like a.b.
 
 Parameters:
 colName - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - col
```
public Column col(java.lang.String colName)
```
 Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column like a.b.
 
 Parameters:
 colName - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - as
```
public Dataset<T> as(java.lang.String alias)
```
 Returns a new Dataset with an alias set.
 
 Parameters:
 alias - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - as
```
public Dataset<T> as(scala.Symbol alias)
```
 (Scala-specific) Returns a new Dataset with an alias set.
 
 Parameters:
 alias - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - alias
```
public Dataset<T> alias(java.lang.String alias)
```
 Returns a new Dataset with an alias set. Same as as.
 
 Parameters:
 alias - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - alias
```
public Dataset<T> alias(scala.Symbol alias)
```
 (Scala-specific) Returns a new Dataset with an alias set. Same as as.
 
 Parameters:
 alias - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - select
```
public Dataset<Row> select(scala.collection.Seq<Column> cols)
```
 Selects a set of column-based expressions.
```
 ds.select($"colA", $"colB" + 1)
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - select
```
public Dataset<Row> select(java.lang.String col,
 scala.collection.Seq<java.lang.String> cols)
```
 Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).
```
 // The following two are equivalent:
 ds.select("colA", "colB")
 ds.select($"colA", $"colB")
 
```
 Parameters:
 col - (undocumented)
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - selectExpr
```
public Dataset<Row> selectExpr(scala.collection.Seq<java.lang.String> exprs)
```
 Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
```
 // The following are equivalent:
 ds.selectExpr("colA", "colB as newName", "abs(colC)")
 ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
 
```
 Parameters:
 exprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - select
```
public <U1> Dataset<U1> select(TypedColumn<T,U1> c1,
 Encoder<U1> evidence$3)
```
 :: Experimental :: Returns a new Dataset by computing the given Column expression for each element.
```
 val ds = Seq(1, 2, 3).toDS()
 val newDS = ds.select(expr("value + 1").as[Int])
 
```
 Parameters:
 c1 - (undocumented)
 evidence$3 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - selectUntyped
```
protected Dataset<?> selectUntyped(scala.collection.Seq<TypedColumn<?,?>> columns)
```
 Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.
 
 Parameters:
 columns - (undocumented)
 
 Returns:
 (undocumented)
 - select
```
public <U1,U2> Dataset<scala.Tuple2<U1,U2>> select(TypedColumn<T,U1> c1,
 TypedColumn<T,U2> c2)
```
 :: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
 
 Parameters:
 c1 - (undocumented)
 c2 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - select
```
public <U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> select(TypedColumn<T,U1> c1,
 TypedColumn<T,U2> c2,
 TypedColumn<T,U3> c3)
```
 :: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
 
 Parameters:
 c1 - (undocumented)
 c2 - (undocumented)
 c3 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - select
```
public <U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> select(TypedColumn<T,U1> c1,
 TypedColumn<T,U2> c2,
 TypedColumn<T,U3> c3,
 TypedColumn<T,U4> c4)
```
 :: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
 
 Parameters:
 c1 - (undocumented)
 c2 - (undocumented)
 c3 - (undocumented)
 c4 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - select
```
public <U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> select(TypedColumn<T,U1> c1,
 TypedColumn<T,U2> c2,
 TypedColumn<T,U3> c3,
 TypedColumn<T,U4> c4,
 TypedColumn<T,U5> c5)
```
 :: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
 
 Parameters:
 c1 - (undocumented)
 c2 - (undocumented)
 c3 - (undocumented)
 c4 - (undocumented)
 c5 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - filter
```
public Dataset<T> filter(Column condition)
```
 Filters rows using the given condition.
```
 // The following are equivalent:
 peopleDs.filter($"age" > 15)
 peopleDs.where($"age" > 15)
 
```
 Parameters:
 condition - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - filter
```
public Dataset<T> filter(java.lang.String conditionExpr)
```
 Filters rows using the given SQL expression.
```
 peopleDs.filter("age > 15")
 
```
 Parameters:
 conditionExpr - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - where
```
public Dataset<T> where(Column condition)
```
 Filters rows using the given condition. This is an alias for filter.
```
 // The following are equivalent:
 peopleDs.filter($"age" > 15)
 peopleDs.where($"age" > 15)
 
```
 Parameters:
 condition - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - where
```
public Dataset<T> where(java.lang.String conditionExpr)
```
 Filters rows using the given SQL expression.
```
 peopleDs.where("age > 15")
 
```
 Parameters:
 conditionExpr - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - groupBy
```
public RelationalGroupedDataset groupBy(scala.collection.Seq<Column> cols)
```
 Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
 // Compute the average for all numeric columns grouped by department.
 ds.groupBy($"department").avg()

 // Compute the max age and average salary, grouped by department and gender.
 ds.groupBy($"department", $"gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - rollup
```
public RelationalGroupedDataset rollup(scala.collection.Seq<Column> cols)
```
 Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
 // Compute the average for all numeric columns rolled up by department and group.
 ds.rollup($"department", $"group").avg()

 // Compute the max age and average salary, rolled up by department and gender.
 ds.rollup($"department", $"gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - cube
```
public RelationalGroupedDataset cube(scala.collection.Seq<Column> cols)
```
 Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
```
 // Compute the average for all numeric columns cubed by department and group.
 ds.cube($"department", $"group").avg()

 // Compute the max age and average salary, cubed by department and gender.
 ds.cube($"department", $"gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - groupBy
```
public RelationalGroupedDataset groupBy(java.lang.String col1,
 scala.collection.Seq<java.lang.String> cols)
```
 Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
```
 // Compute the average for all numeric columns grouped by department.
 ds.groupBy("department").avg()

 // Compute the max age and average salary, grouped by department and gender.
 ds.groupBy("department", "gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 col1 - (undocumented)
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - reduce
```
public T reduce(scala.Function2<T,T,T> func)
```
 :: Experimental :: (Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
 
 Parameters:
 func - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - reduce
```
public T reduce(ReduceFunction<T> func)
```
 :: Experimental :: (Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
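 Why func must be commutative and associative can be seen without Spark. The sketch below is plain Java with hypothetical names (ReduceOrderDemo, reducePartitioned). It assumes, for illustration only, that a distributed reduce first reduces each partition and then combines the partial results, and it compares addition with subtraction under two different partitionings of the same elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical illustration, not Spark code: simulates reducing data that is
// split into partitions, where each partition is reduced on its own and the
// partial results are then reduced again.
class ReduceOrderDemo {

    static int reducePartitioned(List<List<Integer>> partitions, BinaryOperator<Integer> func) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            // Reduce within one partition (partitions are assumed non-empty).
            partials.add(partition.stream().reduce(func).get());
        }
        // Combine the per-partition results.
        return partials.stream().reduce(func).get();
    }

    public static void main(String[] args) {
        // The same four elements, partitioned in two different ways.
        List<List<Integer>> a = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));
        List<List<Integer>> b = Arrays.asList(Arrays.asList(1), Arrays.asList(2, 3, 4));

        // Addition is commutative and associative: both layouts agree.
        System.out.println(reducePartitioned(a, Integer::sum) + " " + reducePartitioned(b, Integer::sum));

        // Subtraction is neither: the result depends on the layout.
        System.out.println(reducePartitioned(a, (x, y) -> x - y) + " " + reducePartitioned(b, (x, y) -> x - y));
    }
}
```

 Under these assumptions addition gives 10 for both layouts, while subtraction gives 0 for the first layout and 6 for the second.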
 
 Parameters:
 func - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - groupByKey
```
public <K> KeyValueGroupedDataset<K,T> groupByKey(scala.Function1<T,K> func,
 Encoder<K> evidence$4)
```
 :: Experimental :: (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
 
 Parameters:
 func - (undocumented)
 evidence$4 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - groupByKey
```
public <K> KeyValueGroupedDataset<K,T> groupByKey(MapFunction<T,K> func,
 Encoder<K> encoder)
```
 :: Experimental :: (Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
 
 Parameters:
 func - (undocumented)
 encoder - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - rollup
```
public RelationalGroupedDataset rollup(java.lang.String col1,
 scala.collection.Seq<java.lang.String> cols)
```
 Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
```
 // Compute the average for all numeric columns rolled up by department and group.
 ds.rollup("department", "group").avg()

 // Compute the max age and average salary, rolled up by department and gender.
 ds.rollup("department", "gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 col1 - (undocumented)
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - cube
```
public RelationalGroupedDataset cube(java.lang.String col1,
 scala.collection.Seq<java.lang.String> cols)
```
 Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
 This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
```
 // Compute the average for all numeric columns cubed by department and group.
 ds.cube("department", "group").avg()

 // Compute the max age and average salary, cubed by department and gender.
 ds.cube("department", "gender").agg(Map(
 "salary" -> "avg",
 "age" -> "max"
 ))
 
```
 Parameters:
 col1 - (undocumented)
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - agg
```
public Dataset<Row> agg(scala.Tuple2<java.lang.String,java.lang.String> aggExpr,
 scala.collection.Seq<scala.Tuple2<java.lang.String,java.lang.String>> aggExprs)
```
 (Scala-specific) Aggregates on the entire Dataset without groups.
```
 // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
 ds.agg("age" -> "max", "salary" -> "avg")
 ds.groupBy().agg("age" -> "max", "salary" -> "avg")
 
```
 Parameters:
 aggExpr - (undocumented)
 aggExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - agg
```
public Dataset<Row> agg(scala.collection.immutable.Map<java.lang.String,java.lang.String> exprs)
```
 (Scala-specific) Aggregates on the entire Dataset without groups.
```
 // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
 ds.agg(Map("age" -> "max", "salary" -> "avg"))
 ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
 
```
 Parameters:
 exprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - agg
```
public Dataset<Row> agg(java.util.Map<java.lang.String,java.lang.String> exprs)
```
 (Java-specific) Aggregates on the entire Dataset without groups.
```
 // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
 java.util.Map<String, String> exprs = new java.util.HashMap<>();
 exprs.put("age", "max");
 exprs.put("salary", "avg");
 ds.agg(exprs);
 ds.groupBy().agg(exprs);
 
```
 Parameters:
 exprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - agg
```
public Dataset<Row> agg(Column expr,
 scala.collection.Seq<Column> exprs)
```
 Aggregates on the entire Dataset without groups.
```
 // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
 ds.agg(max($"age"), avg($"salary"))
 ds.groupBy().agg(max($"age"), avg($"salary"))
 
```
 Parameters:
 expr - (undocumented)
 exprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - limit
```
public Dataset<T> limit(int n)
```
 Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.
 
 Parameters:
 n - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - unionAll
```
public Dataset<T> unionAll(Dataset<T> other)
```
 Deprecated. Use union() instead. Since 2.0.0.
 
 Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL.
 To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
 
 Parameters:
 other - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - union
```
public Dataset<T> union(Dataset<T> other)
```
 Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL.
 To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
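 The difference between UNION ALL and a deduplicating set union can be illustrated with ordinary Java collections, used here only as a stand-in for Datasets. UnionDemo and its methods are hypothetical names, and no Spark API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration with plain lists, not Spark code.
class UnionDemo {

    // UNION ALL semantics: keep every row from both inputs, duplicates included.
    static <T> List<T> unionAll(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    // Set-union semantics: UNION ALL followed by removing duplicates.
    // Keeping first-seen order is a property of this sketch only.
    static <T> List<T> unionDistinct(List<T> left, List<T> right) {
        return new ArrayList<>(new LinkedHashSet<>(unionAll(left, right)));
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);
        List<Integer> right = Arrays.asList(3, 4);
        System.out.println(unionAll(left, right));
        System.out.println(unionDistinct(left, right));
    }
}
```

 Here the value 3 appears twice in the first result and once in the second, which mirrors calling union alone versus union followed by distinct.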
 
 Parameters:
 other - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - intersect
```
public Dataset<T> intersect(Dataset<T> other)
```
 Returns a new Dataset containing only the rows that are in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.
 Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
 
 Parameters:
 other - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - except
```
public Dataset<T> except(Dataset<T> other)
```
 Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT in SQL.
 Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
 
 Parameters:
 other - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
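 The two set operations intersect and except can be compared side by side. The sketch below assumes a local SparkSession; a and b are hypothetical inputs.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("setops-sketch").getOrCreate()
import spark.implicits._

val a = Seq(1, 2, 3).toDS()
val b = Seq(2, 3, 4).toDS()

// Rows present in both a and b (INTERSECT).
val common = a.intersect(b).collect().toSeq.sorted

// Rows present in a but not in b (EXCEPT).
val onlyInA = a.except(b).collect().toSeq.sorted
```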
 - sample
```
public Dataset<T> sample(boolean withReplacement,
 double fraction,
 long seed)
```
 Returns a new Dataset by sampling a fraction of rows.
 
 Parameters:
 withReplacement - Sample with replacement or not.
 fraction - Fraction of rows to generate.
 seed - Seed for sampling.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - sample
```
public Dataset<T> sample(boolean withReplacement,
 double fraction)
```
 Returns a new Dataset by sampling a fraction of rows, using a random seed.
 
 Parameters:
 withReplacement - Sample with replacement or not.
 fraction - Fraction of rows to generate.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
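 As a rough illustration of sampling, the sketch below assumes a local SparkSession and a hypothetical Dataset built with spark.range. The sample size is approximate rather than exactly fraction times the row count; fixing the seed makes the sample repeatable for the same input and partitioning.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("sample-sketch").getOrCreate()

val ds = spark.range(0, 1000)

// Two samples drawn with the same seed from the same input.
val s1 = ds.sample(withReplacement = false, fraction = 0.1, seed = 42L)
val s2 = ds.sample(withReplacement = false, fraction = 0.1, seed = 42L)

val n1 = s1.count()
val n2 = s2.count()
```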
 - randomSplit
```
public Dataset<T>[] randomSplit(double[] weights,
 long seed)
```
 Randomly splits this Dataset with the provided weights.
 For Java API, use randomSplitAsList.
 
 Parameters:
 weights - weights for splits, will be normalized if they don't sum to 1.
 seed - Seed for sampling.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - randomSplitAsList
```
public java.util.List<Dataset<T>> randomSplitAsList(double[] weights,
 long seed)
```
 Returns a Java list that contains the Datasets produced by randomly splitting this Dataset with the provided weights.
 
 Parameters:
 weights - weights for splits, will be normalized if they don't sum to 1.
 seed - Seed for sampling.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - randomSplit
```
public Dataset<T>[] randomSplit(double[] weights)
```
 Randomly splits this Dataset with the provided weights.
 
 Parameters:
 weights - weights for splits, will be normalized if they don't sum to 1.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
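 A typical use of randomSplit is a train/test split. The sketch below assumes a local SparkSession; the names train and test and the 80/20 weights are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("split-sketch").getOrCreate()

val ds = spark.range(0, 100)

// Weights are normalized, so Array(4.0, 1.0) would request the same proportions.
val Array(train, test) = ds.randomSplit(Array(0.8, 0.2), seed = 7L)

// Split sizes are approximate, but for a deterministic input such as this
// one every row lands in exactly one split.
val total = train.count() + test.count()
```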
 - explode
```
public <A extends scala.Product> Dataset<Row> explode(scala.collection.Seq<Column> input,
 scala.Function1<Row,scala.collection.TraversableOnce<A>> f,
 scala.reflect.api.TypeTags.TypeTag<A> evidence$5)
```
 :: Experimental :: (Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.
 The following example uses this function to count the number of books which contain a given word:
```
 case class Book(title: String, words: String)
 val ds: Dataset[Book]

 case class Word(word: String)
 val allWords = ds.explode('words) {
 case Row(words: String) => words.split(" ").map(Word(_))
 }

 val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
 
```
 Parameters:
 input - (undocumented)
 f - (undocumented)
 evidence$5 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - explode
```
public <A,B> Dataset<Row> explode(java.lang.String inputColumn,
 java.lang.String outputColumn,
 scala.Function1<A,scala.collection.TraversableOnce<B>> f,
 scala.reflect.api.TypeTags.TypeTag<B> evidence$6)
```
 :: Experimental :: (Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
```
 ds.explode("words", "word") {words: String => words.split(" ")}
 
```
 Parameters:
 inputColumn - (undocumented)
 outputColumn - (undocumented)
 f - (undocumented)
 evidence$6 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - withColumn
```
public Dataset<Row> withColumn(java.lang.String colName,
 Column col)
```
 Returns a new Dataset by adding a column or replacing the existing column that has the same name.
 
 Parameters:
 colName - (undocumented)
 col - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - withColumnRenamed
```
public Dataset<Row> withColumnRenamed(java.lang.String existingName,
 java.lang.String newName)
```
 Returns a new Dataset with a column renamed. This is a no-op if the schema doesn't contain existingName.
 
 Parameters:
 existingName - (undocumented)
 newName - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - drop
```
public Dataset<Row> drop(java.lang.String colName)
```
 Returns a new Dataset with a column dropped. This is a no-op if the schema doesn't contain the column name.
 
 Parameters:
 colName - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - drop
```
public Dataset<Row> drop(scala.collection.Seq<java.lang.String> colNames)
```
 Returns a new Dataset with columns dropped. This is a no-op if the schema doesn't contain the column name(s).
 
 Parameters:
 colNames - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - drop
```
public Dataset<Row> drop(Column col)
```
 Returns a new Dataset with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.
 
 Parameters:
 col - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
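 The column-level operations withColumn, withColumnRenamed and drop can be sketched together. The example assumes a local SparkSession and a hypothetical two-column DataFrame.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("columns-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")

// New name: a column is appended.
val added = df.withColumn("age_next", col("age") + 1)
// Existing name: the column is replaced in place.
val replaced = df.withColumn("age", col("age") * 2)
// Renaming an existing column, and a no-op rename of a missing one.
val renamed = df.withColumnRenamed("age", "years")
val unchanged = df.withColumnRenamed("missing", "x")
// Dropping by name.
val dropped = df.drop("age")
```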
 - dropDuplicates
```
public Dataset<T> dropDuplicates()
```
 Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - dropDuplicates
```
public Dataset<T> dropDuplicates(scala.collection.Seq<java.lang.String> colNames)
```
 (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
 
 Parameters:
 colNames - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - dropDuplicates
```
public Dataset<T> dropDuplicates(java.lang.String[] colNames)
```
 Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
 
 Parameters:
 colNames - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
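 The difference between deduplicating on all columns and on a subset can be sketched as follows, assuming a local SparkSession and a hypothetical DataFrame. When only a subset of columns is considered, which of the duplicate rows survives is not specified.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("dedup-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 30), ("Ann", 30), ("Ann", 31), ("Bob", 25)).toDF("name", "age")

// All columns: only the exact duplicate ("Ann", 30) is removed.
val uniqueRows = df.dropDuplicates().count()

// Subset of columns: one row per distinct name is kept.
val uniqueNames = df.dropDuplicates(Seq("name")).count()
```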
 - describe
```
public Dataset<Row> describe(scala.collection.Seq<java.lang.String> cols)
```
 Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
 This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
```
 ds.describe("age", "height").show()

 // output:
 // summary age height
 // count 10.0 10.0
 // mean 53.3 178.05
 // stddev 11.6 15.7
 // min 18.0 163.0
 // max 92.0 192.0
 
```
 Parameters:
 cols - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - head
```
public java.lang.Object head(int n)
```
 Returns the first n rows.
 
 Parameters:
 n - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - head
```
public T head()
```
 Returns the first row.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - first
```
public T first()
```
 Returns the first row. Alias for head().
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - transform
```
public <U> Dataset<U> transform(scala.Function1<Dataset<T>,Dataset<U>> t)
```
 Concise syntax for chaining custom transformations.
```
 def featurize(ds: Dataset[T]): Dataset[U] = ...

 ds
 .transform(featurize)
 .transform(...)
 
```
 Parameters:
 t - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - filter
```
public Dataset<T> filter(scala.Function1<T,java.lang.Object> func)
```
 :: Experimental :: (Scala-specific) Returns a new Dataset that only contains elements where func returns true.
 
 Parameters:
 func - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - filter
```
public Dataset<T> filter(FilterFunction<T> func)
```
 :: Experimental :: (Java-specific) Returns a new Dataset that only contains elements where func returns true.
 
 Parameters:
 func - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - map
```
public <U> Dataset<U> map(scala.Function1<T,U> func,
 Encoder<U> evidence$7)
```
 :: Experimental :: (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
 
 Parameters:
 func - (undocumented)
 evidence$7 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - map
```
public <U> Dataset<U> map(MapFunction<T,U> func,
 Encoder<U> encoder)
```
 :: Experimental :: (Java-specific) Returns a new Dataset that contains the result of applying func to each element.
 
 Parameters:
 func - (undocumented)
 encoder - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - mapPartitions
```
public <U> Dataset<U> mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func,
 Encoder<U> evidence$8)
```
 :: Experimental :: (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
 
 Parameters:
 func - (undocumented)
 evidence$8 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - mapPartitions
```
public <U> Dataset<U> mapPartitions(MapPartitionsFunction<T,U> f,
 Encoder<U> encoder)
```
 :: Experimental :: (Java-specific) Returns a new Dataset that contains the result of applying f to each partition.
 
 Parameters:
 f - (undocumented)
 encoder - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
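 mapPartitions receives one iterator per partition and returns an iterator of results. A minimal sketch, assuming a local SparkSession and a hypothetical Dataset of integers spread over two partitions:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("mappartitions-sketch").getOrCreate()
import spark.implicits._

val nums = (1 to 10).toDS().repartition(2)

// Emit one partial sum per partition instead of one value per row.
val partialSums = nums.mapPartitions((it: Iterator[Int]) => Iterator(it.sum))

// Adding the partial sums on the driver gives the overall total.
val total = partialSums.collect().sum
```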
 - flatMap
```
public <U> Dataset<U> flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func,
 Encoder<U> evidence$9)
```
 :: Experimental :: (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
 
 Parameters:
 func - (undocumented)
 evidence$9 - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - flatMap
```
public <U> Dataset<U> flatMap(FlatMapFunction<T,U> f,
 Encoder<U> encoder)
```
 :: Experimental :: (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
 
 Parameters:
 f - (undocumented)
 encoder - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
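 The Scala-specific typed operations filter, map and flatMap can be sketched together. The example assumes a local SparkSession and a hypothetical Dataset of strings; parameter types are written out explicitly so that the Scala overloads are selected.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("typed-ops-sketch").getOrCreate()
import spark.implicits._

val lines = Seq("a b", "", "c").toDS()

// filter keeps the elements for which the function returns true.
val nonEmpty = lines.filter((s: String) => s.nonEmpty)

// map produces exactly one output element per input element.
val lengths = lines.map((s: String) => s.length)

// flatMap produces zero or more output elements per input element.
val words = lines.flatMap((s: String) => s.split(" ").filter(_.nonEmpty).toSeq)
```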
 - foreach
```
public void foreach(scala.Function1<T,scala.runtime.BoxedUnit> f)
```
 Applies a function f to all rows.
 
 Parameters:
 f - (undocumented)
 Since:
 
 1.6.0
 - foreach
```
public void foreach(ForeachFunction<T> func)
```
 (Java-specific) Runs func on each element of this Dataset.
 
 Parameters:
 func - (undocumented)
 Since:
 
 1.6.0
 - foreachPartition
```
public void foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)
```
 Applies a function f to each partition of this Dataset.
 
 Parameters:
 f - (undocumented)
 Since:
 
 1.6.0
 - foreachPartition
```
public void foreachPartition(ForeachPartitionFunction<T> func)
```
 (Java-specific) Runs func on each partition of this Dataset.
 
 Parameters:
 func - (undocumented)
 Since:
 
 1.6.0
 - take
```
public java.lang.Object take(int n)
```
 Returns the first n rows in the Dataset.
 Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
 
 Parameters:
 n - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - takeAsList
```
public java.util.List<T> takeAsList(int n)
```
 Returns the first n rows in the Dataset as a list.
 Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
 
 Parameters:
 n - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - collect
```
public java.lang.Object collect()
```
 Returns an array that contains all rows in this Dataset.
 Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
 For Java API, use collectAsList.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - collectAsList
```
public java.util.List<T> collectAsList()
```
 Returns a Java list that contains all rows in this Dataset.
 Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - toLocalIterator
```
public java.util.Iterator<T> toLocalIterator()
```
 Returns an iterator that contains all rows in this Dataset.
 The iterator will consume as much memory as the largest partition in this Dataset.
 Note: this results in multiple Spark jobs. If the input Dataset is the result of a wide transformation (e.g. a join with different partitioners), cache the input Dataset first to avoid recomputing it.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
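 The driver-side actions differ mainly in how much data they bring back at once. The sketch below assumes a local SparkSession and a hypothetical Dataset built with spark.range; it contrasts take, which fetches a bounded number of rows, with toLocalIterator, which streams rows partition by partition.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("actions-sketch").getOrCreate()

val ds = spark.range(0, 1000)

// take moves only n rows to the driver.
val few = ds.take(3)

// toLocalIterator walks all rows while holding one partition at a time.
val it = ds.toLocalIterator()
var total = 0L
while (it.hasNext) {
  total += it.next().longValue()
}
```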
 - count
```
public long count()
```
 Returns the number of rows in the Dataset.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - repartition
```
public Dataset<T> repartition(int numPartitions)
```
 Returns a new Dataset that has exactly numPartitions partitions.
 
 Parameters:
 numPartitions - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - repartition
```
public Dataset<T> repartition(int numPartitions,
 scala.collection.Seq<Column> partitionExprs)
```
 Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.
 This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
 
 Parameters:
 numPartitions - (undocumented)
 partitionExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - repartition
```
public Dataset<T> repartition(scala.collection.Seq<Column> partitionExprs)
```
 Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.
 This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
 
 Parameters:
 partitionExprs - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - coalesce
```
public Dataset<T> coalesce(int numPartitions)
```
 Returns a new Dataset that has exactly numPartitions partitions when fewer partitions are requested. Similar to coalesce defined on an RDD, this operation results in a narrow dependency: for example, going from 1000 partitions to 100 partitions does not cause a shuffle; instead, each of the 100 new partitions claims 10 of the current partitions. If a larger number of partitions is requested, the Dataset stays at its current number of partitions.
 
 Parameters:
 numPartitions - (undocumented)
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
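 The partitioning operations can be compared by inspecting the partition count of the underlying RDD. This is a sketch that assumes a local SparkSession; the partition counts 8, 4 and 2 are arbitrary.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("partitions-sketch").getOrCreate()

// repartition(n) shuffles the data into exactly n partitions.
val wide = spark.range(0, 100).repartition(8)
val wideCount = wide.rdd.getNumPartitions

// repartition(n, exprs) hash-partitions the rows by the given expressions.
val byId = wide.repartition(4, col("id"))
val byIdCount = byId.rdd.getNumPartitions

// coalesce(n) merges existing partitions without a shuffle.
val narrow = wide.coalesce(2)
val narrowCount = narrow.rdd.getNumPartitions
```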
 - distinct
```
public Dataset<T> distinct()
```
 Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
 Note that, equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - persist
```
public Dataset<T> persist()
```
 Persist this Dataset with the default storage level (MEMORY_AND_DISK).
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - cache
```
public Dataset<T> cache()
```
 Persist this Dataset with the default storage level (MEMORY_AND_DISK).
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - persist
```
public Dataset<T> persist(StorageLevel newLevel)
```
 Persist this Dataset with the given storage level.
 
 Parameters:
 newLevel - One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - unpersist
```
public Dataset<T> unpersist(boolean blocking)
```
 Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
 
 Parameters:
 blocking - Whether to block until all blocks are deleted.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - unpersist
```
public Dataset<T> unpersist()
```
 Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
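 A typical persist and unpersist lifecycle might look like the sketch below. It assumes a local SparkSession; the choice of MEMORY_ONLY is illustrative. Persisting is lazy: the data is cached by the first action that computes it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("persist-sketch").getOrCreate()

val ds = spark.range(0, 1000)

// Mark the Dataset for caching with an explicit storage level.
ds.persist(StorageLevel.MEMORY_ONLY)

// The first action materializes the cache; later actions can reuse it.
val first = ds.count()
val second = ds.count()

// Release the cached blocks, waiting until they are deleted.
ds.unpersist(blocking = true)
```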
 - rdd
```
public RDD<T> rdd()
```
 Represents the content of the Dataset as an RDD of T.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - toJavaRDD
```
public JavaRDD<T> toJavaRDD()
```
 Returns the content of the Dataset as a JavaRDD of T.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - javaRDD
```
public JavaRDD<T> javaRDD()
```
 Returns the content of the Dataset as a JavaRDD of T.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - registerTempTable
```
public void registerTempTable(java.lang.String tableName)
```
 Deprecated. Use createOrReplaceTempView(viewName) instead. Since 2.0.0.
 
 Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this Dataset.
 
 Parameters:
 tableName - (undocumented)
 Since:
 
 1.6.0
 - createTempView
```
public void createTempView(java.lang.String viewName)
 throws AnalysisException
```
 Creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
 
 Parameters:
 viewName - (undocumented)
 
 Throws:
 
 AnalysisException - if the view name already exists
 Since:
 
 2.0.0
 - createOrReplaceTempView
```
public void createOrReplaceTempView(java.lang.String viewName)
```
 Creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
 
 Parameters:
 viewName - (undocumented)
 Since:
 
 2.0.0
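 Once a temporary view is registered, it can be queried with SQL through the same SparkSession. A minimal sketch, assuming a local SparkSession and a hypothetical view name people:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("tempview-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 30), ("Bob", 17)).toDF("name", "age")

// Register (or overwrite) a session-scoped view named "people".
df.createOrReplaceTempView("people")

// The view is visible to SQL run on the same session.
val adults = spark.sql("SELECT name FROM people WHERE age >= 18").as[String].collect().toSeq
```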
 - write
```
public DataFrameWriter write()
```
 :: Experimental :: Interface for saving the content of the Dataset out into external storage or streams.
 
 Returns:
 (undocumented)
 Since:
 
 1.6.0
 - toJSON
```
public Dataset<java.lang.String> toJSON()
```
 Returns the content of the Dataset as a Dataset of JSON strings.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
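 As an illustration, each row becomes one JSON object string whose keys are the column names. The sketch assumes a local SparkSession and a hypothetical single-row DataFrame.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[2]").appName("tojson-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 30)).toDF("name", "age")

// toJSON is a transformation; head() triggers execution and returns the first string.
val json: String = df.toJSON.head()
```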
 - inputFiles
```
public java.lang.String[] inputFiles()
```
 Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
 
 Returns:
 (undocumented)
 Since:
 
 2.0.0
 - javaToPython
```
protected JavaRDD<byte[]> javaToPython()
```
 Converts a JavaRDD to a PythonRDD.
 
 Returns:
 (undocumented)
 - collectToPython
```
protected int collectToPython()
```
 - toPythonIterator
```
protected int toPythonIterator()
```
