public class Dataset<T>
extends java.lang.Object
implements scala.Serializable
Dataset
is a strongly typed collection of domain-specific objects that can be transformed
in parallel using functional or relational operations. Each Dataset also has an untyped view
called a DataFrame
, which is a Dataset of Row
.
Operations available on Datasets are divided into transformations and actions. Transformations
are the ones that produce new Datasets, and actions are the ones that trigger computation and
return results. Example transformations include map, filter, select, and aggregate (groupBy
).
Example actions count, show, or writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally,
a Dataset represents a logical plan that describes the computation required to produce the data.
When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a
physical plan for efficient execution in a parallel and distributed manner. To explore the
logical plan as well as optimized physical plan, use the explain
function.
To efficiently support domain-specific objects, an Encoder
is required. The encoder maps
the domain specific type T
to Spark's internal type system. For example, given a class Person
with two fields, name
(string) and age
(int), an encoder is used to tell Spark to generate
code at runtime to serialize the Person
object into a binary structure. This binary structure
often has much lower memory footprint as well as are optimized for efficiency in data processing
(e.g. in a columnar format). To understand the internal binary representation for data, use the
schema
function.
There are typically two ways to create a Dataset. The most common way is by pointing Spark
to some files on storage systems, using the read
function available on a SparkSession
.
val people = spark.read.parquet("...").as[Person] // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class) // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
val names = people.map(_.name) // in Scala; names is a Dataset[String]
Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING)) // in Java 8
Dataset operations can also be untyped, through various domain-specific-language (DSL)
functions defined in: Dataset
(this class), Column
, and functions
. These operations
are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use apply
method in Scala and col
in Java.
val ageCol = people("age") // in Scala
Column ageCol = people.col("age") // in Java
Note that the Column
type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SQLContext
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SQLContext
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");
people.filter("age".gt(30))
.join(department, people.col("deptId").equalTo(department("id")))
.groupBy(department.col("name"), "gender")
.agg(avg(people.col("salary")), max(people.col("age")));
Constructor and Description |
---|
Dataset(SparkSession sparkSession,
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan,
Encoder<T> encoder) |
Dataset(SQLContext sqlContext,
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan,
Encoder<T> encoder) |
Modifier and Type | Method and Description |
---|---|
Dataset<Row> |
agg(Column expr,
Column... exprs)
Aggregates on the entire
Dataset without groups. |
Dataset<Row> |
agg(Column expr,
scala.collection.Seq<Column> exprs)
Aggregates on the entire
Dataset without groups. |
Dataset<Row> |
agg(scala.collection.immutable.Map<java.lang.String,java.lang.String> exprs)
(Scala-specific) Aggregates on the entire
Dataset without groups. |
Dataset<Row> |
agg(java.util.Map<java.lang.String,java.lang.String> exprs)
(Java-specific) Aggregates on the entire
Dataset without groups. |
Dataset<Row> |
agg(scala.Tuple2<java.lang.String,java.lang.String> aggExpr,
scala.collection.Seq<scala.Tuple2<java.lang.String,java.lang.String>> aggExprs)
(Scala-specific) Aggregates on the entire
Dataset without groups. |
Dataset<T> |
alias(java.lang.String alias)
Returns a new
Dataset with an alias set. |
Dataset<T> |
alias(scala.Symbol alias)
(Scala-specific) Returns a new
Dataset with an alias set. |
Column |
apply(java.lang.String colName)
Selects column based on the column name and return it as a
Column . |
<U> Dataset<U> |
as(Encoder<U> evidence$2)
:: Experimental ::
Returns a new
Dataset where each record has been mapped on to the specified type. |
Dataset<T> |
as(java.lang.String alias)
Returns a new
Dataset with an alias set. |
Dataset<T> |
as(scala.Symbol alias)
(Scala-specific) Returns a new
Dataset with an alias set. |
Dataset<T> |
cache()
Persist this
Dataset with the default storage level (MEMORY_AND_DISK ). |
Dataset<T> |
coalesce(int numPartitions)
Returns a new
Dataset that has exactly numPartitions partitions. |
Column |
col(java.lang.String colName)
Selects column based on the column name and return it as a
Column . |
java.lang.Object |
collect()
|
java.util.List<T> |
collectAsList()
|
protected int |
collectToPython() |
java.lang.String[] |
columns()
Returns all column names as an array.
|
long |
count()
Returns the number of rows in the
Dataset . |
void |
createOrReplaceTempView(java.lang.String viewName)
Creates a temporary view using the given name.
|
void |
createTempView(java.lang.String viewName)
Creates a temporary view using the given name.
|
RelationalGroupedDataset |
cube(Column... cols)
Create a multi-dimensional cube for the current
Dataset using the specified columns,
so we can run aggregation on them. |
RelationalGroupedDataset |
cube(scala.collection.Seq<Column> cols)
Create a multi-dimensional cube for the current
Dataset using the specified columns,
so we can run aggregation on them. |
RelationalGroupedDataset |
cube(java.lang.String col1,
scala.collection.Seq<java.lang.String> cols)
Create a multi-dimensional cube for the current
Dataset using the specified columns,
so we can run aggregation on them. |
RelationalGroupedDataset |
cube(java.lang.String col1,
java.lang.String... cols)
Create a multi-dimensional cube for the current
Dataset using the specified columns,
so we can run aggregation on them. |
Dataset<Row> |
describe(scala.collection.Seq<java.lang.String> cols)
Computes statistics for numeric columns, including count, mean, stddev, min, and max.
|
Dataset<Row> |
describe(java.lang.String... cols)
Computes statistics for numeric columns, including count, mean, stddev, min, and max.
|
Dataset<T> |
distinct()
|
Dataset<Row> |
drop(Column col)
Returns a new
Dataset with a column dropped. |
Dataset<Row> |
drop(scala.collection.Seq<java.lang.String> colNames)
Returns a new
Dataset with columns dropped. |
Dataset<Row> |
drop(java.lang.String... colNames)
Returns a new
Dataset with columns dropped. |
Dataset<Row> |
drop(java.lang.String colName)
Returns a new
Dataset with a column dropped. |
Dataset<T> |
dropDuplicates()
|
Dataset<T> |
dropDuplicates(scala.collection.Seq<java.lang.String> colNames)
(Scala-specific) Returns a new
Dataset with duplicate rows removed, considering only
the subset of columns. |
Dataset<T> |
dropDuplicates(java.lang.String[] colNames)
Returns a new
Dataset with duplicate rows removed, considering only
the subset of columns. |
scala.Tuple2<java.lang.String,java.lang.String>[] |
dtypes()
Returns all column names and their data types as an array.
|
Dataset<T> |
except(Dataset<T> other)
Returns a new
Dataset containing rows in this Dataset but not in another Dataset. |
void |
explain()
Prints the physical plan to the console for debugging purposes.
|
void |
explain(boolean extended)
Prints the plans (logical and physical) to the console for debugging purposes.
|
<A extends scala.Product> |
explode(scala.collection.Seq<Column> input,
scala.Function1<Row,scala.collection.TraversableOnce<A>> f,
scala.reflect.api.TypeTags.TypeTag<A> evidence$5)
:: Experimental ::
(Scala-specific) Returns a new
Dataset where each row has been expanded to zero or more
rows by the provided function. |
<A,B> Dataset<Row> |
explode(java.lang.String inputColumn,
java.lang.String outputColumn,
scala.Function1<A,scala.collection.TraversableOnce<B>> f,
scala.reflect.api.TypeTags.TypeTag<B> evidence$6)
:: Experimental ::
(Scala-specific) Returns a new
Dataset where a single column has been expanded to zero
or more rows by the provided function. |
Dataset<T> |
filter(Column condition)
Filters rows using the given condition.
|
Dataset<T> |
filter(FilterFunction<T> func)
:: Experimental ::
(Java-specific)
Returns a new
Dataset that only contains elements where func returns true . |
Dataset<T> |
filter(scala.Function1<T,java.lang.Object> func)
:: Experimental ::
(Scala-specific)
Returns a new
Dataset that only contains elements where func returns true . |
Dataset<T> |
filter(java.lang.String conditionExpr)
Filters rows using the given SQL expression.
|
T |
first()
Returns the first row.
|
<U> Dataset<U> |
flatMap(FlatMapFunction<T,U> f,
Encoder<U> encoder)
|
<U> Dataset<U> |
flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func,
Encoder<U> evidence$9)
|
void |
foreach(ForeachFunction<T> func)
(Java-specific)
Runs
func on each element of this Dataset . |
void |
foreach(scala.Function1<T,scala.runtime.BoxedUnit> f)
Applies a function
f to all rows. |
void |
foreachPartition(ForeachPartitionFunction<T> func)
(Java-specific)
Runs
func on each partition of this Dataset . |
void |
foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)
Applies a function
f to each partition of this Dataset . |
RelationalGroupedDataset |
groupBy(Column... cols)
Groups the
Dataset using the specified columns, so we can run aggregation on them. |
RelationalGroupedDataset |
groupBy(scala.collection.Seq<Column> cols)
Groups the
Dataset using the specified columns, so we can run aggregation on them. |
RelationalGroupedDataset |
groupBy(java.lang.String col1,
scala.collection.Seq<java.lang.String> cols)
Groups the
Dataset using the specified columns, so that we can run aggregation on them. |
RelationalGroupedDataset |
groupBy(java.lang.String col1,
java.lang.String... cols)
Groups the
Dataset using the specified columns, so that we can run aggregation on them. |
<K> KeyValueGroupedDataset<K,T> |
groupByKey(scala.Function1<T,K> func,
Encoder<K> evidence$4)
:: Experimental ::
(Scala-specific)
Returns a
KeyValueGroupedDataset where the data is grouped by the given key func . |
<K> KeyValueGroupedDataset<K,T> |
groupByKey(MapFunction<T,K> func,
Encoder<K> encoder)
:: Experimental ::
(Java-specific)
Returns a
KeyValueGroupedDataset where the data is grouped by the given key func . |
T |
head()
Returns the first row.
|
java.lang.Object |
head(int n)
Returns the first
n rows. |
java.lang.String[] |
inputFiles()
Returns a best-effort snapshot of the files that compose this Dataset.
|
Dataset<T> |
intersect(Dataset<T> other)
Returns a new
Dataset containing rows only in both this Dataset and another Dataset. |
boolean |
isLocal()
Returns true if the
collect and take methods can be run locally
(without any Spark executors). |
boolean |
isStreaming()
Returns true if this
Dataset contains one or more sources that continuously
return data as it arrives. |
JavaRDD<T> |
javaRDD()
|
protected JavaRDD<byte[]> |
javaToPython()
Converts a JavaRDD to a PythonRDD.
|
Dataset<Row> |
join(Dataset<?> right)
Cartesian join with another
DataFrame . |
Dataset<Row> |
join(Dataset<?> right,
Column joinExprs)
Inner join with another
DataFrame , using the given join expression. |
Dataset<Row> |
join(Dataset<?> right,
Column joinExprs,
java.lang.String joinType)
Join with another
DataFrame , using the given join expression. |
Dataset<Row> |
join(Dataset<?> right,
scala.collection.Seq<java.lang.String> usingColumns)
Inner equi-join with another
DataFrame using the given columns. |
Dataset<Row> |
join(Dataset<?> right,
scala.collection.Seq<java.lang.String> usingColumns,
java.lang.String joinType)
Equi-join with another
DataFrame using the given columns. |
Dataset<Row> |
join(Dataset<?> right,
java.lang.String usingColumn)
Inner equi-join with another
DataFrame using the given column. |
<U> Dataset<scala.Tuple2<T,U>> |
joinWith(Dataset<U> other,
Column condition)
:: Experimental ::
Using inner equi-join to join this
Dataset returning a Tuple2 for each pair
where condition evaluates to true. |
<U> Dataset<scala.Tuple2<T,U>> |
joinWith(Dataset<U> other,
Column condition,
java.lang.String joinType)
:: Experimental ::
Joins this
Dataset returning a Tuple2 for each pair where condition evaluates to
true. |
Dataset<T> |
limit(int n)
Returns a new
Dataset by taking the first n rows. |
protected org.apache.spark.sql.catalyst.plans.logical.LogicalPlan |
logicalPlan() |
<U> Dataset<U> |
map(scala.Function1<T,U> func,
Encoder<U> evidence$7)
:: Experimental ::
(Scala-specific)
Returns a new
Dataset that contains the result of applying func to each element. |
<U> Dataset<U> |
map(MapFunction<T,U> func,
Encoder<U> encoder)
:: Experimental ::
(Java-specific)
Returns a new
Dataset that contains the result of applying func to each element. |
<U> Dataset<U> |
mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func,
Encoder<U> evidence$8)
:: Experimental ::
(Scala-specific)
Returns a new
Dataset that contains the result of applying func to each partition. |
<U> Dataset<U> |
mapPartitions(MapPartitionsFunction<T,U> f,
Encoder<U> encoder)
:: Experimental ::
(Java-specific)
Returns a new
Dataset that contains the result of applying f to each partition. |
DataFrameNaFunctions |
na()
Returns a
DataFrameNaFunctions for working with missing data. |
protected scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> |
numericColumns() |
static Dataset<Row> |
ofRows(SparkSession sparkSession,
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan) |
Dataset<T> |
orderBy(Column... sortExprs)
Returns a new
Dataset sorted by the given expressions. |
Dataset<T> |
orderBy(scala.collection.Seq<Column> sortExprs)
Returns a new
Dataset sorted by the given expressions. |
Dataset<T> |
orderBy(java.lang.String sortCol,
scala.collection.Seq<java.lang.String> sortCols)
Returns a new
Dataset sorted by the given expressions. |
Dataset<T> |
orderBy(java.lang.String sortCol,
java.lang.String... sortCols)
Returns a new
Dataset sorted by the given expressions. |
Dataset<T> |
persist()
Persist this
Dataset with the default storage level (MEMORY_AND_DISK ). |
Dataset<T> |
persist(StorageLevel newLevel)
Persist this
Dataset with the given storage level. |
void |
printSchema()
Prints the schema to the console in a nice tree format.
|
org.apache.spark.sql.execution.QueryExecution |
queryExecution() |
Dataset<T>[] |
randomSplit(double[] weights)
Randomly splits this
Dataset with the provided weights. |
Dataset<T>[] |
randomSplit(double[] weights,
long seed)
Randomly splits this
Dataset with the provided weights. |
java.util.List<Dataset<T>> |
randomSplitAsList(double[] weights,
long seed)
Returns a Java list that contains randomly split
Dataset with the provided weights. |
RDD<T> |
rdd()
|
T |
reduce(scala.Function2<T,T,T> func)
:: Experimental ::
(Scala-specific)
Reduces the elements of this
Dataset using the specified binary function. |
T |
reduce(ReduceFunction<T> func)
:: Experimental ::
(Java-specific)
Reduces the elements of this Dataset using the specified binary function.
|
void |
registerTempTable(java.lang.String tableName)
Deprecated.
Use createOrReplaceTempView(viewName) instead. Since 2.0.0.
|
Dataset<T> |
repartition(Column... partitionExprs)
Returns a new
Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions as number of partitions. |
Dataset<T> |
repartition(int numPartitions)
Returns a new
Dataset that has exactly numPartitions partitions. |
Dataset<T> |
repartition(int numPartitions,
Column... partitionExprs)
Returns a new
Dataset partitioned by the given partitioning expressions into
numPartitions . |
Dataset<T> |
repartition(int numPartitions,
scala.collection.Seq<Column> partitionExprs)
Returns a new
Dataset partitioned by the given partitioning expressions into
numPartitions . |
Dataset<T> |
repartition(scala.collection.Seq<Column> partitionExprs)
Returns a new
Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions as number of partitions. |
protected org.apache.spark.sql.catalyst.expressions.NamedExpression |
resolve(java.lang.String colName) |
RelationalGroupedDataset |
rollup(Column... cols)
Create a multi-dimensional rollup for the current
Dataset using the specified columns,
so we can run aggregation on them. |
RelationalGroupedDataset |
rollup(scala.collection.Seq<Column> cols)
Create a multi-dimensional rollup for the current
Dataset using the specified columns,
so we can run aggregation on them. |
RelationalGroupedDataset |
rollup(java.lang.String col1,
scala.collection.Seq<java.lang.String> cols)
Create a multi-dimensional rollup for the current
Dataset using the specified columns,
so we can run aggregation on them. |
RelationalGroupedDataset |
rollup(java.lang.String col1,
java.lang.String... cols)
Create a multi-dimensional rollup for the current
Dataset using the specified columns,
so we can run aggregation on them. |
Dataset<T> |
sample(boolean withReplacement,
double fraction)
Returns a new
Dataset by sampling a fraction of rows, using a random seed. |
Dataset<T> |
sample(boolean withReplacement,
double fraction,
long seed)
Returns a new
Dataset by sampling a fraction of rows. |
StructType |
schema()
Returns the schema of this
Dataset . |
Dataset<Row> |
select(Column... cols)
Selects a set of column based expressions.
|
Dataset<Row> |
select(scala.collection.Seq<Column> cols)
Selects a set of column based expressions.
|
Dataset<Row> |
select(java.lang.String col,
scala.collection.Seq<java.lang.String> cols)
Selects a set of columns.
|
Dataset<Row> |
select(java.lang.String col,
java.lang.String... cols)
Selects a set of columns.
|
<U1> Dataset<U1> |
select(TypedColumn<T,U1> c1,
Encoder<U1> evidence$3)
|
<U1,U2> Dataset<scala.Tuple2<U1,U2>> |
select(TypedColumn<T,U1> c1,
TypedColumn<T,U2> c2)
|
<U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> |
select(TypedColumn<T,U1> c1,
TypedColumn<T,U2> c2,
TypedColumn<T,U3> c3)
|
<U1,U2,U3,U4> |
select(TypedColumn<T,U1> c1,
TypedColumn<T,U2> c2,
TypedColumn<T,U3> c3,
TypedColumn<T,U4> c4)
|
<U1,U2,U3,U4,U5> |
select(TypedColumn<T,U1> c1,
TypedColumn<T,U2> c2,
TypedColumn<T,U3> c3,
TypedColumn<T,U4> c4,
TypedColumn<T,U5> c5)
|
Dataset<Row> |
selectExpr(scala.collection.Seq<java.lang.String> exprs)
Selects a set of SQL expressions.
|
Dataset<Row> |
selectExpr(java.lang.String... exprs)
Selects a set of SQL expressions.
|
protected Dataset<?> |
selectUntyped(scala.collection.Seq<TypedColumn<?,?>> columns)
Internal helper function for building typed selects that return tuples.
|
void |
show()
Displays the top 20 rows of
Dataset in a tabular form. |
void |
show(boolean truncate)
Displays the top 20 rows of
Dataset in a tabular form. |
void |
show(int numRows)
Displays the
Dataset in a tabular form. |
void |
show(int numRows,
boolean truncate)
Displays the
Dataset in a tabular form. |
Dataset<T> |
sort(Column... sortExprs)
Returns a new
Dataset sorted by the given expressions. |
Dataset<T> |
sort(scala.collection.Seq<Column> sortExprs)
Returns a new
Dataset sorted by the given expressions. |
Dataset<T> |
sort(java.lang.String sortCol,
scala.collection.Seq<java.lang.String> sortCols)
Returns a new
Dataset sorted by the specified column, all in ascending order. |
Dataset<T> |
sort(java.lang.String sortCol,
java.lang.String... sortCols)
Returns a new
Dataset sorted by the specified column, all in ascending order. |
Dataset<T> |
sortWithinPartitions(Column... sortExprs)
Returns a new
Dataset with each partition sorted by the given expressions. |
Dataset<T> |
sortWithinPartitions(scala.collection.Seq<Column> sortExprs)
Returns a new
Dataset with each partition sorted by the given expressions. |
Dataset<T> |
sortWithinPartitions(java.lang.String sortCol,
scala.collection.Seq<java.lang.String> sortCols)
Returns a new
Dataset with each partition sorted by the given expressions. |
Dataset<T> |
sortWithinPartitions(java.lang.String sortCol,
java.lang.String... sortCols)
Returns a new
Dataset with each partition sorted by the given expressions. |
SparkSession |
sparkSession() |
SQLContext |
sqlContext() |
DataFrameStatFunctions |
stat()
Returns a
DataFrameStatFunctions for working statistic functions support. |
java.lang.Object |
take(int n)
Returns the first
n rows in the Dataset . |
java.util.List<T> |
takeAsList(int n)
Returns the first
n rows in the Dataset as a list. |
Dataset<Row> |
toDF()
Converts this strongly typed collection of data to generic Dataframe.
|
Dataset<Row> |
toDF(scala.collection.Seq<java.lang.String> colNames)
Converts this strongly typed collection of data to generic
DataFrame with columns renamed. |
Dataset<Row> |
toDF(java.lang.String... colNames)
Converts this strongly typed collection of data to generic
DataFrame with columns renamed. |
JavaRDD<T> |
toJavaRDD()
|
Dataset<java.lang.String> |
toJSON()
Returns the content of the
Dataset as a Dataset of JSON strings. |
java.util.Iterator<T> |
toLocalIterator()
|
protected int |
toPythonIterator() |
java.lang.String |
toString() |
<U> Dataset<U> |
transform(scala.Function1<Dataset<T>,Dataset<U>> t)
Concise syntax for chaining custom transformations.
|
Dataset<T> |
union(Dataset<T> other)
Returns a new
Dataset containing union of rows in this Dataset and another Dataset. |
Dataset<T> |
unionAll(Dataset<T> other)
Deprecated.
use union(). Since 2.0.0.
|
Dataset<T> |
unpersist()
Mark the
Dataset as non-persistent, and remove all blocks for it from memory and disk. |
Dataset<T> |
unpersist(boolean blocking)
Mark the
Dataset as non-persistent, and remove all blocks for it from memory and disk. |
Dataset<T> |
where(Column condition)
Filters rows using the given condition.
|
Dataset<T> |
where(java.lang.String conditionExpr)
Filters rows using the given SQL expression.
|
Dataset<Row> |
withColumn(java.lang.String colName,
Column col)
Returns a new
Dataset by adding a column or replacing the existing column that has
the same name. |
Dataset<Row> |
withColumnRenamed(java.lang.String existingName,
java.lang.String newName)
Returns a new
Dataset with a column renamed. |
DataFrameWriter |
write()
:: Experimental ::
Interface for saving the content of the
Dataset out into external storage or streams. |
public Dataset(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)
public Dataset(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)
public static Dataset<Row> ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)
public Dataset<Row> toDF(java.lang.String... colNames)
DataFrame
with columns renamed.
This can be quite convenient in conversion from a RDD of tuples into a DataFrame
with
meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2`
rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
colNames
- (undocumented)public Dataset<T> sortWithinPartitions(java.lang.String sortCol, java.lang.String... sortCols)
Dataset
with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortCol
- (undocumented)sortCols
- (undocumented)public Dataset<T> sortWithinPartitions(Column... sortExprs)
Dataset
with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortExprs
- (undocumented)public Dataset<T> sort(java.lang.String sortCol, java.lang.String... sortCols)
Dataset
sorted by the specified column, all in ascending order.
// The following 3 are equivalent
ds.sort("sortcol")
ds.sort($"sortcol")
ds.sort($"sortcol".asc)
sortCol
- (undocumented)sortCols
- (undocumented)public Dataset<T> sort(Column... sortExprs)
Dataset
sorted by the given expressions. For example:
ds.sort($"col1", $"col2".desc)
sortExprs
- (undocumented)public Dataset<T> orderBy(java.lang.String sortCol, java.lang.String... sortCols)
Dataset
sorted by the given expressions.
This is an alias of the sort
function.
sortCol
- (undocumented)sortCols
- (undocumented)public Dataset<T> orderBy(Column... sortExprs)
Dataset
sorted by the given expressions.
This is an alias of the sort
function.
sortExprs
- (undocumented)public Dataset<Row> select(Column... cols)
ds.select($"colA", $"colB" + 1)
cols
- (undocumented)public Dataset<Row> select(java.lang.String col, java.lang.String... cols)
select
that can only select
existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent:
ds.select("colA", "colB")
ds.select($"colA", $"colB")
col
- (undocumented)cols
- (undocumented)public Dataset<Row> selectExpr(java.lang.String... exprs)
select
that accepts
SQL expressions.
// The following are equivalent:
ds.selectExpr("colA", "colB as newName", "abs(colC)")
ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
exprs
- (undocumented)public RelationalGroupedDataset groupBy(Column... cols)
Dataset
using the specified columns, so we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department.
ds.groupBy($"department").avg()
// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public RelationalGroupedDataset rollup(Column... cols)
Dataset
using the specified columns,
so we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns rolluped by department and group.
ds.rollup($"department", $"group").avg()
// Compute the max age and average salary, rolluped by department and gender.
ds.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public RelationalGroupedDataset cube(Column... cols)
Dataset
using the specified columns,
so we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
ds.cube($"department", $"group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public RelationalGroupedDataset groupBy(java.lang.String col1, java.lang.String... cols)
Dataset
using the specified columns, so that we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department.
ds.groupBy("department").avg()
// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public RelationalGroupedDataset rollup(java.lang.String col1, java.lang.String... cols)
Dataset
using the specified columns,
so we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolluped by department and group.
ds.rollup("department", "group").avg()
// Compute the max age and average salary, rolluped by department and gender.
ds.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public RelationalGroupedDataset cube(java.lang.String col1, java.lang.String... cols)
Dataset
using the specified columns,
so we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
ds.cube("department", "group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public Dataset<Row> agg(Column expr, Column... exprs)
Dataset
without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(max($"age"), avg($"salary"))
ds.groupBy().agg(max($"age"), avg($"salary"))
expr
- (undocumented)exprs
- (undocumented)public Dataset<Row> drop(java.lang.String... colNames)
Dataset
with columns dropped.
This is a no-op if schema doesn't contain column name(s).
colNames
- (undocumented)public Dataset<Row> describe(java.lang.String... cols)
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting Dataset
. If you want to
programmatically compute summary statistics, use the agg
function instead.
ds.describe("age", "height").show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
cols
- (undocumented)public Dataset<T> repartition(int numPartitions, Column... partitionExprs)
Dataset
partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
numPartitions
- (undocumented)partitionExprs
- (undocumented)public Dataset<T> repartition(Column... partitionExprs)
Dataset
partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions.
The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
partitionExprs
- (undocumented)public SparkSession sparkSession()
public org.apache.spark.sql.execution.QueryExecution queryExecution()
protected org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan()
public SQLContext sqlContext()
protected org.apache.spark.sql.catalyst.expressions.NamedExpression resolve(java.lang.String colName)
protected scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> numericColumns()
public java.lang.String toString()
toString
in class java.lang.Object
public Dataset<Row> toDF()
Row
objects that allow fields to be accessed by ordinal or name.
public <U> Dataset<U> as(Encoder<U> evidence$2)
Dataset
where each record has been mapped on to the specified type. The
method used to map columns depend on the type of U
:
- When U
is a class, fields for the class will be mapped to columns of the same name
(case sensitivity is determined by spark.sql.caseSensitive
).
- When U
is a tuple, the columns will be be mapped by ordinal (i.e. the first column will
be assigned to _1
).
- When U
is a primitive type (i.e. String, Int, etc), then the first column of the
DataFrame
will be used.
If the schema of the Dataset
does not match the desired U
type, you can use select
along with alias
or as
to rearrange or rename as required.
evidence$2
- (undocumented)public Dataset<Row> toDF(scala.collection.Seq<java.lang.String> colNames)
DataFrame
with columns renamed.
This can be quite convenient in conversion from a RDD of tuples into a DataFrame
with
meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2`
rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
colNames
- (undocumented)public StructType schema()
Dataset
.
public void printSchema()
public void explain(boolean extended)
extended
- (undocumented)public void explain()
public scala.Tuple2<java.lang.String,java.lang.String>[] dtypes()
public java.lang.String[] columns()
public boolean isLocal()
collect
and take
methods can be run locally
(without any Spark executors).
public boolean isStreaming()
Dataset
contains one or more sources that continuously
return data as it arrives. A Dataset
that reads data from a streaming source
must be executed as a ContinuousQuery
using the startStream()
method in
DataFrameWriter
. Methods that return a single answer, e.g. count()
or
collect()
, will throw an AnalysisException
when there is a streaming
source present.
public void show(int numRows)
Dataset
in a tabular form. Strings more than 20 characters will be truncated,
and all cells will be aligned right. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
numRows
- Number of rows to show
public void show()
Dataset
in a tabular form. Strings more than 20 characters
will be truncated, and all cells will be aligned right.
public void show(boolean truncate)
Dataset
in a tabular form.
truncate
- Whether truncate long strings. If true, strings more than 20 characters will
be truncated and all cells will be aligned right
public void show(int numRows, boolean truncate)
Dataset
in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
numRows
- Number of rows to showtruncate
- Whether truncate long strings. If true, strings more than 20 characters will
be truncated and all cells will be aligned right
public DataFrameNaFunctions na()
DataFrameNaFunctions
for working with missing data.
// Dropping rows containing any null values.
ds.na.drop()
public DataFrameStatFunctions stat()
DataFrameStatFunctions
for working statistic functions support.
// Finding frequent items in column with name 'a'.
ds.stat.freqItems(Seq("a"))
public Dataset<Row> join(Dataset<?> right)
DataFrame
.
Note that cartesian joins are very expensive without an extra filter that can be pushed down.
right
- Right side of the join operation.
public Dataset<Row> join(Dataset<?> right, java.lang.String usingColumn)
DataFrame
using the given column.
Different from other join functions, the join column will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
Note that if you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since
there is no way to disambiguate which side of the join you would like to reference.
right
- Right side of the join operation.usingColumn
- Name of the column to join on. This column must exist on both sides.
public Dataset<Row> join(Dataset<?> right, scala.collection.Seq<java.lang.String> usingColumns)
DataFrame
using the given columns.
Different from other join functions, the join columns will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
Note that if you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since
there is no way to disambiguate which side of the join you would like to reference.
right
- Right side of the join operation.usingColumns
- Names of the columns to join on. This columns must exist on both sides.
public Dataset<Row> join(Dataset<?> right, scala.collection.Seq<java.lang.String> usingColumns, java.lang.String joinType)
DataFrame
using the given columns.
Different from other join functions, the join columns will only appear once in the output,
i.e. similar to SQL's JOIN USING
syntax.
Note that if you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since
there is no way to disambiguate which side of the join you would like to reference.
right
- Right side of the join operation.usingColumns
- Names of the columns to join on. This columns must exist on both sides.joinType
- One of: inner
, outer
, left_outer
, right_outer
, leftsemi
.
public Dataset<Row> join(Dataset<?> right, Column joinExprs)
DataFrame
, using the given join expression.
// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
right
- (undocumented)joinExprs
- (undocumented)public Dataset<Row> join(Dataset<?> right, Column joinExprs, java.lang.String joinType)
DataFrame
, using the given join expression. The following performs
a full outer join between df1
and df2
.
// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")
// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
right
- Right side of the join.joinExprs
- Join expression.joinType
- One of: inner
, outer
, left_outer
, right_outer
, leftsemi
.
public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition, java.lang.String joinType)
Dataset
returning a Tuple2
for each pair where condition
evaluates to
true.
This is similar to the relation join
function with one important difference in the
result schema. Since joinWith
preserves objects present on either side of the join, the
result schema is similarly nested into a tuple under the column names _1
and _2
.
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
other
- Right side of the join.condition
- Join expression.joinType
- One of: inner
, outer
, left_outer
, right_outer
, leftsemi
.
public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition)
Dataset
returning a Tuple2
for each pair
where condition
evaluates to true.
other
- Right side of the join.condition
- Join expression.
public Dataset<T> sortWithinPartitions(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)
Dataset
with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortCol
- (undocumented)sortCols
- (undocumented)public Dataset<T> sortWithinPartitions(scala.collection.Seq<Column> sortExprs)
Dataset
with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
sortExprs
- (undocumented)public Dataset<T> sort(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)
Dataset
sorted by the specified column, all in ascending order.
// The following 3 are equivalent
ds.sort("sortcol")
ds.sort($"sortcol")
ds.sort($"sortcol".asc)
sortCol
- (undocumented)sortCols
- (undocumented)public Dataset<T> sort(scala.collection.Seq<Column> sortExprs)
Dataset
sorted by the given expressions. For example:
ds.sort($"col1", $"col2".desc)
sortExprs
- (undocumented)public Dataset<T> orderBy(java.lang.String sortCol, scala.collection.Seq<java.lang.String> sortCols)
Dataset
sorted by the given expressions.
This is an alias of the sort
function.
sortCol
- (undocumented)sortCols
- (undocumented)public Dataset<T> orderBy(scala.collection.Seq<Column> sortExprs)
Dataset
sorted by the given expressions.
This is an alias of the sort
function.
sortExprs
- (undocumented)public Column apply(java.lang.String colName)
Column
.
Note that the column name can also reference to a nested column like a.b
.
colName
- (undocumented)public Column col(java.lang.String colName)
Column
.
Note that the column name can also reference to a nested column like a.b
.
colName
- (undocumented)public Dataset<T> as(java.lang.String alias)
Dataset
with an alias set.
alias
- (undocumented)public Dataset<T> as(scala.Symbol alias)
Dataset
with an alias set.
alias
- (undocumented)public Dataset<T> alias(java.lang.String alias)
Dataset
with an alias set. Same as as
.
alias
- (undocumented)public Dataset<T> alias(scala.Symbol alias)
Dataset
with an alias set. Same as as
.
alias
- (undocumented)public Dataset<Row> select(scala.collection.Seq<Column> cols)
ds.select($"colA", $"colB" + 1)
cols
- (undocumented)public Dataset<Row> select(java.lang.String col, scala.collection.Seq<java.lang.String> cols)
select
that can only select
existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent:
ds.select("colA", "colB")
ds.select($"colA", $"colB")
col
- (undocumented)cols
- (undocumented)public Dataset<Row> selectExpr(scala.collection.Seq<java.lang.String> exprs)
select
that accepts
SQL expressions.
// The following are equivalent:
ds.selectExpr("colA", "colB as newName", "abs(colC)")
ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
exprs
- (undocumented)public <U1> Dataset<U1> select(TypedColumn<T,U1> c1, Encoder<U1> evidence$3)
Dataset
by computing the given Column
expression for each element.
val ds = Seq(1, 2, 3).toDS()
val newDS = ds.select(expr("value + 1").as[Int])
c1
- (undocumented)evidence$3
- (undocumented)protected Dataset<?> selectUntyped(scala.collection.Seq<TypedColumn<?,?>> columns)
columns
- (undocumented)public <U1,U2> Dataset<scala.Tuple2<U1,U2>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2)
Dataset
by computing the given Column
expressions for each element.
c1
- (undocumented)c2
- (undocumented)public <U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3)
Dataset
by computing the given Column
expressions for each element.
c1
- (undocumented)c2
- (undocumented)c3
- (undocumented)public <U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4)
Dataset
by computing the given Column
expressions for each element.
c1
- (undocumented)c2
- (undocumented)c3
- (undocumented)c4
- (undocumented)public <U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5)
Dataset
by computing the given Column
expressions for each element.
c1
- (undocumented)c2
- (undocumented)c3
- (undocumented)c4
- (undocumented)c5
- (undocumented)public Dataset<T> filter(Column condition)
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
condition
- (undocumented)public Dataset<T> filter(java.lang.String conditionExpr)
peopleDs.filter("age > 15")
conditionExpr
- (undocumented)public Dataset<T> where(Column condition)
filter
.
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
condition
- (undocumented)public Dataset<T> where(java.lang.String conditionExpr)
peopleDs.where("age > 15")
conditionExpr
- (undocumented)public RelationalGroupedDataset groupBy(scala.collection.Seq<Column> cols)
Dataset
using the specified columns, so we can run aggregation on them. See
RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department.
ds.groupBy($"department").avg()
// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public RelationalGroupedDataset rollup(scala.collection.Seq<Column> cols)
Dataset
using the specified columns,
so we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns rolluped by department and group.
ds.rollup($"department", $"group").avg()
// Compute the max age and average salary, rolluped by department and gender.
ds.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public RelationalGroupedDataset cube(scala.collection.Seq<Column> cols)
Dataset
using the specified columns,
so we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
ds.cube($"department", $"group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
cols
- (undocumented)public RelationalGroupedDataset groupBy(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)
Dataset
using the specified columns, so that we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department.
ds.groupBy("department").avg()
// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public T reduce(scala.Function2<T,T,T> func)
Dataset
using the specified binary function. The given func
must be commutative and associative or the result may be non-deterministic.
func
- (undocumented)public T reduce(ReduceFunction<T> func)
func
must be commutative and associative or the result may be non-deterministic.
func
- (undocumented)public <K> KeyValueGroupedDataset<K,T> groupByKey(scala.Function1<T,K> func, Encoder<K> evidence$4)
KeyValueGroupedDataset
where the data is grouped by the given key func
.
func
- (undocumented)evidence$4
- (undocumented)public <K> KeyValueGroupedDataset<K,T> groupByKey(MapFunction<T,K> func, Encoder<K> encoder)
KeyValueGroupedDataset
where the data is grouped by the given key func
.
func
- (undocumented)encoder
- (undocumented)public RelationalGroupedDataset rollup(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)
Dataset
using the specified columns,
so we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolluped by department and group.
ds.rollup("department", "group").avg()
// Compute the max age and average salary, rolluped by department and gender.
ds.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public RelationalGroupedDataset cube(java.lang.String col1, scala.collection.Seq<java.lang.String> cols)
Dataset
using the specified columns,
so we can run aggregation on them.
See RelationalGroupedDataset
for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
ds.cube("department", "group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
col1
- (undocumented)cols
- (undocumented)public Dataset<Row> agg(scala.Tuple2<java.lang.String,java.lang.String> aggExpr, scala.collection.Seq<scala.Tuple2<java.lang.String,java.lang.String>> aggExprs)
Dataset
without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg("age" -> "max", "salary" -> "avg")
ds.groupBy().agg("age" -> "max", "salary" -> "avg")
aggExpr
- (undocumented)aggExprs
- (undocumented)public Dataset<Row> agg(scala.collection.immutable.Map<java.lang.String,java.lang.String> exprs)
Dataset
without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
exprs
- (undocumented)public Dataset<Row> agg(java.util.Map<java.lang.String,java.lang.String> exprs)
Dataset
without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
exprs
- (undocumented)public Dataset<Row> agg(Column expr, scala.collection.Seq<Column> exprs)
Dataset
without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(max($"age"), avg($"salary"))
ds.groupBy().agg(max($"age"), avg($"salary"))
expr
- (undocumented)exprs
- (undocumented)public Dataset<T> limit(int n)
Dataset
by taking the first n
rows. The difference between this function
and head
is that head
is an action and returns an array (by triggering query execution)
while limit
returns a new Dataset
.
n
- (undocumented)public Dataset<T> unionAll(Dataset<T> other)
Dataset
containing union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL
in SQL.
To do a SQL-style set union (that does deduplication of elements), use this function followed
by a distinct
.
other
- (undocumented)public Dataset<T> union(Dataset<T> other)
Dataset
containing union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL
in SQL.
To do a SQL-style set union (that does deduplication of elements), use this function followed
by a distinct
.
other
- (undocumented)public Dataset<T> intersect(Dataset<T> other)
Dataset
containing rows only in both this Dataset and another Dataset.
This is equivalent to INTERSECT
in SQL.
Note that, equality checking is performed directly on the encoded representation of the data
and thus is not affected by a custom equals
function defined on T
.
other
- (undocumented)public Dataset<T> except(Dataset<T> other)
Dataset
containing rows in this Dataset but not in another Dataset.
This is equivalent to EXCEPT
in SQL.
Note that, equality checking is performed directly on the encoded representation of the data
and thus is not affected by a custom equals
function defined on T
.
other
- (undocumented)public Dataset<T> sample(boolean withReplacement, double fraction, long seed)
Dataset
by sampling a fraction of rows.
withReplacement
- Sample with replacement or not.fraction
- Fraction of rows to generate.seed
- Seed for sampling.
public Dataset<T> sample(boolean withReplacement, double fraction)
Dataset
by sampling a fraction of rows, using a random seed.
withReplacement
- Sample with replacement or not.fraction
- Fraction of rows to generate.
public Dataset<T>[] randomSplit(double[] weights, long seed)
Dataset
with the provided weights.
weights
- weights for splits, will be normalized if they don't sum to 1.seed
- Seed for sampling.
For Java API, use randomSplitAsList
.
public java.util.List<Dataset<T>> randomSplitAsList(double[] weights, long seed)
Dataset
with the provided weights.
weights
- weights for splits, will be normalized if they don't sum to 1.seed
- Seed for sampling.
public Dataset<T>[] randomSplit(double[] weights)
Dataset
with the provided weights.
weights
- weights for splits, will be normalized if they don't sum to 1.public <A extends scala.Product> Dataset<Row> explode(scala.collection.Seq<Column> input, scala.Function1<Row,scala.collection.TraversableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$5)
Dataset
where each row has been expanded to zero or more
rows by the provided function. This is similar to a LATERAL VIEW
in HiveQL. The columns of
the input row are implicitly joined with each row that is output by the function.
The following example uses this function to count the number of books which contain a given word:
case class Book(title: String, words: String)
val ds: Dataset[Book]
case class Word(word: String)
val allWords = ds.explode('words) {
case Row(words: String) => words.split(" ").map(Word(_))
}
val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
input
- (undocumented)f
- (undocumented)evidence$5
- (undocumented)public <A,B> Dataset<Row> explode(java.lang.String inputColumn, java.lang.String outputColumn, scala.Function1<A,scala.collection.TraversableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$6)
Dataset
where a single column has been expanded to zero
or more rows by the provided function. This is similar to a LATERAL VIEW
in HiveQL. All
columns of the input row are implicitly joined with each value that is output by the function.
ds.explode("words", "word") {words: String => words.split(" ")}
inputColumn
- (undocumented)outputColumn
- (undocumented)f
- (undocumented)evidence$6
- (undocumented)public Dataset<Row> withColumn(java.lang.String colName, Column col)
Dataset
by adding a column or replacing the existing column that has
the same name.
colName
- (undocumented)col
- (undocumented)public Dataset<Row> withColumnRenamed(java.lang.String existingName, java.lang.String newName)
Dataset
with a column renamed.
This is a no-op if schema doesn't contain existingName.
existingName
- (undocumented)newName
- (undocumented)public Dataset<Row> drop(java.lang.String colName)
Dataset
with a column dropped.
This is a no-op if schema doesn't contain column name.
colName
- (undocumented)public Dataset<Row> drop(scala.collection.Seq<java.lang.String> colNames)
Dataset
with columns dropped.
This is a no-op if schema doesn't contain column name(s).
colNames
- (undocumented)public Dataset<Row> drop(Column col)
Dataset
with a column dropped.
This version of drop accepts a Column rather than a name.
This is a no-op if the Dataset doesn't have a column
with an equivalent expression.
col
- (undocumented)public Dataset<T> dropDuplicates()
Dataset
that contains only the unique rows from this Dataset
.
This is an alias for distinct
.
public Dataset<T> dropDuplicates(scala.collection.Seq<java.lang.String> colNames)
Dataset
with duplicate rows removed, considering only
the subset of columns.
colNames
- (undocumented)public Dataset<T> dropDuplicates(java.lang.String[] colNames)
Dataset
with duplicate rows removed, considering only
the subset of columns.
colNames
- (undocumented)public Dataset<Row> describe(scala.collection.Seq<java.lang.String> cols)
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting Dataset
. If you want to
programmatically compute summary statistics, use the agg
function instead.
ds.describe("age", "height").show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
cols
- (undocumented)public java.lang.Object head(int n)
n
rows.
n
- (undocumented)public T head()
public T first()
public <U> Dataset<U> transform(scala.Function1<Dataset<T>,Dataset<U>> t)
def featurize(ds: Dataset[T]): Dataset[U] = ...
ds
.transform(featurize)
.transform(...)
t
- (undocumented)public Dataset<T> filter(scala.Function1<T,java.lang.Object> func)
Dataset
that only contains elements where func
returns true
.
func
- (undocumented)public Dataset<T> filter(FilterFunction<T> func)
Dataset
that only contains elements where func
returns true
.
func
- (undocumented)public <U> Dataset<U> map(scala.Function1<T,U> func, Encoder<U> evidence$7)
Dataset
that contains the result of applying func
to each element.
func
- (undocumented)evidence$7
- (undocumented)public <U> Dataset<U> map(MapFunction<T,U> func, Encoder<U> encoder)
Dataset
that contains the result of applying func
to each element.
func
- (undocumented)encoder
- (undocumented)public <U> Dataset<U> mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$8)
Dataset
that contains the result of applying func
to each partition.
func
- (undocumented)evidence$8
- (undocumented)public <U> Dataset<U> mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder)
Dataset
that contains the result of applying f
to each partition.
f
- (undocumented)encoder
- (undocumented)public <U> Dataset<U> flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func, Encoder<U> evidence$9)
Dataset
by first applying a function to all elements of this Dataset
,
and then flattening the results.
func
- (undocumented)evidence$9
- (undocumented)public <U> Dataset<U> flatMap(FlatMapFunction<T,U> f, Encoder<U> encoder)
Dataset
by first applying a function to all elements of this Dataset
,
and then flattening the results.
f
- (undocumented)encoder
- (undocumented)public void foreach(scala.Function1<T,scala.runtime.BoxedUnit> f)
f
to all rows.
f
- (undocumented)public void foreach(ForeachFunction<T> func)
func
on each element of this Dataset
.
func
- (undocumented)public void foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)
f
to each partition of this Dataset
.
f
- (undocumented)public void foreachPartition(ForeachPartitionFunction<T> func)
func
on each partition of this Dataset
.
func
- (undocumented)public java.lang.Object take(int n)
n
rows in the Dataset
.
Running take requires moving data into the application's driver process, and doing so with
a very large n
can crash the driver process with OutOfMemoryError.
n
- (undocumented)public java.util.List<T> takeAsList(int n)
n
rows in the Dataset
as a list.
Running take requires moving data into the application's driver process, and doing so with
a very large n
can crash the driver process with OutOfMemoryError.
n
- (undocumented)public java.lang.Object collect()
Row
s in this Dataset
.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList
.
public java.util.List<T> collectAsList()
Row
s in this Dataset
.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
public java.util.Iterator<T> toLocalIterator()
Row
s in this Dataset
.
The iterator will consume as much memory as the largest partition in this Dataset
.
Note: this results in multiple Spark jobs, and if the input Dataset is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input Dataset should be cached first.
public long count()
Dataset
.public Dataset<T> repartition(int numPartitions)
Dataset
that has exactly numPartitions
partitions.
numPartitions
- (undocumented)public Dataset<T> repartition(int numPartitions, scala.collection.Seq<Column> partitionExprs)
Dataset
partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
numPartitions
- (undocumented)partitionExprs
- (undocumented)public Dataset<T> repartition(scala.collection.Seq<Column> partitionExprs)
Dataset
partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions.
The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
partitionExprs
- (undocumented)public Dataset<T> coalesce(int numPartitions)
Dataset
that has exactly numPartitions
partitions.
Similar to coalesce defined on an RDD
, this operation results in a narrow dependency, e.g.
if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
the 100 new partitions will claim 10 of the current partitions.
numPartitions
- (undocumented)public Dataset<T> distinct()
Dataset
that contains only the unique rows from this Dataset
.
This is an alias for dropDuplicates
.
Note that, equality checking is performed directly on the encoded representation of the data
and thus is not affected by a custom equals
function defined on T
.
public Dataset<T> persist()
Dataset
with the default storage level (MEMORY_AND_DISK
).
public Dataset<T> cache()
Dataset
with the default storage level (MEMORY_AND_DISK
).
public Dataset<T> persist(StorageLevel newLevel)
Dataset
with the given storage level.newLevel
- One of: MEMORY_ONLY
, MEMORY_AND_DISK
, MEMORY_ONLY_SER
,
MEMORY_AND_DISK_SER
, DISK_ONLY
, MEMORY_ONLY_2
,
MEMORY_AND_DISK_2
, etc.
public Dataset<T> unpersist(boolean blocking)
Dataset
as non-persistent, and remove all blocks for it from memory and disk.
blocking
- Whether to block until all blocks are deleted.
public Dataset<T> unpersist()
Dataset
as non-persistent, and remove all blocks for it from memory and disk.
public void registerTempTable(java.lang.String tableName)
Dataset
as a temporary table using the given name. The lifetime of this
temporary table is tied to the SparkSession
that was used to create this Dataset.
tableName
- (undocumented)public void createTempView(java.lang.String viewName) throws AnalysisException
SparkSession
that was used to create this Dataset.
viewName
- (undocumented)AnalysisException
- if the view name already exists
public void createOrReplaceTempView(java.lang.String viewName)
SparkSession
that was used to create this Dataset.
viewName
- (undocumented)public DataFrameWriter write()
Dataset
out into external storage or streams.
public Dataset<java.lang.String> toJSON()
Dataset
as a Dataset of JSON strings.public java.lang.String[] inputFiles()
protected JavaRDD<byte[]> javaToPython()
protected int collectToPython()
protected int toPythonIterator()