public class DataFrameReader
extends Object

Interface used to load a DataFrame from external storage systems (e.g. file systems, key-value stores, etc.). Use SQLContext.read to access this.
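For illustration, a minimal sketch of obtaining the reader from Java; the application setup and file path are assumptions for the example, not part of this class:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ReadExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ReadExample").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // SQLContext.read() returns a DataFrameReader; the path below is hypothetical.
    DataFrame people = sqlContext.read().json("examples/src/main/resources/people.json");
    people.printSchema();

    sc.stop();
  }
}
```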
| Modifier and Type | Method and Description |
|---|---|
| `DataFrameReader` | `format(String source)` Specifies the input data source format. |
| `DataFrame` | `jdbc(String url, String table, java.util.Properties properties)` Construct a `DataFrame` representing the database table accessible via JDBC URL `url` named `table` and connection properties. |
| `DataFrame` | `jdbc(String url, String table, String[] predicates, java.util.Properties connectionProperties)` Construct a `DataFrame` representing the database table accessible via JDBC URL `url` named `table` using connection properties. |
| `DataFrame` | `jdbc(String url, String table, String columnName, long lowerBound, long upperBound, int numPartitions, java.util.Properties connectionProperties)` Construct a `DataFrame` representing the database table accessible via JDBC URL `url` named `table`. |
| `DataFrame` | `json(JavaRDD<String> jsonRDD)` Loads a `JavaRDD[String]` storing JSON objects (one object per record) and returns the result as a `DataFrame`. |
| `DataFrame` | `json(RDD<String> jsonRDD)` Loads an `RDD[String]` storing JSON objects (one object per record) and returns the result as a `DataFrame`. |
| `DataFrame` | `json(String path)` Loads a JSON file (one object per line) and returns the result as a `DataFrame`. |
| `DataFrame` | `load()` Loads input in as a `DataFrame`, for data sources that don't require a path (e.g. external key-value stores). |
| `DataFrame` | `load(String path)` Loads input in as a `DataFrame`, for data sources that require a path (e.g. data backed by a local or distributed file system). |
| `DataFrameReader` | `option(String key, String value)` Adds an input option for the underlying data source. |
| `DataFrameReader` | `options(scala.collection.Map<String,String> options)` (Scala-specific) Adds input options for the underlying data source. |
| `DataFrameReader` | `options(java.util.Map<String,String> options)` Adds input options for the underlying data source. |
| `DataFrame` | `parquet(scala.collection.Seq<String> paths)` Loads a Parquet file, returning the result as a `DataFrame`. |
| `DataFrame` | `parquet(String... paths)` Loads a Parquet file, returning the result as a `DataFrame`. |
| `DataFrameReader` | `schema(StructType schema)` Specifies the input schema. |
| `DataFrame` | `table(String tableName)` Returns the specified table as a `DataFrame`. |
public DataFrame parquet(String... paths)

Loads a Parquet file, returning the result as a DataFrame. This function returns an empty DataFrame if no paths are passed in.

Parameters:
paths - (undocumented)
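A minimal sketch, reusing the sqlContext from the first snippet; the paths are hypothetical:

```java
// Read one or more Parquet files; paths here are hypothetical.
DataFrame events = sqlContext.read().parquet(
    "hdfs://namenode/data/events-2015.parquet",
    "hdfs://namenode/data/events-2016.parquet");

// With no paths, the result is an empty DataFrame, as documented above.
DataFrame empty = sqlContext.read().parquet();
System.out.println(empty.count()); // 0
```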
public DataFrameReader format(String source)

Specifies the input data source format.

Parameters:
source - (undocumented)
public DataFrameReader schema(StructType schema)

Specifies the input schema.

Parameters:
schema - (undocumented)
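As a sketch, a user-defined schema can be supplied so that a source such as JSON skips the schema-inference scan; the field names and path here are hypothetical:

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical two-column schema; supplying it lets the JSON source skip inference.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.IntegerType, true)
});

DataFrame people = sqlContext.read().schema(schema).json("people.json");
```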
public DataFrameReader option(String key, String value)

Adds an input option for the underlying data source.

Parameters:
key - (undocumented)
value - (undocumented)
public DataFrameReader options(scala.collection.Map<String,String> options)

(Scala-specific) Adds input options for the underlying data source.

Parameters:
options - (undocumented)
public DataFrameReader options(java.util.Map<String,String> options)

Adds input options for the underlying data source.

Parameters:
options - (undocumented)
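A hedged sketch of passing several options at once from Java; the option keys shown ("url", "dbtable") are those conventionally used by the JDBC source and are treated here as assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical JDBC connection details; "url" and "dbtable" are assumed option keys.
Map<String, String> opts = new HashMap<String, String>();
opts.put("url", "jdbc:postgresql://dbhost:5432/sales");
opts.put("dbtable", "public.orders");

DataFrame orders = sqlContext.read().format("jdbc").options(opts).load();
```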
public DataFrame load(String path)

Loads input in as a DataFrame, for data sources that require a path (e.g. data backed by a local or distributed file system).

Parameters:
path - (undocumented)
public DataFrame load()

Loads input in as a DataFrame, for data sources that don't require a path (e.g. external key-value stores).
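For illustration, format, option, and load compose as a builder chain. The spark-csv package name and option key below are examples of an external data source and are assumptions, not part of this API:

```java
// Read CSV via an external data source package (hypothetical example).
DataFrame csv = sqlContext.read()
    .format("com.databricks.spark.csv") // assumed third-party source
    .option("header", "true")           // option key defined by that source
    .load("/data/accounts.csv");        // hypothetical path
```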
public DataFrame jdbc(String url, String table, java.util.Properties properties)

Construct a DataFrame representing the database table accessible via JDBC URL url named table and connection properties.

Parameters:
url - (undocumented)
table - (undocumented)
properties - (undocumented)
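A minimal sketch from Java; the connection URL, table name, and credentials are hypothetical:

```java
import java.util.Properties;

Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "report_user");     // hypothetical
connectionProperties.setProperty("password", "report_pass"); // hypothetical

DataFrame orders = sqlContext.read().jdbc(
    "jdbc:postgresql://dbhost:5432/sales", // hypothetical URL
    "public.orders",                       // hypothetical table
    connectionProperties);
```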
public DataFrame jdbc(String url, String table, String columnName, long lowerBound, long upperBound, int numPartitions, java.util.Properties connectionProperties)

Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function.

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

Parameters:
url - JDBC database url of the form jdbc:subprotocol:subname
table - Name of the table in the external database.
columnName - the name of a column of integral type that will be used for partitioning.
lowerBound - the minimum value of columnName used to decide partition stride
upperBound - the maximum value of columnName used to decide partition stride
numPartitions - the number of partitions; the range lowerBound to upperBound will be split evenly into this many partitions
connectionProperties - JDBC database connection arguments, a list of arbitrary string tag/value pairs. Normally at least a "user" and "password" property should be included.
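For illustration, a hedged sketch of a partitioned read, reusing connectionProperties from the previous sketch; the column name and bounds are hypothetical and would normally come from knowledge of the table:

```java
// Split the read into 8 partitions on the integral column "order_id".
// With bounds 1 and 800000, each partition covers a stride of 100000 ids.
DataFrame orders = sqlContext.read().jdbc(
    "jdbc:postgresql://dbhost:5432/sales", // hypothetical URL
    "public.orders",                       // hypothetical table
    "order_id",                            // hypothetical partitioning column
    1L,                                    // lowerBound
    800000L,                               // upperBound
    8,                                     // numPartitions
    connectionProperties);
```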
public DataFrame jdbc(String url, String table, String[] predicates, java.util.Properties connectionProperties)

Construct a DataFrame representing the database table accessible via JDBC URL url named table using connection properties. The predicates parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame.

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

Parameters:
url - JDBC database url of the form jdbc:subprotocol:subname
table - Name of the table in the external database.
predicates - Condition in the WHERE clause for each partition.
connectionProperties - JDBC database connection arguments, a list of arbitrary string tag/value pairs. Normally at least a "user" and "password" property should be included.
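A hedged sketch with one predicate per partition, again reusing connectionProperties; the column and date values are hypothetical:

```java
// Each predicate becomes the WHERE clause of one partition's query.
String[] predicates = new String[] {
    "created_at <  '2015-01-01'", // partition 1 (hypothetical column/values)
    "created_at >= '2015-01-01'"  // partition 2
};

DataFrame orders = sqlContext.read().jdbc(
    "jdbc:postgresql://dbhost:5432/sales",
    "public.orders",
    predicates,
    connectionProperties);
```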
public DataFrame json(String path)

Loads a JSON file (one object per line) and returns the result as a DataFrame.

This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.

Parameters:
path - input path
public DataFrame json(JavaRDD<String> jsonRDD)

Loads a JavaRDD[String] storing JSON objects (one object per record) and returns the result as a DataFrame.

Unless the schema is specified using the schema function, this function goes through the input once to determine the input schema.

Parameters:
jsonRDD - input RDD with one JSON object per record
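For illustration, a sketch building a JavaRDD of JSON strings in memory, reusing the JavaSparkContext sc and sqlContext from the first snippet; the records are hypothetical:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Hypothetical JSON records, one object per RDD element.
JavaRDD<String> jsonRDD = sc.parallelize(Arrays.asList(
    "{\"name\":\"Alice\",\"age\":30}",
    "{\"name\":\"Bob\",\"age\":25}"));

DataFrame people = sqlContext.read().json(jsonRDD);
people.printSchema(); // schema inferred from the records
```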
public DataFrame json(RDD<String> jsonRDD)

Loads an RDD[String] storing JSON objects (one object per record) and returns the result as a DataFrame.

Unless the schema is specified using the schema function, this function goes through the input once to determine the input schema.

Parameters:
jsonRDD - input RDD with one JSON object per record
public DataFrame parquet(scala.collection.Seq<String> paths)

Loads a Parquet file, returning the result as a DataFrame. This function returns an empty DataFrame if no paths are passed in.

Parameters:
paths - (undocumented)