@InterfaceStability.Evolving
public interface DataSourceReader

A data source reader that is returned by ReadSupport.createReader(DataSourceOptions) or
ReadSupport.createReader(StructType, DataSourceOptions). It can mix in various query
optimization interfaces to speed up the data scan. The actual scan logic is delegated to
InputPartitions, which are returned by planInputPartitions().
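For illustration only (not part of this Javadoc), the following minimal sketch shows a hypothetical reader over an in-memory integer range; the names SimpleReader and RangePartition and the column "i" are assumptions, and error handling is omitted.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical reader over an in-memory range of integers (names are assumptions).
public class SimpleReader implements DataSourceReader {

  @Override
  public StructType readSchema() {
    // The schema Spark sees for this scan: a single int column "i".
    return new StructType().add("i", DataTypes.IntegerType);
  }

  @Override
  public List<InputPartition<InternalRow>> planInputPartitions() {
    // Two InputPartitions -> the scan produces an RDD with two partitions.
    return Arrays.asList(new RangePartition(0, 5), new RangePartition(5, 10));
  }

  // One partition of the scan, covering values in [start, end).
  static class RangePartition implements InputPartition<InternalRow> {
    private final int start;
    private final int end;

    RangePartition(int start, int end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public InputPartitionReader<InternalRow> createPartitionReader() {
      // Called on an executor to produce the rows of this partition.
      return new InputPartitionReader<InternalRow>() {
        private int current = start - 1;

        @Override
        public boolean next() {
          current += 1;
          return current < end;
        }

        @Override
        public InternalRow get() {
          return new GenericInternalRow(new Object[] {current});
        }

        @Override
        public void close() {
          // Nothing to release for this in-memory sketch.
        }
      };
    }
  }
}
```

Spark calls readSchema() while planning the query and planInputPartitions() to create one RDD partition per returned InputPartition; each partition's InputPartitionReader then produces that partition's rows on an executor.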
There are mainly 3 kinds of query optimizations:
1. Operator push-down. E.g., filter push-down, required columns push-down (aka column
pruning), etc. Names of these interfaces start with `SupportsPushDown`. A column-pruning
sketch follows this list.
2. Information reporting. E.g., statistics reporting, ordering reporting, etc. Names of these
interfaces start with `SupportsReporting`.
3. Columnar scan, if the reader implements SupportsScanColumnarBatch.
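For example, a reader can opt into column pruning by also implementing SupportsPushDownRequiredColumns. The sketch below is a hypothetical illustration (the name PrunableReader and the two-column schema are assumptions); partition planning is stubbed out to keep the focus on pruning.

```java
import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownRequiredColumns;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical reader that supports required-columns push-down (column pruning).
public class PrunableReader implements DataSourceReader, SupportsPushDownRequiredColumns {

  // Full physical schema of the underlying storage (assumed two columns).
  private final StructType fullSchema =
      new StructType().add("id", DataTypes.LongType).add("name", DataTypes.StringType);

  // Spark narrows this before the scan via pruneColumns().
  private StructType requiredSchema = fullSchema;

  @Override
  public void pruneColumns(StructType requiredSchema) {
    // Called by Spark with only the columns the query actually needs.
    this.requiredSchema = requiredSchema;
  }

  @Override
  public StructType readSchema() {
    // Must reflect the pruned schema, not the full physical schema.
    return requiredSchema;
  }

  @Override
  public List<InputPartition<InternalRow>> planInputPartitions() {
    // Partition planning omitted in this sketch; see the reader example above.
    return Collections.emptyList();
  }
}
```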
If an exception is thrown when applying any of these query optimizations, the action will fail
and no Spark job will be submitted.
Spark first applies all operator push-down optimizations that this data source supports. Then
Spark collects information this data source reported for further optimizations. Finally Spark
issues the scan request and does the actual data reading.
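As a hypothetical sketch of the information-reporting side, the reader below also implements SupportsReportStatistics and returns assumed size and row-count estimates; the class name and the numbers are illustrative, not from this Javadoc.

```java
import java.util.Collections;
import java.util.List;
import java.util.OptionalLong;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.sources.v2.reader.Statistics;
import org.apache.spark.sql.sources.v2.reader.SupportsReportStatistics;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical reader that reports statistics Spark can use before issuing the scan.
public class StatsReportingReader implements DataSourceReader, SupportsReportStatistics {

  @Override
  public Statistics estimateStatistics() {
    // Report estimated size and row count of the data this reader will scan.
    return new Statistics() {
      @Override
      public OptionalLong sizeInBytes() {
        return OptionalLong.of(10L * 1024 * 1024); // assumed ~10 MB
      }

      @Override
      public OptionalLong numRows() {
        return OptionalLong.of(100_000L); // assumed row count
      }
    };
  }

  @Override
  public StructType readSchema() {
    return new StructType().add("value", DataTypes.StringType);
  }

  @Override
  public List<InputPartition<InternalRow>> planInputPartitions() {
    // Partition planning omitted in this sketch.
    return Collections.emptyList();
  }
}
```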
Modifier and Type | Method and Description
---|---
java.util.List<InputPartition<org.apache.spark.sql.catalyst.InternalRow>> | planInputPartitions() Returns a list of InputPartitions.
StructType | readSchema() Returns the actual schema of this data source reader, which may be different from the physical schema of the underlying storage, as column pruning or other optimizations may happen.
StructType readSchema()

Returns the actual schema of this data source reader, which may be different from the physical
schema of the underlying storage, as column pruning or other optimizations may happen.
java.util.List<InputPartition<org.apache.spark.sql.catalyst.InternalRow>> planInputPartitions()

Returns a list of InputPartitions. Each InputPartition is responsible for creating a data reader
to output data of one RDD partition. The number of input partitions returned here is the same
as the number of RDD partitions this scan outputs.
Note that this may not be a full scan if the data source reader mixes in other optimization
interfaces like column pruning, filter push-down, etc. These optimizations are applied before
Spark issues the scan request.
If this method fails (by throwing an exception), the action will fail and no Spark job will be
submitted.
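To see the partition correspondence described above, a hedged usage sketch: assuming a DataSourceV2 provider class com.example.SimpleSource (a placeholder) whose reader returns two InputPartitions, the loaded DataFrame's underlying RDD has exactly two partitions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionCountCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

    // "com.example.SimpleSource" is a placeholder for a DataSourceV2 class that
    // implements ReadSupport and returns a reader with two InputPartitions.
    Dataset<Row> df = spark.read().format("com.example.SimpleSource").load();

    // One RDD partition per InputPartition returned by planInputPartitions().
    System.out.println(df.rdd().getNumPartitions()); // prints 2 for that reader

    spark.stop();
  }
}
```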