pyspark.SparkContext
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.
When you create a new SparkContext, at least the master and app name should be set, either through the named parameters here or through conf.
Parameters
master : Cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).
appName : A name for your job, to display on the cluster web UI.
sparkHome : Location where Spark is installed on cluster nodes.
pyFiles : Collection of .zip or .py files to send to the cluster and add to PYTHONPATH. These can be paths on the local file system or HDFS, HTTP, HTTPS, or FTP URLs.
environment : A dictionary of environment variables to set on worker nodes.
batchSize : The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.
serializer : The serializer for RDDs (a pyspark.serializers.Serializer).
conf : An object setting Spark properties (a pyspark.SparkConf).
gateway : Use an existing py4j.java_gateway.JavaGateway and JVM; otherwise a new JVM will be instantiated. This is only used internally.
jsc : The JavaSparkContext instance (a py4j.java_gateway.JavaObject). This is only used internally.
profiler_cls : A class of custom Profiler used to do profiling (default is pyspark.profiler.BasicProfiler).
udf_profiler_cls : A class of custom Profiler used to do UDF profiling (default is pyspark.profiler.UDFBasicProfiler).
Notes
Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
A SparkContext instance cannot be shared across multiple processes out of the box, and PySpark does not guarantee multi-processing execution. Use threads instead for concurrent processing purposes.
Examples
>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> sc2 = SparkContext('local', 'test2')
Traceback (most recent call last):
    ...
ValueError: ...
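Since only one SparkContext can be active per JVM, stop the running context before constructing a new one. A brief illustrative continuation of the doctest above (not part of the original example):
>>> sc.stop()
>>> sc2 = SparkContext('local', 'test2')
>>> sc2.appName
'test2'
>>> sc2.stop()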
Methods
accumulator(value[, accum_param])
Create an Accumulator with the given initial value, using a given AccumulatorParam helper object to define how to add values of the data type if provided.
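A small accumulator sketch, assuming an active SparkContext bound to sc:
>>> acc = sc.accumulator(0)
>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
>>> acc.value
10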
addArchive(path)
Add an archive to be downloaded with this Spark job on every node.
addFile(path[, recursive])
Add a file to be downloaded with this Spark job on every node.
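A sketch of the usual addFile pattern; /tmp/data.txt is a hypothetical driver-side file, and workers locate the shipped copy via SparkFiles.get():
>>> from pyspark import SparkFiles
>>> sc.addFile("/tmp/data.txt")  # doctest: +SKIP
>>> def first_line(_):
...     with open(SparkFiles.get("data.txt")) as f:  # path of the downloaded copy on the worker
...         return [f.readline().strip()]
>>> sc.parallelize([0]).flatMap(first_line).collect()  # doctest: +SKIP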
addPyFile(path)
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future.
binaryFiles(path[, minPartitions])
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array.
binaryRecords(path, recordLength)
Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
broadcast(value)
Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions.
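An illustrative doctest, assuming an active sc:
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]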
cancelAllJobs()
Cancel all jobs that have been scheduled or are running.
cancelJobGroup(groupId)
Cancel active jobs for the specified group.
dump_profiles(path)
Dump the profile stats into directory path.
emptyRDD()
Create an RDD that has no partitions or elements.
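A short illustration with an active sc:
>>> sc.emptyRDD().getNumPartitions()
0
>>> sc.emptyRDD().count()
0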
getCheckpointDir()
Return the directory where RDDs are checkpointed.
getConf()
Return a copy of this SparkContext's configuration (a SparkConf).
getLocalProperty(key)
Get a local property set in this thread, or None if it is missing.
getOrCreate([conf])
Get or instantiate a SparkContext and register it as a singleton object.
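A sketch of the singleton behaviour; the master and app name below are hypothetical:
>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setMaster("local").setAppName("example-app")
>>> sc = SparkContext.getOrCreate(conf)
>>> SparkContext.getOrCreate() is sc  # the already-registered instance is returned
True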
hadoopFile(path, inputFormatClass, keyClass, …)
Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
hadoopRDD(inputFormatClass, keyClass, valueClass)
Read an 'old' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
newAPIHadoopFile(path, inputFormatClass, …)
Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
newAPIHadoopRDD(inputFormatClass, keyClass, …)
Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
parallelize(c[, numSlices])
Distribute a local Python collection to form an RDD.
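For instance, with an active sc:
>>> rdd = sc.parallelize([1, 2, 3, 4, 5], 2)
>>> rdd.getNumPartitions()
2
>>> rdd.collect()
[1, 2, 3, 4, 5]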
pickleFile(name[, minPartitions])
Load an RDD previously saved using the RDD.saveAsPickleFile() method.
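A hypothetical round trip through /tmp/pickled (any writable path would do):
>>> sc.parallelize(range(10)).saveAsPickleFile("/tmp/pickled", 5)  # doctest: +SKIP
>>> sorted(sc.pickleFile("/tmp/pickled", 3).collect())  # doctest: +SKIP
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]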
range(start[, end, step, numSlices])
Create a new RDD of int containing elements from start to end (exclusive), increased by step every element.
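For example, with an active sc:
>>> sc.range(5).collect()
[0, 1, 2, 3, 4]
>>> sc.range(2, 4).collect()
[2, 3]
>>> sc.range(1, 7, 2).collect()
[1, 3, 5]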
runJob(rdd, partitionFunc[, partitions, …])
Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.
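An illustrative call (active sc assumed) showing that execution can be limited to a subset of partitions:
>>> my_rdd = sc.parallelize(range(6), 3)
>>> sc.runJob(my_rdd, lambda part: [x * x for x in part])
[0, 1, 4, 9, 16, 25]
>>> sc.runJob(my_rdd, lambda part: [x * x for x in part], [0, 2], True)
[0, 1, 16, 25]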
sequenceFile(path[, keyClass, valueClass, …])
Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
setCheckpointDir(dirName)
Set the directory under which RDDs are going to be checkpointed.
setJobDescription(value)
Set a human readable description of the current job.
setJobGroup(groupId, description[, …])
Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
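A sketch pairing this with cancelJobGroup(); the group id and description are hypothetical:
>>> sc.setJobGroup("nightly_etl", "hypothetical nightly ETL jobs", interruptOnCancel=True)
>>> # actions submitted from this thread now carry the group id "nightly_etl"
>>> sc.cancelJobGroup("nightly_etl")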
setLocalProperty(key, value)
Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.
setLogLevel(logLevel)
Control the log level. This overrides any user-defined log settings.
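A one-liner, assuming an active sc:
>>> sc.setLogLevel("WARN")  # valid levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN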
setSystemProperty(key, value)
Set a Java system property, such as spark.executor.memory.
show_profiles()
Print the profile stats to stdout.
sparkUser()
Get SPARK_USER for the user who is running SparkContext.
statusTracker()
Return a StatusTracker object.
stop()
Shut down the SparkContext.
textFile(name[, minPartitions, use_unicode])
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
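A sketch; /tmp/sample.txt stands in for any text file reachable from every node:
>>> with open("/tmp/sample.txt", "w") as f:  # doctest: +SKIP
...     _ = f.write("Hello world!")
>>> sc.textFile("/tmp/sample.txt").collect()  # doctest: +SKIP
['Hello world!']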
union(rdds)
Build the union of a list of RDDs.
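A minimal illustration, with an active sc:
>>> rdd1 = sc.parallelize([1, 2, 3])
>>> rdd2 = sc.parallelize([4, 5])
>>> sc.union([rdd1, rdd2]).collect()
[1, 2, 3, 4, 5]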
wholeTextFiles(path[, minPartitions, …])
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
Attributes
PACKAGE_EXTENSIONS
applicationId
A unique identifier for the Spark application.
defaultMinPartitions
Default minimum number of partitions for Hadoop RDDs when not given by the user.
defaultParallelism
Default level of parallelism to use when not given by the user (e.g. for reduce tasks).
resources
startTime
Return the epoch time when the Spark Context was started.
uiWebUrl
Return the URL of the SparkUI instance started by this SparkContext.
version
The version of Spark on which this application is running.
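For illustration, a few attribute reads on a hypothetical local context; the values shown are indicative only:
>>> sc.applicationId  # doctest: +SKIP
'local-1640000000000'
>>> sc.uiWebUrl  # doctest: +SKIP
'http://localhost:4040'
>>> sc.defaultParallelism >= 1
True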