pyspark.pandas.read_json

pyspark.pandas.read_json(path, lines=True, index_col=None, **options)

Convert JSON files at the given path to a DataFrame.

Parameters
path : string

File path.

lines : bool, default True

Read the file as one JSON object per line. It must always be True for now.

index_col : str or list of str, optional, default None

Index column of table in Spark.

options : dict

All other options are passed directly to Spark’s data source.

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([['a', 'b'], ['c', 'd']],
...                   columns=['col 1', 'col 2'])
>>> df.to_json(path=r'%s/read_json/foo.json' % path, num_files=1)
>>> ps.read_json(
...     path=r'%s/read_json/foo.json' % path
... ).sort_values(by="col 1")
  col 1 col 2
0     a     b
1     c     d
>>> df.to_json(path=r'%s/read_json/foo.json' % path, num_files=1, lineSep='___')
>>> ps.read_json(
...     path=r'%s/read_json/foo.json' % path, lineSep='___'
... ).sort_values(by="col 1")
  col 1 col 2
0     a     b
1     c     d

You can preserve the index across the round trip as shown below.

>>> df.to_json(path=r'%s/read_json/bar.json' % path, num_files=1, index_col="index")
>>> ps.read_json(
...     path=r'%s/read_json/bar.json' % path, index_col="index"
... ).sort_values(by="col 1")  
      col 1 col 2
index
0         a     b
1         c     d
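
Any other keyword argument is forwarded to Spark’s JSON data source, so reader
options such as primitivesAsString can be passed in the same way as lineSep
above (a sketch; the options actually available depend on your Spark version).
The sample data here is all strings, so the output is unchanged.

>>> ps.read_json(
...     path=r'%s/read_json/foo.json' % path, lineSep='___', primitivesAsString='true'
... ).sort_values(by="col 1")
  col 1 col 2
0     a     b
1     c     d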