Posted to commits@arrow.apache.org by we...@apache.org on 2019/06/12 15:57:37 UTC
[arrow] branch master updated: ARROW-5556: [Doc] [Python] Document JSON reader
This is an automated email from the ASF dual-hosted git repository.
wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new ac4a9ef ARROW-5556: [Doc] [Python] Document JSON reader
ac4a9ef is described below
commit ac4a9ef1096128a1a4b563eaf83f1503e1db953f
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Wed Jun 12 10:57:27 2019 -0500
ARROW-5556: [Doc] [Python] Document JSON reader
Author: Antoine Pitrou <an...@python.org>
Closes #4521 from pitrou/ARROW-5556-py-json-docs and squashes the following commits:
40064de97 <Antoine Pitrou> Address review comments
65f357a6b <Antoine Pitrou> ARROW-5556: Document JSON reader
---
docs/source/python/api/formats.rst | 18 +++++-
docs/source/python/csv.rst | 2 +-
docs/source/python/index.rst | 1 +
docs/source/python/json.rst | 114 +++++++++++++++++++++++++++++++++++++
python/pyarrow/_csv.pyx | 13 +++--
python/pyarrow/_json.pyx | 5 +-
6 files changed, 141 insertions(+), 12 deletions(-)
diff --git a/docs/source/python/api/formats.rst b/docs/source/python/api/formats.rst
index 8de30ec..f8aab4a 100644
--- a/docs/source/python/api/formats.rst
+++ b/docs/source/python/api/formats.rst
@@ -18,13 +18,13 @@
Tabular File Formats
====================
-.. currentmodule:: pyarrow.csv
-
.. _api.csv:
CSV Files
---------
+.. currentmodule:: pyarrow.csv
+
.. autosummary::
:toctree: ../generated/
@@ -46,7 +46,19 @@ Feather Files
read_feather
write_feather
-.. currentmodule:: pyarrow
+.. _api.json:
+
+JSON Files
+----------
+
+.. currentmodule:: pyarrow.json
+
+.. autosummary::
+ :toctree: ../generated/
+
+ ReadOptions
+ ParseOptions
+ read_json
.. _api.parquet:
diff --git a/docs/source/python/csv.rst b/docs/source/python/csv.rst
index 17023b1..96a79e6 100644
--- a/docs/source/python/csv.rst
+++ b/docs/source/python/csv.rst
@@ -21,7 +21,7 @@
Reading CSV files
=================
-Arrow provides preliminary support for reading data from CSV files.
+Arrow supports reading columnar data from CSV files.
The features currently offered are the following:
* multi-threaded or single-threaded reading
diff --git a/docs/source/python/index.rst b/docs/source/python/index.rst
index 7f227c5..09367f4 100644
--- a/docs/source/python/index.rst
+++ b/docs/source/python/index.rst
@@ -43,6 +43,7 @@ files into Arrow structures.
pandas
timestamps
csv
+ json
parquet
cuda
extending
diff --git a/docs/source/python/json.rst b/docs/source/python/json.rst
new file mode 100644
index 0000000..e4abbff
--- /dev/null
+++ b/docs/source/python/json.rst
@@ -0,0 +1,114 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow.json
+.. _json:
+
+Reading JSON files
+==================
+
+Arrow supports reading columnar data from JSON files. In this context, a
+JSON file consists of multiple JSON objects, one per line, representing
+individual data rows. For example, this file represents two rows of data
+with four columns "a", "b", "c", "d":
+
+.. code-block:: json
+
+ {"a": 1, "b": 2.0, "c": "foo", "d": false}
+ {"a": 4, "b": -5.5, "c": null, "d": true}
+
+The features currently offered are the following:
+
+* multi-threaded or single-threaded reading
+* automatic decompression of input files (based on the filename extension,
+ such as ``my_data.json.gz``)
+* sophisticated type inference (see below)
+
+
+Usage
+-----
+
+JSON reading functionality is available through the :mod:`pyarrow.json` module.
+In many cases, you will simply call the :func:`read_json` function
+with the file path you want to read from::
+
+ >>> from pyarrow import json
+ >>> fn = 'my_data.json'
+ >>> table = json.read_json(fn)
+ >>> table
+ pyarrow.Table
+ a: int64
+ b: double
+ c: string
+ d: bool
+ >>> table.to_pandas()
+ a b c d
+ 0 1 2.0 foo False
+ 1 4 -5.5 None True
+
+
+Automatic Type Inference
+------------------------
+
+Arrow :ref:`data types <data.types>` are inferred from the JSON types and
+values of each column:
+
+* JSON null values convert to the ``null`` type, but can fall back to any
+ other type.
+* JSON booleans convert to ``bool_``.
+* JSON numbers convert to ``int64``, falling back to ``float64`` if a
+ non-integer is encountered.
+* JSON strings of the kind "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert
+ to ``timestamp[s]``, falling back to ``utf8`` if a conversion error occurs.
+* JSON arrays convert to a ``list`` type, and inference proceeds recursively
+ on the JSON arrays' values.
+* Nested JSON objects convert to a ``struct`` type, and inference proceeds
+ recursively on the JSON objects' values.
+
+Thus, reading this JSON file:
+
+.. code-block:: json
+
+ {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
+ {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}
+
+returns the following data::
+
+ >>> table = json.read_json("my_data.json")
+ >>> table
+ pyarrow.Table
+ a: list<item: int64>
+ child 0, item: int64
+ b: struct<c: bool, d: timestamp[s]>
+ child 0, c: bool
+ child 1, d: timestamp[s]
+ >>> table.to_pandas()
+ a b
+ 0 [1, 2] {'c': True, 'd': 1991-02-03 00:00:00}
+ 1 [3, 4, 5] {'c': False, 'd': 2019-04-01 00:00:00}
+
+
+Customized parsing
+------------------
+
+To alter the default parsing settings in case of reading JSON files with an
+unusual structure, you should create a :class:`ParseOptions` instance
+and pass it to :func:`read_json`. For example, you can pass an explicit
+:ref:`schema <data.schema>` in order to bypass automatic type inference.
+
+Similarly, you can choose performance settings by passing a
+:class:`ReadOptions` instance to :func:`read_json`.
diff --git a/python/pyarrow/_csv.pyx b/python/pyarrow/_csv.pyx
index 0cc424d..067b830 100644
--- a/python/pyarrow/_csv.pyx
+++ b/python/pyarrow/_csv.pyx
@@ -413,14 +413,15 @@ def read_csv(input_file, read_options=None, parse_options=None,
The location of CSV data. If a string or path, and if it ends
with a recognized compressed file extension (e.g. ".gz" or ".bz2"),
the data is automatically decompressed when reading.
- read_options: ReadOptions, optional
- Options for the CSV reader (see ReadOptions constructor for defaults)
- parse_options: ParseOptions, optional
+ read_options: pyarrow.csv.ReadOptions, optional
+ Options for the CSV reader (see pyarrow.csv.ReadOptions constructor
+ for defaults)
+ parse_options: pyarrow.csv.ParseOptions, optional
Options for the CSV parser
- (see ParseOptions constructor for defaults)
- convert_options: ConvertOptions, optional
+ (see pyarrow.csv.ParseOptions constructor for defaults)
+ convert_options: pyarrow.csv.ConvertOptions, optional
Options for converting CSV data
- (see ConvertOptions constructor for defaults)
+ (see pyarrow.csv.ConvertOptions constructor for defaults)
memory_pool: MemoryPool, optional
Pool to allocate Table memory from
diff --git a/python/pyarrow/_json.pyx b/python/pyarrow/_json.pyx
index b5c839b..ffbf01c 100644
--- a/python/pyarrow/_json.pyx
+++ b/python/pyarrow/_json.pyx
@@ -81,6 +81,7 @@ cdef class ReadOptions:
def block_size(self, value):
self.options.block_size = value
+
cdef class ParseOptions:
"""
Options for parsing JSON files.
@@ -160,9 +161,9 @@ def read_json(input_file, read_options=None, parse_options=None,
----------
input_file: string, path or file-like object
The location of JSON data.
- read_options: ReadOptions, optional
+ read_options: pyarrow.json.ReadOptions, optional
Options for the JSON reader (see ReadOptions constructor for defaults)
- parse_options: ParseOptions, optional
+ parse_options: pyarrow.json.ParseOptions, optional
Options for the JSON parser
(see ParseOptions constructor for defaults)
memory_pool: MemoryPool, optional
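
Illustrative note (not part of the commit): the type inference rules that the new docs/source/python/json.rst describes can be approximated in a few lines of standard-library Python. This is a minimal sketch and not Arrow's implementation. The helper names (is_timestamp, infer, merge, render, infer_schema) are hypothetical, the type names are plain strings copied from the table output shown in the documentation, and timestamp detection is approximated with datetime.strptime.

```python
import json
from datetime import datetime

# The two string shapes the documentation says are tried as timestamps.
TIMESTAMP_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S")

# Documented fallbacks between scalar types.
FALLBACKS = {
    frozenset(("int64", "double")): "double",
    frozenset(("timestamp[s]", "string")): "string",
}


def is_timestamp(text):
    """Approximate the timestamp conversion check with strptime."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def infer(value):
    """Infer a type for one decoded JSON value (nested types are tuples)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must precede int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "timestamp[s]" if is_timestamp(value) else "string"
    if isinstance(value, list):
        item = "null"
        for element in value:
            item = merge(item, infer(element))
        return ("list", item)
    if isinstance(value, dict):
        return ("struct", {key: infer(val) for key, val in value.items()})
    raise TypeError(f"not a JSON value: {value!r}")


def merge(left, right):
    """Combine the types seen so far for a column with a newly seen type."""
    if left == right:
        return left
    if left == "null":  # null can fall back to any other type
        return right
    if right == "null":
        return left
    if isinstance(left, tuple) and isinstance(right, tuple) and left[0] == right[0]:
        if left[0] == "list":
            return ("list", merge(left[1], right[1]))
        fields = dict(left[1])
        for key, typ in right[1].items():
            fields[key] = merge(fields.get(key, "null"), typ)
        return ("struct", fields)
    try:
        return FALLBACKS[frozenset((left, right))]
    except (KeyError, TypeError):
        raise ValueError(f"cannot unify {left!r} and {right!r}") from None


def render(typ):
    """Format a type the way the documented table output prints it."""
    if isinstance(typ, str):
        return typ
    kind, inner = typ
    if kind == "list":
        return f"list<item: {render(inner)}>"
    fields = ", ".join(f"{name}: {render(t)}" for name, t in inner.items())
    return f"struct<{fields}>"


def infer_schema(lines):
    """Infer column types from an iterable of JSON objects, one per line."""
    schema = {}
    for line in lines:
        if not line.strip():
            continue
        for name, value in json.loads(line).items():
            schema[name] = merge(schema.get(name, "null"), infer(value))
    return {name: render(typ) for name, typ in schema.items()}


flat_rows = [
    '{"a": 1, "b": 2.0, "c": "foo", "d": false}',
    '{"a": 4, "b": -5.5, "c": null, "d": true}',
]
print(infer_schema(flat_rows))
# {'a': 'int64', 'b': 'double', 'c': 'string', 'd': 'bool'}
```

Applied to the nested example from the documentation, the same sketch yields list<item: int64> for column "a" and struct<c: bool, d: timestamp[s]> for column "b", which matches the schema shown there. To skip inference with the real reader, the documentation added in this commit points to passing an explicit schema through ParseOptions.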