Posted to commits@arrow.apache.org by we...@apache.org on 2019/06/12 15:57:37 UTC

[arrow] branch master updated: ARROW-5556: [Doc] [Python] Document JSON reader

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new ac4a9ef  ARROW-5556: [Doc] [Python] Document JSON reader
ac4a9ef is described below

commit ac4a9ef1096128a1a4b563eaf83f1503e1db953f
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Wed Jun 12 10:57:27 2019 -0500

    ARROW-5556: [Doc] [Python] Document JSON reader
    
    Author: Antoine Pitrou <an...@python.org>
    
    Closes #4521 from pitrou/ARROW-5556-py-json-docs and squashes the following commits:
    
    40064de97 <Antoine Pitrou> Address review comments
    65f357a6b <Antoine Pitrou> ARROW-5556:   Document JSON reader
---
 docs/source/python/api/formats.rst |  18 +++++-
 docs/source/python/csv.rst         |   2 +-
 docs/source/python/index.rst       |   1 +
 docs/source/python/json.rst        | 114 +++++++++++++++++++++++++++++++++++++
 python/pyarrow/_csv.pyx            |  13 +++--
 python/pyarrow/_json.pyx           |   5 +-
 6 files changed, 141 insertions(+), 12 deletions(-)

diff --git a/docs/source/python/api/formats.rst b/docs/source/python/api/formats.rst
index 8de30ec..f8aab4a 100644
--- a/docs/source/python/api/formats.rst
+++ b/docs/source/python/api/formats.rst
@@ -18,13 +18,13 @@
 Tabular File Formats
 ====================
 
-.. currentmodule:: pyarrow.csv
-
 .. _api.csv:
 
 CSV Files
 ---------
 
+.. currentmodule:: pyarrow.csv
+
 .. autosummary::
    :toctree: ../generated/
 
@@ -46,7 +46,19 @@ Feather Files
    read_feather
    write_feather
 
-.. currentmodule:: pyarrow
+.. _api.json:
+
+JSON Files
+----------
+
+.. currentmodule:: pyarrow.json
+
+.. autosummary::
+   :toctree: ../generated/
+
+   ReadOptions
+   ParseOptions
+   read_json
 
 .. _api.parquet:
 
diff --git a/docs/source/python/csv.rst b/docs/source/python/csv.rst
index 17023b1..96a79e6 100644
--- a/docs/source/python/csv.rst
+++ b/docs/source/python/csv.rst
@@ -21,7 +21,7 @@
 Reading CSV files
 =================
 
-Arrow provides preliminary support for reading data from CSV files.
+Arrow supports reading columnar data from CSV files.
 The features currently offered are the following:
 
 * multi-threaded or single-threaded reading
diff --git a/docs/source/python/index.rst b/docs/source/python/index.rst
index 7f227c5..09367f4 100644
--- a/docs/source/python/index.rst
+++ b/docs/source/python/index.rst
@@ -43,6 +43,7 @@ files into Arrow structures.
    pandas
    timestamps
    csv
+   json
    parquet
    cuda
    extending
diff --git a/docs/source/python/json.rst b/docs/source/python/json.rst
new file mode 100644
index 0000000..e4abbff
--- /dev/null
+++ b/docs/source/python/json.rst
@@ -0,0 +1,114 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow.json
+.. _json:
+
+Reading JSON files
+==================
+
+Arrow supports reading columnar data from JSON files.  In this context, a
+JSON file consists of multiple JSON objects, one per line, representing
+individual data rows.  For example, this file represents two rows of data
+with four columns "a", "b", "c", "d":
+
+.. code-block:: json
+
+   {"a": 1, "b": 2.0, "c": "foo", "d": false}
+   {"a": 4, "b": -5.5, "c": null, "d": true}
+
+The features currently offered are the following:
+
+* multi-threaded or single-threaded reading
+* automatic decompression of input files (based on the filename extension,
+  such as ``my_data.json.gz``)
+* sophisticated type inference (see below)
+
+
+Usage
+-----
+
+JSON reading functionality is available through the :mod:`pyarrow.json` module.
+In many cases, you will simply call the :func:`read_json` function
+with the file path you want to read from::
+
+   >>> from pyarrow import json
+   >>> fn = 'my_data.json'
+   >>> table = json.read_json(fn)
+   >>> table
+   pyarrow.Table
+   a: int64
+   b: double
+   c: string
+   d: bool
+   >>> table.to_pandas()
+      a    b     c      d
+   0  1  2.0   foo  False
+   1  4 -5.5  None   True
+
+
+Automatic Type Inference
+------------------------
+
+Arrow :ref:`data types <data.types>` are inferred from the JSON types and
+values of each column:
+
+* JSON null values convert to the ``null`` type, but can fall back to any
+  other type.
+* JSON booleans convert to ``bool_``.
+* JSON numbers convert to ``int64``, falling back to ``float64`` if a
+  non-integer is encountered.
+* JSON strings of the form "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert
+  to ``timestamp[s]``, falling back to ``utf8`` if a conversion error occurs.
+* JSON arrays convert to a ``list`` type, and inference proceeds recursively
+  on the JSON arrays' values.
+* Nested JSON objects convert to a ``struct`` type, and inference proceeds
+  recursively on the JSON objects' values.
+
+Thus, reading this JSON file:
+
+.. code-block:: json
+
+   {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
+   {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}
+
+returns the following data::
+
+   >>> table = json.read_json("my_data.json")
+   >>> table
+   pyarrow.Table
+   a: list<item: int64>
+     child 0, item: int64
+   b: struct<c: bool, d: timestamp[s]>
+     child 0, c: bool
+     child 1, d: timestamp[s]
+   >>> table.to_pandas()
+              a                                       b
+   0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
+   1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}
+
+
+Customized parsing
+------------------
+
+To alter the default parsing settings when reading JSON files with an
+unusual structure, create a :class:`ParseOptions` instance and pass it
+to :func:`read_json`.  For example, you can pass an explicit
+:ref:`schema <data.schema>` to bypass automatic type inference.
+
+Similarly, you can tune reading performance by passing a
+:class:`ReadOptions` instance to :func:`read_json`.
diff --git a/python/pyarrow/_csv.pyx b/python/pyarrow/_csv.pyx
index 0cc424d..067b830 100644
--- a/python/pyarrow/_csv.pyx
+++ b/python/pyarrow/_csv.pyx
@@ -413,14 +413,15 @@ def read_csv(input_file, read_options=None, parse_options=None,
         The location of CSV data.  If a string or path, and if it ends
         with a recognized compressed file extension (e.g. ".gz" or ".bz2"),
         the data is automatically decompressed when reading.
-    read_options: ReadOptions, optional
-        Options for the CSV reader (see ReadOptions constructor for defaults)
-    parse_options: ParseOptions, optional
+    read_options: pyarrow.csv.ReadOptions, optional
+        Options for the CSV reader (see pyarrow.csv.ReadOptions constructor
+        for defaults)
+    parse_options: pyarrow.csv.ParseOptions, optional
         Options for the CSV parser
-        (see ParseOptions constructor for defaults)
-    convert_options: ConvertOptions, optional
+        (see pyarrow.csv.ParseOptions constructor for defaults)
+    convert_options: pyarrow.csv.ConvertOptions, optional
         Options for converting CSV data
-        (see ConvertOptions constructor for defaults)
+        (see pyarrow.csv.ConvertOptions constructor for defaults)
     memory_pool: MemoryPool, optional
         Pool to allocate Table memory from
 
diff --git a/python/pyarrow/_json.pyx b/python/pyarrow/_json.pyx
index b5c839b..ffbf01c 100644
--- a/python/pyarrow/_json.pyx
+++ b/python/pyarrow/_json.pyx
@@ -81,6 +81,7 @@ cdef class ReadOptions:
     def block_size(self, value):
         self.options.block_size = value
 
+
 cdef class ParseOptions:
     """
     Options for parsing JSON files.
@@ -160,9 +161,9 @@ def read_json(input_file, read_options=None, parse_options=None,
     ----------
     input_file: string, path or file-like object
         The location of JSON data.
-    read_options: ReadOptions, optional
+    read_options: pyarrow.json.ReadOptions, optional
         Options for the JSON reader (see ReadOptions constructor for defaults)
-    parse_options: ParseOptions, optional
+    parse_options: pyarrow.json.ParseOptions, optional
         Options for the JSON parser
         (see ParseOptions constructor for defaults)
     memory_pool: MemoryPool, optional