Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/13 22:23:28 UTC

[GitHub] [arrow] ryan-johnson-databricks opened a new issue, #33662: Support parsing a StringArray full of JSON to a Table

ryan-johnson-databricks opened a new issue, #33662:
URL: https://github.com/apache/arrow/issues/33662

   ### Describe the enhancement requested
   
   Working with pyarrow, there doesn't seem to be any way to parse a `StringArray` full of JSON into an `Array` or `Table` of nested data. This surprises me, because [pyarrow.json.read_json](https://arrow.apache.org/docs/python/json.html) does exactly the right thing... but only for line-delimited JSON files. At least, I didn't see anything in e.g. [pyarrow.compute](https://arrow.apache.org/docs/python/compute.html), and a Google search came up empty.
   
   Looking deeper, there doesn't seem to be anything on the C++ side, either.
   
   Skimming the C++ sources, the code is a bit convoluted, but I think there's core logic that could be wrapped up into a proper compute function. If I read correctly:
   * [TableReaderImpl::Read](https://github.com/apache/arrow/blob/master/cpp/src/arrow/json/reader.cc#L268)
   * calls [TableReaderImpl::ParseAndInsert](https://github.com/apache/arrow/blob/master/cpp/src/arrow/json/reader.cc#L287)
   * calls [ParseBlock](https://github.com/apache/arrow/blob/master/cpp/src/arrow/json/reader.cc#L159)
   * which uses a [HandlerBase](https://github.com/apache/arrow/blob/master/cpp/src/arrow/json/parser.cc#L649) instance that implements `BlockParser`
   * whose [doParse method](https://github.com/apache/arrow/blob/master/cpp/src/arrow/json/parser.cc#L773) uses a rapidjson `Reader` to do the heavy lifting.
   
   I _think_ to parse a `StringArray` (instead of a file), we'd just need an `ArrayParser` variant, similar to `BlockParser`, that presents rapidjson with a different buffer for each string to be parsed (maybe a [std::spanstream](https://en.cppreference.com/w/cpp/io/basic_ispanstream)), so that EOF becomes the "delimiter" instead of newline?
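
   To illustrate the semantics I have in mind (not the proposed C++ implementation), here is a minimal Python sketch using only the standard `json` module. Each element is treated as its own complete document, so the end of the string is the delimiter and newlines inside an element are harmless. The function name `parse_each` and the choice to pass nulls through as `None` are assumptions for illustration.

```python
import json
from typing import Any, List, Optional


def parse_each(strings: List[Optional[str]]) -> List[Any]:
    """Parse every element as a separate JSON document; None stays None."""
    return [None if s is None else json.loads(s) for s in strings]


# The third element spans several lines, which a newline-delimited
# reader would split into separate (invalid) records.
rows = parse_each(['{"a": 1}', None, '{\n  "a": 2,\n  "b": [1, 2]\n}'])
```

   The resulting list of dicts could then be handed to `pyarrow.array` to infer a struct type, though that again goes through Python objects rather than parsing directly into Arrow buffers.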
   
   ### Component(s)
   
   C++, Python

