Posted to issues@arrow.apache.org by "R-JunmingChen (via GitHub)" <gi...@apache.org> on 2023/04/13 03:42:56 UTC

[GitHub] [arrow] R-JunmingChen opened a new issue, #35096: The parameter newlines_in_values doesn't work for pyarrow.json

R-JunmingChen opened a new issue, #35096:
URL: https://github.com/apache/arrow/issues/35096

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I use the following code to read a JSON file with pyarrow 11.0, installed from Anaconda:
   
   ```
   import pyarrow.json as pj
   json_f = pj.read_json("test.json", parse_options=pj.ParseOptions(newlines_in_values=False))
   json_t = pj.read_json("test.json", parse_options=pj.ParseOptions(newlines_in_values=True))
   ```
   Here is the content of test.json:
   ```
   
   
   {
       "name"
   
       :
       12312
       ,
       "b"
       :
       "test\\n"
   }
   
   {"name":123
   
   ,
   
   "b":
   "\n89\n"}
   
   {"name":123123
   
   }
   
   
   
   ```
   However, the values of json_f and json_t are the same:
   
   ```
   pyarrow.Table
   name: int64
   b: string
   ----
   name: [[12312,123,123123]]
   b: [["test\n","
   89
   ",null]]
   ```
   It seems that `newlines_in_values` has no effect.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #35096: [Python] The parse_options parameter newlines_in_values doesn't work when reading JSON

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35096:
URL: https://github.com/apache/arrow/issues/35096#issuecomment-1511904818

   CC @benibus 




[GitHub] [arrow] R-JunmingChen commented on issue #35096: [Python] The parse_options parameter newlines_in_values doesn't work when reading JSON

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35096:
URL: https://github.com/apache/arrow/issues/35096#issuecomment-1529412834

   > Sorry for the delay. `newlines_in_values` shouldn't actually affect the resulting table. It mostly serves as a warning to the reader that the source's JSON objects can't be reliably delimited by raw newlines - so a more expensive chunking path is taken prior to each chunk being parsed individually. Otherwise, parsing errors are very likely.
   > 
   > In your case, when `newlines_in_values=false`, you would get an error if you set `ReadOptions::block_size` to 64 (where the file size is 120). However, it would work just fine with `newlines_in_values=true`.
   > 
   > That being said, I'm not entirely sure why `newlines_in_values` isn't in `ReadOptions` instead. Looking at the C++ implementation, the option doesn't appear to be used by the parser at all.
   
   That resolves my confusion.
   Maybe we should refine the documentation of `parse_options`? It is hard to work out what `newlines_in_values` actually does.




[GitHub] [arrow] benibus commented on issue #35096: [Python] The parse_options parameter newlines_in_values doesn't work when reading JSON

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on issue #35096:
URL: https://github.com/apache/arrow/issues/35096#issuecomment-1522485993

   Sorry for the delay. `newlines_in_values` shouldn't actually affect the resulting table. It mostly serves as a warning to the reader that the source's JSON objects can't be reliably delimited by raw newlines - so a more expensive chunking path is taken prior to each chunk being parsed individually. Otherwise, parsing errors are very likely.
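
To illustrate the chunking point above, here is a toy sketch that uses only Python's standard `json` module. It shows why cutting blocks at raw newlines breaks when an object spans several lines, and why a chunker that scans for object boundaries (at extra cost) does not. This is not Arrow's implementation: `chunk_at_newlines`, `chunk_at_object_ends`, `parse_concatenated` and `parse_all` are hypothetical helpers, and the sample data and the block size of 16 are made up for the illustration.

```python
import json

# Made-up sample: three objects, the first two spread over several lines.
DATA = '{\n  "name": 1,\n  "b": "x"\n}\n{"name": 2,\n "b": "y"}\n{"name": 3}\n'

_decoder = json.JSONDecoder()


def skip_whitespace(text, pos):
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def parse_concatenated(chunk):
    """Parse every whitespace-separated JSON value in one chunk."""
    rows, pos = [], skip_whitespace(chunk, 0)
    while pos < len(chunk):
        row, pos = _decoder.raw_decode(chunk, pos)
        rows.append(row)
        pos = skip_whitespace(chunk, pos)
    return rows


def chunk_at_newlines(text, block_size):
    """Cheap path: end each block at its last newline (assumes one object per line)."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + block_size, len(text))
        if end < len(text):
            cut = text.rfind("\n", start, end)
            if cut != -1:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks


def chunk_at_object_ends(text, block_size):
    """Costlier path: scan the JSON so each chunk ends after a complete object."""
    chunks, start = [], 0
    while start < len(text):
        pos = skip_whitespace(text, start)
        # Consume whole objects until the block size is reached or input ends.
        while pos < len(text) and pos - start < block_size:
            _, pos = _decoder.raw_decode(text, pos)
            pos = skip_whitespace(text, pos)
        chunks.append(text[start:pos])
        start = pos
    return chunks


def parse_all(chunks):
    """Parse each chunk independently and concatenate the rows."""
    return [row for chunk in chunks for row in parse_concatenated(chunk)]


try:
    newline_result = parse_all(chunk_at_newlines(DATA, 16))
except json.JSONDecodeError as exc:
    # The first chunk ends in the middle of an object.
    newline_result = exc

boundary_result = parse_all(chunk_at_object_ends(DATA, 16))

print("newline chunking:", repr(newline_result))
print("object-boundary chunking:", boundary_result)
```

In this toy, a block large enough to hold the whole input hides the problem, which mirrors why the original report saw identical tables for both settings.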
   
   In your case, when `newlines_in_values=false`, you would get an error if you set `ReadOptions::block_size` to 64 (where the file size is 120). However, it would work just fine with `newlines_in_values=true`. 
   
   That being said, I'm not entirely sure why `newlines_in_values` isn't in `ReadOptions` instead. Looking at the C++ implementation, the option doesn't appear to be used by the parser at all.




[GitHub] [arrow] R-JunmingChen closed issue #35096: [Python] The parse_options parameter newlines_in_values doesn't work when reading JSON

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen closed issue #35096: [Python] The parse_options parameter newlines_in_values doesn't work when reading JSON
URL: https://github.com/apache/arrow/issues/35096

