Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/30 12:36:14 UTC

[GitHub] [arrow] cpcloud commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171166647

   Datasets that come from JSON-producing APIs often have unpredictable blob sizes, so it's hard to make objective statements about how often large rows occur that everyone will agree with.
   
   Even if we had a count of N datasets with "large" rows, who's to say whether that's frequent or not?
   
   The main point is, in the short term, to have a default `block_size` that's big enough to accommodate "unreasonably" large rows without forcing users to fiddle with it, and, in the medium to long term, to implement a solution using block resizing, or perhaps to explore using a streaming JSON parser that may allow a constant block size.
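
   For context, here's a minimal C++ sketch of the fiddling users currently have to do with the `arrow::json` reader (the file name is made up; the ~10 MB rows match the example below, and 16 MB is the new default this PR proposes, which may or may not suffice for other datasets):

   ```cpp
   #include <iostream>

   #include <arrow/io/file.h>
   #include <arrow/json/api.h>
   #include <arrow/memory_pool.h>
   #include <arrow/result.h>
   #include <arrow/status.h>
   #include <arrow/table.h>

   // Reads a JSON file whose individual rows are ~10 MB. With the current
   // 1 MB default block_size the reader can fail on rows larger than a
   // block, so the caller has to raise block_size by hand.
   arrow::Status ReadLargeRowJson() {
     ARROW_ASSIGN_OR_RAISE(
         auto input,
         arrow::io::ReadableFile::Open("election_results.json"));  // hypothetical file

     auto read_options = arrow::json::ReadOptions::Defaults();
     read_options.block_size = 16 << 20;  // 16 MB, the default proposed here

     auto parse_options = arrow::json::ParseOptions::Defaults();

     ARROW_ASSIGN_OR_RAISE(
         auto reader,
         arrow::json::TableReader::Make(arrow::default_memory_pool(), input,
                                        read_options, parse_options));
     ARROW_ASSIGN_OR_RAISE(auto table, reader->Read());
     std::cout << "read " << table->num_rows() << " rows" << std::endl;
     return arrow::Status::OK();
   }

   int main() {
     arrow::Status st = ReadLargeRowJson();
     if (!st.ok()) {
       std::cerr << st.ToString() << std::endl;
       return 1;
     }
     return 0;
   }
   ```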
   
   We're currently working with 2020 US election data pulled (at the time of the election) from the New York Times. Each row is about 10 MB of JSON. There's a _ton_ of nesting.

