Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/27 18:37:27 UTC

[GitHub] [arrow] saulpw opened a new pull request, #13442: ARROW-9612: [IO] increase default block_size from 1MB to 16MB

saulpw opened a new pull request, #13442:
URL: https://github.com/apache/arrow/pull/13442

   Parsing JSONL with large objects fails with the previous 1MB block_size,
   and experimentation is then required to find a workable value.
   These days JSON blobs are more frequently bigger than 1MB.
   16MB strikes a more reasonable balance between the amount of memory
   allocated for the buffer and the number of JSON files in the wild that
   can be parsed immediately and without friction.
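For illustration, here is a minimal pure-Python sketch (a simplified model, not Arrow's actual reader) of the failure mode: in block-based JSONL parsing, a record that outgrows the block can never be completed, which is the error reported in ARROW-9612.

```python
# Simplified model (not Arrow's implementation) of block-based JSONL parsing:
# a partial record at the end of a block is carried into the next block, and
# a record that outgrows the block size can never be completed.
import io
import json

def read_jsonl_blocks(buf, block_size):
    carry = b""
    while True:
        block = buf.read(block_size)
        if not block:
            break
        lines = (carry + block).split(b"\n")
        carry = lines.pop()  # the last piece may be an incomplete record
        if len(carry) > block_size:
            raise ValueError("straddling object straddles two block "
                             "boundaries (try to increase block size?)")
        for line in lines:
            if line:
                yield json.loads(line)
    if carry:
        yield json.loads(carry)

small = b'{"a": 1}\n{"a": 2}\n'
big = b'{"blob": "' + b"x" * 64 + b'"}\n'  # one 77-byte record

assert len(list(read_jsonl_blocks(io.BytesIO(small), 32))) == 2
# With a 32-byte block the 77-byte record raises; with 128 bytes it parses.
assert len(list(read_jsonl_blocks(io.BytesIO(big), 128))) == 1
```

The same shape of failure and workaround applies at 1MB vs. 16MB; only the scale changes.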


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171419265

   I still disagree. First I'll note that the error message is quite explicit about the workaround ("straddling object straddles two block boundaries (try to increase block size?)"). Second, it's not about "feeling better about cache locality", but potentially decreasing performance for actual users (which, unlike the users getting an error, will have no clue why performance may have decreased).
   
   If we didn't care about performance or resource consumption at all, we could just read the entire file in one block and that would solve the issue.




[GitHub] [arrow] ursabot commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

ursabot commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171266568

   Benchmark runs are scheduled for baseline = a227e50c0102898ff594d2da0479819272ecbfdd and contender = f74be371a9fd9bf640a331d8765a52797b9f0c0e. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e5e8eab946204a7097788c19deaab458...8848c3a8e85a4f60a2907e142e3a9119/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5f34065aa8ad4012988a48a1982d3d60...d15fa8ce5aab400096e656261111cb81/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/68e91f2b226e4993a269020626da1fe0...de2adcc3838b42b49f99350175a4da6a/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a1b04cdcf4a54798a7a99f555f8cfe7c...89c60386cfe741f28b7aabb6f6bc5814/)
   Buildkite builds:
   [Scheduled] [`f74be371` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1017)
   [Scheduled] [`f74be371` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1016)
   [Scheduled] [`f74be371` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1004)
   [Scheduled] [`f74be371` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1021)
   [Finished] [`a227e50c` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/991)
   [Failed] [`a227e50c` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/992)
   [Finished] [`a227e50c` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/981)
   [Finished] [`a227e50c` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/995)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] cpcloud commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171328009

   Got it. If we don't have any benchmarks for large files, how do we know either way beyond "the current default works for some use cases but not others"?




[GitHub] [arrow] github-actions[bot] commented on pull request #13442: ARROW-9612: [IO] increase default block_size from 1MB to 16MB

github-actions[bot] commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1167735500

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171288808

   IIRC we don't have any real-world JSON parsing benchmarks. There are some C++ micro-benchmarks, but I don't think they would help estimate scaling characteristics on large data files.




[GitHub] [arrow] saulpw commented on pull request #13442: ARROW-9612: [IO] increase default block_size from 1MB to 16MB

saulpw commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1167949244

   Thanks Weston.  I agree that handling large blobs without needing to muck with the block size would be ideal.  We can/should still do that at some point, and this PR just makes the experience a bit better in the interim.  Thanks for the speedy response!




[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171311595

   Hmm, sorry, perhaps I'm being dense but I don't understand how your question relates to what I said?




[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171499549

   I did argue that the UX improvement may actually be a regression for an (unknown, but most probably non-zero) number of users. Honestly, I don't think it's useful to reiterate all arguments again.
   
   If you want to publish a wrapper library to change a single setting so that users avoid an error message that precisely suggests trying to change that setting, then feel free to do so, though I'm struggling to understand what the benefit would be.




[GitHub] [arrow] saulpw commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

saulpw commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171467657

   > If we didn't care about performance or resource consumption at all, we could just read the entire file in one block and that would solve the issue.
   
   Let's not fall into slippery-slope strawmen, that is not what this PR is about at all.
   
   Is there any amount we could bump this number that you would be okay with?  Assuming a distribution of data sizes, even 2x would make this better for some group of people (though not me or the bug participants).  Basically, are we arguing over a value or a principle?  If the latter, then what is the principle at stake?  I'd like to know so I can develop my personal heuristic of when it's worthwhile to submit small UX improvements like this to Arrow, vs. e.g. creating a separate project that wraps Arrow and makes it nicer for users.




[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171233258

   Well, there's no reason that 16MB is a "nice" value; it happens to work in your case and may not work in another case. 
   
   > I don't think we're going to address this in general without implementing JSON reading for which block size is not a user-facing concern.
   
   Right, so it doesn't make sense to change the default just to accomodate a particular use case without addressing the others, especially as the performance impact hasn't been researched. I recommend closing this PR.
   




[GitHub] [arrow] cpcloud commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171320351

   > Hmm, sorry, perhaps I'm being dense but I don't understand how your question relates to what I said?
   
   No I'm probably not being clear.
   
   You said
   
   > IIRC we don't have any real-world JSON parsing benchmarks. There are some C++ micro-benchmarks, but I don't think they would help estimate scaling characteristics on large data files.
   
   If we don't have any real world benchmarks that can help estimate scaling characteristics on large files, then we should be able to tell whether the 16MB buffer size affects the micro benchmarks.
   
   In the case where a file is larger than 1MB, it's required to set the block size anyway, so I'm not following what the counterargument to this PR is regarding the performance of large files. You get what you get, because there's no other option just to have something that works at all.




[GitHub] [arrow] lidavidm commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

lidavidm commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171567588

   I don't want the takeaway from this to be that Arrow doesn't care about UX, or that reports will go ignored. I think this is a conscious choice to prioritize a foundational fix. (FWIW, the error message could be more explicit about what to do.) 
   
   I've filed ARROW-16944 to look at more reasonable benchmarks. 
   
   I'm not sure how much optimization work our JSON reader has had and if JSON is becoming more popular, it may be worth more attention (right now it's not even hooked into Datasets).




[GitHub] [arrow] westonpace commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

westonpace commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1170226163

   > Would the "better" approach be something like determine optimal block_size based on blob size unless manually specified on ReadOptions?
   
   @raulcd I think so.  Either grow the block size when a large blob is encountered or gather the blocks in a queue until ready to process.  It should be fairly straightforward for a single-threaded reader but might be more complicated if we have a multi-threaded JSONL reader (I don't recall which the current reader is).
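As a rough illustration of the "grow the block size when a large blob is encountered" option, here is a hedged pure-Python sketch (simplified, single-threaded, hypothetical helper names, not Arrow's actual code):

```python
# Hypothetical sketch of the "grow the block size" approach: when a block
# contains no complete record, double the block size and keep reading
# instead of raising an error. Single-threaded and simplified.
import io
import json

def read_jsonl_adaptive(buf, block_size):
    carry = b""
    while True:
        block = buf.read(block_size)
        data = carry + block
        if not data:
            break
        if block and b"\n" not in data:
            # No complete record yet: grow the block and read more.
            carry = data
            block_size *= 2
            continue
        lines = data.split(b"\n")
        carry = lines.pop()
        for line in lines:
            if line:
                yield json.loads(line)
        if not block:
            if carry:
                yield json.loads(carry)
            break

rows = b'{"a": 1}\n' + b'{"blob": "' + b"y" * 100 + b'"}\n' + b'{"a": 2}\n'
# An 8-byte initial block grows as needed; all three records parse.
assert len(list(read_jsonl_adaptive(io.BytesIO(rows), 8))) == 3
```

In this shape, block size becomes a starting point rather than a hard limit, which is what would remove it as a user-facing concern.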




[GitHub] [arrow] cpcloud commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171266451

   @ursabot please benchmark




[GitHub] [arrow] saulpw commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

saulpw commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171546626

   > If you want to publish a wrapper library to change a single setting 
   
   Arrow has many dozens or hundreds of frictions like this, and this PR was partly to see how easy it would be to improve things at the source.  I don't have the energy to fight for every 1% UX gain, because you're right, it's not worth it, users can "just" do the work to read the error message and try out different values and get past the friction themselves.  But I have a different vision, in which Arrow is more pleasant and convenient to use, and that requires systematically addressing frictions that real users experience. So in the interest of heading towards that vision with the resources I have and not creating unnecessary friction for Arrow maintainers, it's probably best for me to do this work on top of Arrow rather than within it.  In the parlance of git, Arrow is more like plumbing than porcelain, which is a good thing for me to keep in mind.
   
   Thanks to everyone for entertaining this PR and the ensuing discussion.  I'll close this PR and take a different approach to addressing the issue.




[GitHub] [arrow] lidavidm commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

lidavidm commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1168611829

   Can we file an issue for the 'better' approach?




[GitHub] [arrow] github-actions[bot] commented on pull request #13442: ARROW-9612: [IO] increase default block_size from 1MB to 16MB

github-actions[bot] commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1167735459

   https://issues.apache.org/jira/browse/ARROW-9612




[GitHub] [arrow] saulpw commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

saulpw commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171415516

   > In which datasets? Unless there are hundreds/thousands of fields per object, it doesn't sound that likely.
   
   Per the original issue ARROW-9612:
   
   > [abhishek bharani] I got this issue when json record was on single line with col length 2097323. 
   
   > [Pere-LluĂ­s Huguet Cabot] I have added a file that prompts the same error. It is a dump of wikipedia abstracts with wikidata information.
   
   So we don't have any "real" data on distribution of sizes, but we now have 3 people who encountered this error and felt enough friction to post an issue on JIRA (or submit an actual PR in my case).  Given what I know about how few people engage with OSS vs just suck it up and feel a bit worse about the project, this is significant enough for me to say "let's bump the default ~10x" as an immediate workaround.  One order of magnitude is a generally reasonable heuristic to bump in cases like this, so let's not entertain slippery slope arguments.  I did debate myself a bit about 8MB, 10MB, and 16MB, and if e.g. 8MB would make @pitrou feel better about cache locality, I can change the PR to that so we can move on from arguing about the workaround to implementing a longer-term fix.




[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171172497

   > Each row is about 10 MB of JSON.
   
   So 16MB is just barely adequate and may be too small for other similar datasets?
   
   Keep in mind that the block size is not merely used for type inference, it's used as a unit of work for batching and parallelization. A large value could be detrimental to performance.
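To make the batching/parallelization point concrete, here is an illustrative pure-Python sketch (hypothetical helpers, not Arrow's scheduler): the block size determines how many independent work units a file yields, so a 16x larger default can mean 16x fewer units available for parallel parsing.

```python
# Illustrative sketch (hypothetical helpers, not Arrow's code): blocks are
# the unit of parallel work, so a larger block size yields fewer work units.
import json
from concurrent.futures import ThreadPoolExecutor

def split_blocks(data, block_size):
    """Split newline-delimited data into blocks, extending each block to
    the next record boundary so no record straddles two blocks."""
    blocks, start = [], 0
    while start < len(data):
        end = min(start + block_size, len(data))
        nl = data.find(b"\n", end - 1)
        end = len(data) if nl == -1 else nl + 1
        blocks.append(data[start:end])
        start = end
    return blocks

def parse_block(block):
    return [json.loads(line) for line in block.splitlines() if line]

data = b"".join(b'{"row": %d}\n' % i for i in range(1000))

# The same file yields far fewer parallelizable units at 16x the block size.
assert len(split_blocks(data, 16 * 1024)) < len(split_blocks(data, 1024))

# Each block can then be parsed independently, e.g. on a thread pool.
with ThreadPoolExecutor() as pool:
    batches = list(pool.map(parse_block, split_blocks(data, 1024)))
assert sum(len(b) for b in batches) == 1000
```

For a file of small rows, the 1MB default keeps many such units in flight; raising the default shrinks that count for everyone, which is the regression being discussed.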




[GitHub] [arrow] westonpace commented on pull request #13442: ARROW-9612: [IO] increase default block_size from 1MB to 16MB

westonpace commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1167878337

   It's configurable, so presumably users with large blobs could always configure it larger.  This change is simple enough and I don't *think* this would have much impact on performance, but it would be nice if we had some kind of JSONL benchmark.
   
   I do like Wes' idea of handling large blobs without requiring reconfiguration of the default block size.
   
   I guess I'm +0 and if no one else has any other strong opinions in a few days then let's go with this.




[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171150086

   > These days JSON blobs are more frequently bigger than 1MB.
   
   In which datasets? Unless there are hundreds/thousands of fields per object, it doesn't sound that likely.
   To put it another way: if blobs are bigger than 1MB, then a mere one thousand rows will take more than 1GB. That's quite a lot.




[GitHub] [arrow] cpcloud commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171266379

   Let's see what the benchmarks say!




[GitHub] [arrow] westonpace commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

westonpace commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171684040

   Ideally our defaults should get 80% of the performance while supporting 80% of the use cases.  Unfortunately, in this case, we don't know much about either number.  Abandoning this change is probably the right decision at the moment but I agree with David that it would be nice if someone had the bandwidth to add JSON to datasets and add some benchmarks.




[GitHub] [arrow] saulpw closed pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

saulpw closed pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB
URL: https://github.com/apache/arrow/pull/13442




[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171334430

   We don't really. I think the basic design of having a fixed block size was copied from the CSV reader where 1MB was measured to be a good compromise on some actual files. More abstractly, the block size is meant to allow for decent CPU cache locality, for which 16MB feels a bit too large in the general case.
   
   In the end, it would be better to have significant macro-benchmarks, both for CSV and for JSON. But that needs someone to spend time adding those benchmarks...




[GitHub] [arrow] raulcd commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

raulcd commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1170144087

   Would the "better" approach be something like determine optimal block_size based on blob size unless manually specified on ReadOptions?




[GitHub] [arrow] cpcloud commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171166647

   Datasets that come from JSON-producing APIs often have unpredictable blob sizes, so it's not easy to make agreeably objective statements about frequency of occurrence.
   
   Even if we had a count of N datasets with "large" rows, who's to say whether that's frequent or not?
   
   The main point is in the short term to have a default `block_size` that's big enough to accommodate "unreasonably" large rows without forcing users to fiddle with it, and in the medium to long term to implement a solution using block resizing, or perhaps explore the use of a streaming JSON parser that may allow a constant block size.
   
   We're currently working with 2020 US election data pulled (at the time of the election) from the New York Times. Each row is about 10 MB of JSON. There's a _ton_ of nesting.




[GitHub] [arrow] cpcloud commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171183147

   > > Each row is about 10 MB of JSON.
   > 
   > So 16MB is just barely adequate and may be too small for other similar datasets?
   
   I guess? There's an additional 60% over 10MB, unclear without something concrete whether that's barely adequate for similar datasets.
   
   I don't think we're going to address this in general without implementing JSON reading for which block size is not a user-facing concern.
   
   > 
   > Keep in mind that the block size is not merely used for type inference, it's used as a unit of work for batching and parallelization. A large value could be detrimental to performance.
   
   It seems reasonable to trade off performance for being able to do anything at all. If I have to pass in the value anyway to get working code, I'm not yet thinking about performance.




[GitHub] [arrow] pitrou commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

pitrou commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171321955

   > If we don't have any real world benchmarks that can help estimate scaling characteristics on large files, then we should be able to tell whether the 16MB buffer size affects the micro benchmarks.
   
   Right, but that wouldn't tell us much otherwise.
   
   
   > In the case where a file has rows larger than 1MB, it's required to set the block size anyway, so I'm not following what the counter argument to this PR regarding the performance of large files
   
   A large file does not necessarily have large rows. Large CSV files can typically have very small rows, and I expect at least _some_ large JSON files to be similar.




[GitHub] [arrow] cpcloud commented on pull request #13442: ARROW-9612: [C++] increase default block_size from 1MB to 16MB

cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171307781

   I thought the concern was for the common case of < 1MB, not for the unknown-anywhere case. Am I mistaken?

