Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/23 16:15:39 UTC

[GitHub] [arrow] westonpace commented on pull request #35440: GH-34213: [C++] Use recursive calls without a delimiter if the user is doing a recursive GetFileInfo

westonpace commented on PR #35440:
URL: https://github.com/apache/arrow/pull/35440#issuecomment-1604505953

   I've now done some more thorough testing. I was surprised to see a big improvement even when running in EC2 in the same region as the bucket (where the latency should be very low). I tested with two datasets.
   
   The first dataset (s3://ursa-qa/wide-partition) contained 10,000 files, one per folder, spread across 10,000 folders nested two levels deep:
   
   ```
   x=0/y=0/file.parquet
   x=0/y=1/file.parquet
   ...
   x=0/y=99/file.parquet
   x=1/y=0/file.parquet
   ...
   x=99/y=99/file.parquet
   ```
   
   The second dataset (s3://ursa-qa/flat-partition) was 10,000 files in the root folder:
   
   ```
   file-0.parquet
   file-1.parquet
   ...
   file-9999.parquet
   ```
   
   For each test I timed how long it took to recursively list all files in the dataset. I ran the tests on my local desktop (outside of EC2), on an EC2 server in a different region (us-west-2), and on an EC2 server in the same region as the buckets (us-east-2). All times are in seconds.
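
   Each measurement was essentially one recursive `GetFileInfo` call on a `FileSelector`. The following is a minimal C++ sketch of that harness, not the exact script I ran; anonymous access and the hard-coded region are assumptions:

   ```
   #include <arrow/filesystem/filesystem.h>
   #include <arrow/filesystem/s3fs.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   #include <chrono>
   #include <iostream>
   #include <string>

   // Time a recursive listing of every file under `path`.
   arrow::Status TimeRecursiveListing(const std::string& path) {
     // Assumes the test bucket allows anonymous reads.
     arrow::fs::S3Options options = arrow::fs::S3Options::Anonymous();
     options.region = "us-east-2";  // assumed region of the test buckets
     ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::S3FileSystem::Make(options));

     arrow::fs::FileSelector selector;
     selector.base_dir = path;
     selector.recursive = true;

     auto start = std::chrono::steady_clock::now();
     ARROW_ASSIGN_OR_RAISE(auto infos, fs->GetFileInfo(selector));
     std::chrono::duration<double> elapsed =
         std::chrono::steady_clock::now() - start;
     std::cout << "Listed " << infos.size() << " entries in "
               << elapsed.count() << "s" << std::endl;
     return arrow::Status::OK();
   }

   int main() {
     arrow::fs::S3GlobalOptions global;
     global.log_level = arrow::fs::S3LogLevel::Fatal;
     if (!arrow::fs::InitializeS3(global).ok()) return 1;
     arrow::Status st = TimeRecursiveListing("ursa-qa/wide-partition");
     if (!st.ok()) std::cerr << st.ToString() << std::endl;
     return (arrow::fs::FinalizeS3().ok() && st.ok()) ? 0 : 1;
   }
   ```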
   
   There were (as hoped) significant performance improvements in the wide-partition dataset:
   
   ![Screenshot 2023-06-23 at 08-43-34 Online Graph Maker · Plotly Chart Studio](https://github.com/apache/arrow/assets/1696093/6819bdd9-30f8-48ec-b44d-792b335ee2ad)
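
   The improvement lines up with a back-of-the-envelope count of the LIST requests each approach needs for this layout (assuming S3's default page size of 1,000 keys per response):

   ```
   old (delimiter='/'): 1 (root) + 100 (x=*) + 10,000 (x=*/y=*) = 10,101 LIST calls
   new (no delimiter):  ceil(10,000 / 1,000)                    =     10 LIST calls
   ```

   The flat-partition dataset has no sub-prefixes to walk, so both approaches should need the same ~10 paginated calls there.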
   
   Regrettably, there may be a slight regression in the flat-partition dataset, although it is largely within the noise. I have run the tests enough times that I believe the result is stable:
   
   ![Screenshot 2023-06-23 at 08-44-36 Online Graph Maker · Plotly Chart Studio](https://github.com/apache/arrow/assets/1696093/53fcd6e0-1f3f-47ef-8f4e-f264ecd3770a)
   
   I've verified that, with both the old and new approaches, we send the same number of HTTP messages to S3, and the message content is very close (less than 300 bytes of difference). I also don't think the slowdown is additional compute time (otherwise I'd expect a worse relative regression on the AWS servers, where the network portion of the runtime is smaller).
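
   (For anyone who wants to reproduce that comparison: one way to capture the request/response details is to raise the AWS SDK log level when initializing S3. A sketch, assuming the SDK's default file logger:)

   ```
   #include <arrow/filesystem/s3fs.h>
   #include <arrow/status.h>

   // Raise the AWS SDK log level so the details of each HTTP request and
   // response are written to the SDK's log file.  Use Trace to also capture
   // the message bodies.
   arrow::Status InitS3WithLogging() {
     arrow::fs::S3GlobalOptions global;
     global.log_level = arrow::fs::S3LogLevel::Debug;
     return arrow::fs::InitializeS3(global);
   }
   ```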
   
   We could keep both styles (tree walking and flat recursive listing), but I don't think this regression is significant enough to justify the complexity.
   
   There is one other case that would likely regress: deeply partitioned data (e.g. each file is 4 or 5 folders deep) listed with a low user-specified max recursion. For example...
   
   ```
   a=0/b=0/c=0/d=0/e=0/file-0.parquet
   ...
   a=0/b=0/c=0/d=0/e=0/file-9999.parquet
   ```
   
   I would expect no regression if I fully listed the above dataset. However, if I listed it with a max_recursion of 2, the old approach would likely be much faster, since it only needs to return one file info (the new approach would return all 10k file infos and then pare them down in memory). I'm not aware of anyone relying on this use case (pyarrow doesn't even expose max_recursion), so I'm not sure it is worth the complexity of keeping both approaches. Even in this case I suspect we would be looking at a difference of 0.3 vs. 3 seconds, which is still better than our current worst case (~100 seconds).
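
   For illustration, that scenario would look something like this with the C++ `FileSelector` (the bucket and dataset path are hypothetical):

   ```
   #include <arrow/filesystem/filesystem.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   #include <iostream>

   // Shallow recursive listing of a deeply partitioned dataset
   // ("my-bucket/deep-dataset" is a hypothetical path).
   arrow::Status ListTwoLevels(arrow::fs::FileSystem& fs) {
     arrow::fs::FileSelector selector;
     selector.base_dir = "my-bucket/deep-dataset";
     selector.recursive = true;
     selector.max_recursion = 2;  // only descend two levels below base_dir
     // Old approach: delimited LISTs walk the tree and stop descending early.
     // New approach: one undelimited LIST returns every key, and the entries
     // deeper than max_recursion are pared down in memory afterwards.
     ARROW_ASSIGN_OR_RAISE(auto infos, fs.GetFileInfo(selector));
     std::cout << infos.size() << " entries within two levels" << std::endl;
     return arrow::Status::OK();
   }
   ```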
   

