Posted to issues@arrow.apache.org by "Zeeyi13 (via GitHub)" <gi...@apache.org> on 2024/03/22 06:51:59 UTC

[I] [Go] Need help on reading parquet from S3 [arrow]

Zeeyi13 opened a new issue, #40737:
URL: https://github.com/apache/arrow/issues/40737

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hi team,
   
   I would like to read a parquet file from S3 with high performance. Is there any hint or example for me to start with?
   
   I have tried writing a customized reader (internally it uses the S3 API to fetch a range of bytes) and passing it to the function `file.NewParquetReader()`, but the performance is far from ideal.
   
   ### Component(s)
   
   Go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Go] Need help on reading parquet from S3 [arrow]

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.

zeroshade commented on issue #40737:
URL: https://github.com/apache/arrow/issues/40737#issuecomment-2015700690

   12 minutes seems *really* bad, much worse than I'd expect. I've definitely seen better performance from S3 than that in the past, so I wonder where that time is being spent.
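
One way to find out where the time goes is to wrap the reader before handing it to the Parquet reader and count how many range reads are issued, how many bytes they return, and how long they take. The following is a minimal sketch using only the standard library; `countingReaderAt` and `standIn` are hypothetical names, and an in-memory string stands in for the S3-backed file so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync/atomic"
	"time"
)

// countingReaderAt (hypothetical name) wraps any io.ReaderAt and records
// the number of ReadAt calls, the bytes returned, and the time spent.
type countingReaderAt struct {
	inner  io.ReaderAt
	calls  atomic.Int64
	nbytes atomic.Int64
	nanos  atomic.Int64
}

// ReadAt forwards to the wrapped reader and updates the counters.
func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	start := time.Now()
	n, err := c.inner.ReadAt(p, off)
	c.calls.Add(1)
	c.nbytes.Add(int64(n))
	c.nanos.Add(int64(time.Since(start)))
	return n, err
}

// Stats summarises the counters collected so far.
func (c *countingReaderAt) Stats() string {
	return fmt.Sprintf("calls=%d bytes=%d time=%s",
		c.calls.Load(), c.nbytes.Load(), time.Duration(c.nanos.Load()))
}

// standIn is an offline substitute for the remote file (an assumption of
// this sketch); in practice the S3-backed reader would be wrapped instead.
func standIn(s string) io.ReaderAt { return strings.NewReader(s) }

func main() {
	c := &countingReaderAt{inner: standIn("hello parquet")}
	buf := make([]byte, 7)
	if _, err := c.ReadAt(buf, 6); err != nil {
		panic(err)
	}
	fmt.Println(string(buf), c.Stats())
}
```

A very large call count with a small average read size would point at per-request latency rather than bandwidth as the bottleneck.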



Re: [I] [Go] Need help on reading parquet from S3 [arrow]

Posted by "Zeeyi13 (via GitHub)" <gi...@apache.org>.

Zeeyi13 commented on issue #40737:
URL: https://github.com/apache/arrow/issues/40737#issuecomment-2015618826

   Thanks @zeroshade for the quick reply.
   
   I just tried the s3iofs file reader: a 140 MB file takes 12 minutes to read, versus about 14 s or less when reading from local disk. Some slowness is expected when reading from S3, but 12 minutes is too long for our application. I will have to check whether there is another way to improve the performance.
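
To put those figures in perspective, the effective throughput can be worked out directly from the numbers quoted above. This is only a back-of-the-envelope sketch; `throughputMBps` is a hypothetical helper, and the local time is taken as the full 14 s.

```go
package main

import "fmt"

// throughputMBps returns the effective throughput in MB/s for a transfer
// of sizeMB megabytes that took the given number of seconds.
func throughputMBps(sizeMB, seconds float64) float64 {
	return sizeMB / seconds
}

func main() {
	const sizeMB = 140.0
	s3 := throughputMBps(sizeMB, 12*60) // 12 minutes from S3
	local := throughputMBps(sizeMB, 14) // about 14 s from local disk
	fmt.Printf("S3: %.2f MB/s, local: %.2f MB/s, slowdown: %.0fx\n", s3, local, local/s3)
	// prints "S3: 0.19 MB/s, local: 10.00 MB/s, slowdown: 51x"
}
```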



Re: [I] [Go] Need help on reading parquet from S3 [arrow]

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.

zeroshade commented on issue #40737:
URL: https://github.com/apache/arrow/issues/40737#issuecomment-2015245062

   Personally, I would use https://github.com/wolfeidau/s3iofs to open the file, which internally leverages the S3 API to fetch the byte ranges, and just pass it to `file.NewParquetReader` as you suggested.
   
   I would only go down to creating your own page readers if you find the above isn't performant enough. It's unlikely that going down to that level would provide much in the way of performance gains. 
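
If a byte-range reader turns out to be slow, one possible cause (an assumption; nothing in this thread confirms it) is that many small reads each become a separate S3 request. A sketch of one mitigation is to put a chunk cache between the Parquet reader and the remote reader, so that small reads are served from large fetches. The code below uses only the standard library; `chunkedReaderAt`, `newChunkedReaderAt`, and `fakeObject` are hypothetical names, and `fakeObject` stands in for the S3 object so the example runs offline.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"sync"
)

// fakeObject stands in for the remote object; each ReadAt models one
// ranged GET request.
type fakeObject struct {
	data []byte
	gets int
}

func (f *fakeObject) ReadAt(p []byte, off int64) (int, error) {
	f.gets++
	return bytes.NewReader(f.data).ReadAt(p, off)
}

// chunkedReaderAt serves reads from fixed-size chunks fetched on demand.
// The cache is unbounded here; a real one would evict old chunks.
type chunkedReaderAt struct {
	inner     io.ReaderAt
	size      int64 // total object size in bytes
	chunkSize int64
	mu        sync.Mutex
	chunks    map[int64][]byte // chunk index -> chunk data
}

func newChunkedReaderAt(inner io.ReaderAt, size, chunkSize int64) *chunkedReaderAt {
	return &chunkedReaderAt{inner: inner, size: size, chunkSize: chunkSize,
		chunks: make(map[int64][]byte)}
}

// chunk returns the cached chunk idx, fetching it with one read if needed.
// The lock is held during the fetch, so fetches are serialized.
func (c *chunkedReaderAt) chunk(idx int64) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.chunks[idx]; ok {
		return b, nil
	}
	start := idx * c.chunkSize
	length := c.chunkSize
	if start+length > c.size {
		length = c.size - start
	}
	b := make([]byte, length)
	n, err := c.inner.ReadAt(b, start)
	if err != nil && !(err == io.EOF && int64(n) == length) {
		return nil, err
	}
	c.chunks[idx] = b
	return b, nil
}

// ReadAt implements io.ReaderAt on top of the chunk cache.
func (c *chunkedReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, errors.New("chunkedReaderAt: negative offset")
	}
	n := 0
	for n < len(p) {
		pos := off + int64(n)
		if pos >= c.size {
			return n, io.EOF
		}
		b, err := c.chunk(pos / c.chunkSize)
		if err != nil {
			return n, err
		}
		n += copy(p[n:], b[pos%c.chunkSize:])
	}
	return n, nil
}

func main() {
	src := &fakeObject{data: []byte("0123456789abcdef")}
	r := newChunkedReaderAt(src, int64(len(src.data)), 8)
	buf := make([]byte, 2)
	for off := int64(0); off < 8; off += 2 {
		if _, err := r.ReadAt(buf, off); err != nil {
			panic(err)
		}
	}
	fmt.Println("reads: 4, remote fetches:", src.gets) // prints "reads: 4, remote fetches: 1"
}
```

The wrapper only implements `io.ReaderAt`. If the Parquet reader also requires `Seek` (worth checking against the signature of `file.NewParquetReader`), the standard library's `io.NewSectionReader(r, 0, size)` adds `Read` and `Seek` on top of any `io.ReaderAt`. The chunk size is a tuning knob: larger chunks mean fewer requests but more bytes fetched that may never be used.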

