
Posted to issues@arrow.apache.org by "collimarco (via GitHub)" <gi...@apache.org> on 2023/03/29 21:10:36 UTC

[GitHub] [arrow] collimarco opened a new issue, #34781: Query a parquet file on S3 without downloading it (using byte ranges)

collimarco opened a new issue, #34781:
URL: https://github.com/apache/arrow/issues/34781

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I am new to this project and I am looking at the Ruby gem in particular:
   
   https://github.com/apache/arrow/tree/main/ruby
   
   If you load **a large Parquet file stored in S3**:
   
   ```ruby
   require 'arrow-dataset'
   
   s3_uri = URI('s3://bucket/public.parquet')
   Arrow::Table.load(s3_uri)
   ```
   
   Can you query it without downloading the entire file? Is it possible to download only the relevant parts using byte ranges? Or will the above command always download the entire file, with no alternative?
   
   Thanks
   
   ### Component(s)
   
   Ruby


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] kou closed issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou closed issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)
URL: https://github.com/apache/arrow/issues/34781



[GitHub] [arrow] collimarco commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "collimarco (via GitHub)" <gi...@apache.org>.

collimarco commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1561045360

   @kou 
   
   I am testing this code:
   
   ```ruby
   table = Arrow::Table.load('sample.parquet', format: :parquet, filter: [:equal, :status, 200])
   puts table
   ```
   
   The problem is that it prints rows that don't match the filter. For example, it also prints rows that have status 302 or 500. Why? Is that a bug, or did I misunderstand something?



[GitHub] [arrow] collimarco commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "collimarco (via GitHub)" <gi...@apache.org>.

collimarco commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1500078801

   @kou Thanks for the reply!
   
   > We can't use byte ranges directly but we can use condition push down to reduce download size
   
   Can you please clarify this? Aren't they the same thing? You are basically avoiding downloading the entire file and reading it partially, so I guess the partial download uses byte ranges under the hood.



[GitHub] [arrow] collimarco commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "collimarco (via GitHub)" <gi...@apache.org>.

collimarco commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1504907787

   > So we don't need to download entire data.
   
   Great. 
   
   > Could you explain your use case?
   
   I would like to store a large parquet file on S3 and search on it with some conditions without having to download it entirely.
   



[GitHub] [arrow] collimarco commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "collimarco (via GitHub)" <gi...@apache.org>.

collimarco commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1562775255

   @kou Thanks for the reply. Your code gives me this error:
   
   ```
   /Users/example/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/red-arrow-12.0.0/lib/arrow/table-loader.rb:61:in `load': Arrow::Table load source must be one of [uri_https, file, uri_http]: #<URI::Generic sample.parquet> (ArgumentError)
   ```
   
   There is a `sample.parquet` in the local directory, just for testing. `filter` doesn't work for local files?



[GitHub] [arrow] kou commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1562275895

   Ah, you need to use `URI` instead of `String` to use `filter:` (and `format: :parquet` is redundant when the file name has a `.parquet` extension):
   
   ```ruby
   table = Arrow::Table.load(URI('sample.parquet'), filter: [:equal, :status, 200])
   puts table
   ```
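   
   As a small aside on why the argument type matters: Ruby's standard `URI()` turns a bare file name into a scheme-less `URI::Generic`, which is the object that appears in the `ArgumentError` quoted elsewhere in this thread. The sketch below uses only the standard library; `describe_source` is a hypothetical helper name for illustration and is not part of Red Arrow.
   
   ```ruby
   require 'uri'
   
   # Hypothetical helper (not a Red Arrow API): summarize how a source parses.
   def describe_source(source)
     uri = URI(source)
     { class: uri.class, scheme: uri.scheme, host: uri.host, path: uri.path }
   end
   
   # A bare file name has no scheme or host, only a relative path.
   local = describe_source('sample.parquet')
   # An S3 location carries the bucket as host and the object key as path.
   remote = describe_source('s3://bucket/public.parquet')
   
   p local
   p remote
   ```
   
   Inspecting the parsed URI this way can help confirm what is actually being handed to `Arrow::Table.load` before looking at the loader itself.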



[GitHub] [arrow] kou commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1500095673

   The internal S3 filesystem implementation uses byte ranges: https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L1103-L1104
   But do you really want to know about it?
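   
   For context, an HTTP byte-range request names an inclusive first and last byte position. A minimal sketch of building such a header value is shown here; `byte_range_header` is a hypothetical helper for illustration only and does not mirror how the linked C++ code is written.
   
   ```ruby
   # Hypothetical helper: format an HTTP Range header value covering
   # `length` bytes starting at `offset`. Both ends are inclusive.
   def byte_range_header(offset, length)
     raise ArgumentError, 'length must be positive' unless length.positive?
   
     "bytes=#{offset}-#{offset + length - 1}"
   end
   
   puts byte_range_header(0, 8)        # bytes=0-7
   puts byte_range_header(1004, 1000)  # bytes=1004-2003
   ```
   
   With Red Arrow these requests are issued inside the filesystem layer, so user code does not build them by hand.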



[GitHub] [arrow] kou commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1499781524

   Sorry, I missed this.
   
   We can't use byte ranges directly but we can use condition push down to reduce download size.
   
   We don't provide a DSL to build conditions yet (#29928), but the following will work:
   
   ```ruby
   require 'arrow-dataset'
   
   s3_uri = URI('s3://bucket/public.parquet')
   Arrow::Table.load(s3_uri, filter: [:equal, :column_name, 100]) # "column_name == 100"
   ```
   
   See https://arrow.apache.org/docs/cpp/compute.html for available functions.



[GitHub] [arrow] kou commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1504918832

   OK. Then you can use condition push down.



[GitHub] [arrow] kou commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1563483369

   Ah, sorry. I forgot to mention explicitly: You need to add `require "arrow-dataset"` to use `filter:`.



[GitHub] [arrow] collimarco commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "collimarco (via GitHub)" <gi...@apache.org>.

collimarco commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1564506092

   @kou Thanks! It works!
   
   The only downside is that in my benchmark on a 100M log file I don't see any performance gains.
   
   Original Ruby program:
   
   ```ruby
   require 'arrow'
   require 'parquet'
   
   table = Arrow::Table.load('sample.parquet', format: :parquet)
   puts table.slice { |slicer| slicer['status'] == 200 }
   ```
   
   Improved Ruby program:
   
   ```ruby
   require 'arrow'
   require 'parquet'
   require 'arrow-dataset'
   
   file = URI('sample.parquet')
   table = Arrow::Table.load(file, filter: [:equal, :status, 200])
   puts table
   ```
   
   Measuring with the `time` command I get `~ 1.2s` for both versions.
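   
   Note that the `time` command covers the whole process, including Ruby startup and the `require` calls, which may mask a difference between the two load calls. One way to time only the load step is sketched below; `measure` is a hypothetical helper, and the summation is a stand-in workload where the `Arrow::Table.load` call would go.
   
   ```ruby
   # Hypothetical helper: run a block and return [result, elapsed_seconds],
   # using the monotonic clock so wall-clock adjustments do not interfere.
   def measure
     started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
     result = yield
     [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
   end
   
   # Stand-in workload. In the benchmark this would be the
   # Arrow::Table.load call, with or without filter:.
   result, seconds = measure { (1..1_000).sum }
   puts format('%.6f s', seconds)
   ```
   
   Running each variant several times and comparing the measured load times would give a fairer picture than a single `time` run.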



[GitHub] [arrow] kou commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1565152775

   Could you try a Parquet file on S3 too?



[GitHub] [arrow] kou commented on issue #34781: [Ruby] Query a parquet file on S3 without downloading it (using byte ranges)

Posted by "kou (via GitHub)" <gi...@apache.org>.

kou commented on issue #34781:
URL: https://github.com/apache/arrow/issues/34781#issuecomment-1500093479

   If we use condition push down, we can reduce target data in `.parquet`. So we don't need to download entire data.
   
   Could you explain your use case?
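   
   To illustrate the idea, here is a simplified model of how a pushed-down condition can narrow what is read. It assumes Parquet-style row groups that each carry min/max statistics for a column; `RowGroup`, `ranges_to_fetch`, and the offsets are hypothetical and are not Arrow's actual data structures or code.
   
   ```ruby
   # Hypothetical model of a row group: where it sits in the file and the
   # min/max statistics recorded for the filtered column.
   RowGroup = Struct.new(:offset, :length, :min, :max)
   
   # Return the byte ranges of row groups that may contain `value`.
   # A row group whose [min, max] statistics exclude the value is skipped.
   def ranges_to_fetch(row_groups, value)
     row_groups
       .select { |rg| value.between?(rg.min, rg.max) }
       .map { |rg| rg.offset...(rg.offset + rg.length) }
   end
   
   row_groups = [
     RowGroup.new(4, 1000, 200, 200),     # only status 200
     RowGroup.new(1004, 1000, 301, 404),  # cannot contain 200, skipped
     RowGroup.new(2004, 1000, 200, 500),  # may contain 200
   ]
   
   ranges = ranges_to_fetch(row_groups, 200)
   total = ranges.sum(&:size)
   puts "fetch #{ranges.inspect} (#{total} of 3000 bytes)"
   # prints: fetch [4...1004, 2004...3004] (2000 of 3000 bytes)
   ```
   
   In this model the third row group is still fetched in full even though it also holds non-matching rows, so the condition has to be applied again to the rows after reading. The saving therefore depends on how the values are distributed across row groups.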

