Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/03 04:58:19 UTC

[GitHub] [arrow] JayjeetAtGithub commented on pull request #10431: ARROW-12921: [C++][Dataset] Add RadosParquetFileFormat to Dataset API

JayjeetAtGithub commented on pull request #10431:
URL: https://github.com/apache/arrow/pull/10431#issuecomment-891525546


   Thanks @westonpace for sharing your thoughts.
   
   > So here is my current understanding. Let me know if this seems off. There are two pieces to this.
   > 
   > There is a ceph object class (called Skyhook?) which processes scan tasks and lives in a "contrib" directory.
   > 
   > There is a fragment / file format for Arrow that understands how to send scan requests to a ceph storage server in the skyhook format.
   > 
   That is correct.
   
   > These two components aren't tightly coupled. The only source of agreement is the Arrow columnar format and this flatbuffers file. So for example (these are thought exercises, not things that will necessarily ever happen):
   > 
   > * Ceph could be running an older version of Skyhook built with Arrow version X and the dataset client could be running a newer version of Arrow version X+N.
   
   Yeah, this could happen. In this case, we need to ensure that the storage side understands the `ScanRequest` format in which the client sends requests, and that it can serialize tables in a buffer format the client understands and supports.
   
   > * Skyhook could switch to some other library entirely in the future and as long as it continued to respect the flatbuffers format it would continue to work.
   
   Similar to the above, I guess.
   
   > * A different non-arrow library (or an Arrow implementation in a different language) could decide to start sending requests to Skyhook and as long as they agreed upon the flatbuffers and arrow columnar format everything would continue to work.
   
   Yes, as long as both the client and server agree on the same send and receive protocol, they should work fine.
   
   > Given the above I think the proper place for this flatbuffers file to live is in the same directory as the ceph object class. This flatbuffers file is the API for skyhook.
   
   I agree. This is the send API for skyhook.
   
   > Then, for building everything, the make files for that directory could produce two artifacts: A ceph object class and a small C++ "client library" which is just the output of the flatbuffers compiler.
   > Or you could skip the "client library" step and add an extra build step for the datasets module which runs the flatbuffers compiler.
   
   Could you please explain this part a little more?

