Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/05/18 03:34:42 UTC

[GitHub] [arrow] wjones127 commented on pull request #35568: GH-33986: [Python] Sketch out a minimal protocol interface for datasets

wjones127 commented on PR #35568:
URL: https://github.com/apache/arrow/pull/35568#issuecomment-1552355842

   > I'm not sure the API you are defining helps you with that goal. I think what is missing is the API used to create the dataset. What you've proposed here isn't flexible enough. For example, if I'm trying to convert a "named table request" (e.g. give me all rows from table "widgets" with filter "xyz" at time point Y) into a "scan request" (e.g. what pyarrow datasets can read) then I want something like...
   
   @westonpace you are correct that this doesn't define how such dataset classes are built. That's left to the consumer, who will write their own classes that conform to this API.
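
To illustrate what "conforming" could look like in practice, here is a minimal sketch using Python's `typing.Protocol`. The names `Scannable`, `scan`, `InMemoryWidgets`, and `count_rows` are hypothetical and are not the interface proposed in this PR. Rows are plain dicts rather than Arrow record batches, purely so the example stays self-contained.

```python
from typing import Dict, Iterator, List, Optional, Protocol, runtime_checkable

Row = Dict[str, object]


@runtime_checkable
class Scannable(Protocol):
    """Hypothetical minimal interface (an assumption, not the PR's protocol)."""

    def scan(self, columns: Optional[List[str]] = None) -> Iterator[List[Row]]:
        ...


class InMemoryWidgets:
    """A consumer-written class. It conforms structurally and does not
    inherit from Scannable."""

    def __init__(self, rows: List[Row], batch_size: int = 2) -> None:
        self._rows = rows
        self._batch_size = batch_size

    def scan(self, columns: Optional[List[str]] = None) -> Iterator[List[Row]]:
        # Yield the rows in fixed-size batches, optionally projecting columns.
        for start in range(0, len(self._rows), self._batch_size):
            batch = self._rows[start:start + self._batch_size]
            if columns is not None:
                batch = [{c: row[c] for c in columns} for row in batch]
            yield batch


def count_rows(dataset: Scannable) -> int:
    # Code written against the protocol only relies on scan().
    return sum(len(batch) for batch in dataset.scan())


widgets = InMemoryWidgets(
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
)
```

Under these assumptions, `count_rows(widgets)` works without `InMemoryWidgets` importing or subclassing anything from the library that defines the protocol, which is the property the protocol approach is after.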
   
   However, I do like your idea for a dataset builder. I think it might be worth asking the PyIceberg developers whether something like that would work well for them. (I think Delta Lake and Lance will likely go the route of implementing their own classes in Rust.) I've noted this idea in https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit#heading=h.31rf5m1tlipg

