Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/15 18:39:28 UTC

[GitHub] [arrow] bkietz opened a new pull request #7438: ARROW-9105: [C++][Dataset][Python] Infer partition schema from partition expression

bkietz opened a new pull request #7438:
URL: https://github.com/apache/arrow/pull/7438


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7438: ARROW-9105: [C++][Dataset][Python] Infer partition schema from partition expression

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7438:
URL: https://github.com/apache/arrow/pull/7438#issuecomment-644317249


   https://issues.apache.org/jira/browse/ARROW-9105





[GitHub] [arrow] fsaintjacques commented on pull request #7438: ARROW-9105: [C++][Dataset][Python] Infer partition schema from partition expression

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on pull request #7438:
URL: https://github.com/apache/arrow/pull/7438#issuecomment-644747978


   I agree with option 3; it aligns with the other methods exposed.





[GitHub] [arrow] jorisvandenbossche commented on pull request #7438: ARROW-9105: [C++][Dataset][Python] Infer partition schema from partition expression

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7438:
URL: https://github.com/apache/arrow/pull/7438#issuecomment-644732807


   I think we talked before about the difference between a "physical" schema and a "reader" (dataset) schema.
   Right now a Fragment only knows about the physical schema, while here we need to know the dataset schema. To get it, we could 1) infer it from the partition expression, as you do in this PR, 2) keep (optionally) a reference to the dataset schema on the Fragment, or 3) let the user pass the schema.
   
   We actually already do this third option for `Fragment.scan/to_table/to_batches()`,
   which I had forgotten when opening the issue. The example I showed there, calling `to_table` on a fragment, raises an error:
   
   ```
   In [34]: fragment.to_table(filter=ds.field("part") == "A").to_pandas() 
   ...
   ArrowInvalid: Field named 'part' not found or not unique in the schema.
   ```
   
   this actually works fine if you specify the dataset schema:
   
   ```
   In [38]: fragment.to_table(filter=ds.field("part") == "A", schema=dataset.schema).to_pandas()
   Out[38]: 
      dummy part
   0      1    A
   1      1    A
   ```
   
   So the better solution might be to do something similar for `SplitByRowGroup`?
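
A rough sketch of why passing the dataset schema helps, using plain dicts rather than pyarrow. The names `bind_filter` and `scan_fragment` are hypothetical and exist only for this illustration; they are not part of the Arrow API, and the real lookup logic may differ.

```python
# Toy model (NOT pyarrow): a schema is a dict mapping field name -> type name.
physical_schema = {"dummy": "int64"}    # columns stored in the file itself
partition_schema = {"part": "string"}   # fields that only exist in the partition path
dataset_schema = {**physical_schema, **partition_schema}


def bind_filter(field_name, schema):
    """Resolve a filter's field against a schema (hypothetical helper)."""
    if field_name not in schema:
        raise KeyError(f"Field named {field_name!r} not found in the schema.")
    return schema[field_name]


def scan_fragment(filter_field, schema=None):
    """Hypothetical scan: fall back to the physical schema when none is passed."""
    effective = physical_schema if schema is None else schema
    return bind_filter(filter_field, effective)
```

In this model, `scan_fragment("part")` fails because the physical schema has no `part` field, while `scan_fragment("part", schema=dataset_schema)` resolves it, which mirrors the two `to_table` calls above.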
   





[GitHub] [arrow] fsaintjacques closed pull request #7438: ARROW-9105: [C++][Dataset][Python] Pass an explicit schema to split_by_row_groups

Posted by GitBox <gi...@apache.org>.
fsaintjacques closed pull request #7438:
URL: https://github.com/apache/arrow/pull/7438


   





[GitHub] [arrow] bkietz commented on pull request #7438: ARROW-9105: [C++][Dataset][Python] Infer partition schema from partition expression

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #7438:
URL: https://github.com/apache/arrow/pull/7438#issuecomment-644742632


   That's doable, and a more minimal change. The schema option would only be relevant to Python, since that's where implicit casts are inserted and therefore where we'd need the extra schema information. I'll refactor to use that approach.

