
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/16 12:31:59 UTC

[GitHub] [arrow] jorisvandenbossche commented on pull request #7438: ARROW-9105: [C++][Dataset][Python] Infer partition schema from partition expression

jorisvandenbossche commented on pull request #7438:
URL: https://github.com/apache/arrow/pull/7438#issuecomment-644732807


   I think we talked before about the difference between a "physical" schema and a "reader" (dataset) schema.
   Right now a Fragment only knows its physical schema, while here we need the dataset schema. To get it, we could 1) infer it from the partition expression, as you do in this PR, 2) (optionally) keep a reference to the dataset schema on the Fragment, or 3) let the user pass the schema.
   
   We actually already do this third option for `Fragment.scan/to_table/to_batches()`, which I had forgotten when opening the issue.
   For the example I showed there, where `to_table` on a fragment raises an error:
   
   ```
   In [34]: fragment.to_table(filter=ds.field("part") == "A").to_pandas() 
   ...
   ArrowInvalid: Field named 'part' not found or not unique in the schema.
   ```
   
   this actually works fine if you specify the dataset schema:
   
   ```
   In [38]: fragment.to_table(filter=ds.field("part") == "A", schema=dataset.schema).to_pandas()
   Out[38]: 
      dummy part
   0      1    A
   1      1    A
   ```
   
   So the better solution might be to do something similar for `SplitByRowGroup`?
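
To make the physical vs. dataset schema distinction concrete, here is a rough sketch of what option 1 amounts to. It is a toy model and not pyarrow code: plain dicts stand in for Arrow schemas, the partition expression is simplified to a dict of equality constraints, and every function name is made up for illustration.

```python
# Toy model only: plain dicts stand in for Arrow schemas, and none of these
# helper names are pyarrow APIs.

def infer_partition_fields(partition_expression):
    """Option 1: derive field types from equality constraints like {"part": "A"}."""
    type_names = {str: "string", int: "int64", float: "double", bool: "bool"}
    return {
        name: type_names[type(value)]
        for name, value in partition_expression.items()
    }


def dataset_schema_for(physical_schema, partition_expression):
    """Physical fields first, then partition fields the file does not store."""
    schema = dict(physical_schema)
    for name, type_name in infer_partition_fields(partition_expression).items():
        schema.setdefault(name, type_name)
    return schema


def resolve_field(schema, name):
    """Mimic the lookup that fails when a filter references a missing field."""
    if name not in schema:
        raise KeyError(f"Field named '{name}' not found in the schema.")
    return schema[name]


physical_schema = {"dummy": "int64"}   # what the file itself stores
partition_expression = {"part": "A"}   # simplified stand-in for part == "A"

dataset_schema = dataset_schema_for(physical_schema, partition_expression)

# Resolving "part" against the dataset schema succeeds; resolving it against
# the physical schema alone raises KeyError, like the error shown above.
part_type = resolve_field(dataset_schema, "part")
```

With option 3 the user would supply `dataset_schema` directly instead of having it inferred, which is what passing `schema=dataset.schema` does in the working `to_table` call above.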
   

