Posted to issues@arrow.apache.org by "aersam (via GitHub)" <gi...@apache.org> on 2023/07/10 13:17:44 UTC

[GitHub] [arrow] aersam opened a new issue, #36593: Add rename_columns to DataSet

aersam opened a new issue, #36593:
URL: https://github.com/apache/arrow/issues/36593

   ### Describe the enhancement requested
   
   Dataset has fewer methods than Table, which is fine, of course. However, we often use rename_columns on Table, and it would be really handy for us to have it on Dataset, too. I think it could easily be implemented using replace_schema.
   
   https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset 
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] aersam commented on issue #36593: [Python] Add rename_columns to DataSet

Posted by "aersam (via GitHub)" <gi...@apache.org>.

aersam commented on issue #36593:
URL: https://github.com/apache/arrow/issues/36593#issuecomment-1662215807

   It seems that using replace_schema does not work. The dataset always uses the schema's column names to query the Parquet files, so those names must match the ones in the physical files. What is really needed is a separation between physical and logical column names. This would be really great, especially since Parquet is a bit limited in which column names are allowed.
   The best option would be a "column mapping" in the [fragment](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Fragment.html) that maps the schema's column names to physical column names. This would allow querying Parquet files that use different physical column names for the same logical column. I guess that gets a bit complex when it comes to filters, but it would still be great.
   
   If we wanted to abstract Apache Iceberg or Delta Lake tables with the dataset, this would be needed (both support this kind of column mapping).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Add rename_columns to DataSet [arrow]

Posted by "ion-elgreco (via GitHub)" <gi...@apache.org>.

ion-elgreco commented on issue #36593:
URL: https://github.com/apache/arrow/issues/36593#issuecomment-1878426156

   @wjones127 do you think this is something that can be added?



Re: [I] [Python] Add rename_columns to DataSet [arrow]

Posted by "davlee1972 (via GitHub)" <gi...@apache.org>.

davlee1972 commented on issue #36593:
URL: https://github.com/apache/arrow/issues/36593#issuecomment-1753963333

   Shouldn't dataset() just take the same parameters as to_table()? The difference is that to_table() produces a materialized view, while dataset() creates a logical view. This would include the columns, whether they are a subset of the original dataset or computed/renamed from it.
   
   I'm also not sure why there is a dataset.scanner class. It looks logical, but you can't perform a join with it against another dataset.
   

