Posted to issues@arrow.apache.org by "grisaitis (via GitHub)" <gi...@apache.org> on 2023/03/14 02:43:37 UTC

[GitHub] [arrow] grisaitis opened a new issue, #34554: Best practice for adding a column for `__filename` without adding a duplicate field?

grisaitis opened a new issue, #34554:
URL: https://github.com/apache/arrow/issues/34554

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   #### my problem
   I have a `Dataset` and want to include the `__filename` field when creating a `Table` from it. I've tried a few things:
   
   #### attempt 1: use `dataset.scanner`
   
   ```python
   columns = dataset.schema.names + ["__filename"]  # ["__last_in_fragment"]
   scanner = dataset.scanner(columns=columns)
   my_table = scanner.to_table()
   ```
   
   But with this, I get a duplicate column error when I try to query the table with duckdb:
   ```python
   import duckdb
   con = duckdb.connect()
   # duckdb resolves `my_table` from the Python variable of that name
   con.execute("SELECT * FROM my_table limit 5").fetch_df()
   ```
   ```
   duckdb.InvalidInputException: Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result
   Error: Invalid Error: ArrowInvalid: Multiple matches for FieldRef.Name(__filename)
   ```
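One way to narrow down the error above is to check whether the scanned table's schema really carries the same name twice. The helper below is a minimal sketch, not part of pyarrow: `find_duplicate_names` is a hypothetical name, and it only inspects a plain list of column names such as `my_table.schema.names`.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


# Hypothetical usage with the table from attempt 1:
#   find_duplicate_names(my_table.schema.names)
print(find_duplicate_names(["a", "b", "__filename", "__filename"]))
```

If this comes back empty for the real table, that would suggest the table itself holds only one `__filename` column and the clash arises later, for instance when the table is scanned again.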
   
   #### attempt 2: use a dict of columns
   I have also tried creating the new column by specifying columns as expressions, but `ds.field("__filename")` is not recognized:
   
   ```python
   import pyarrow.dataset as ds

   columns = {x: ds.field(x) for x in dataset.schema.names}
   columns["filename"] = ds.field("__filename")
   scanner = dataset.scanner(columns=columns)  # raises the ArrowInvalid shown below
   ```
   
   ```
   Traceback (most recent call last)
   ----> scanner = dataset.scanner(columns=columns)
   /opt/conda/envs/myenv/lib/python3.11/site-packages/pyarrow/_dataset.pyx:336, in pyarrow._dataset.Dataset.scanner()
   /opt/conda/envs/myenv/lib/python3.11/site-packages/pyarrow/_dataset.pyx:2576, in pyarrow._dataset.Scanner.from_dataset()
   /opt/conda/envs/myenv/lib/python3.11/site-packages/pyarrow/_dataset.pyx:2484, in pyarrow._dataset.Scanner._make_scan_options()
   /opt/conda/envs/myenv/lib/python3.11/site-packages/pyarrow/_dataset.pyx:2381, in pyarrow._dataset._populate_builder()
   /opt/conda/envs/myenv/lib/python3.11/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
   ArrowInvalid: No match for FieldRef.Name(__filename) in <schema...>
   ```
   
   #### tl;dr
   What is the best practice for adding fields like `__filename` to a table while avoiding duplicate column errors?
   
   #### related issues
   
   This might be related to:
   - https://github.com/apache/arrow/issues/24407
   - https://github.com/dask/dask/issues/9251
   
   
   #### versions
   
   ```
   Linux-4.19.0-23-cloud-amd64-x86_64-with-glibc2.28
   Python 3.11.0 | packaged by conda-forge (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
   pyarrow 11.0.0
   ```
   
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Best practice for adding a column for `__filename` without adding a duplicate field? [arrow]

Posted by "dhirschfeld (via GitHub)" <gi...@apache.org>.
dhirschfeld commented on issue #34554:
URL: https://github.com/apache/arrow/issues/34554#issuecomment-1996646078

   Happy 1-year anniversary to this issue! :tada: 🎂 




[GitHub] [arrow] westonpace commented on issue #34554: Best practice for adding a column for `__filename` without adding a duplicate field?

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34554:
URL: https://github.com/apache/arrow/issues/34554#issuecomment-1473833579

   We can probably call this a bug.  Both of those approaches should work.  In the first approach, does `my_table` load successfully?  How are you connecting the loaded table with duckdb?




Re: [I] Best practice for adding a column for `__filename` without adding a duplicate field? [arrow]

Posted by "dhirschfeld (via GitHub)" <gi...@apache.org>.
dhirschfeld commented on issue #34554:
URL: https://github.com/apache/arrow/issues/34554#issuecomment-1996637472

   > Seems to work, though I'm not sure how efficient it is :/
   
   Needless to say, it would be good to have this natively (and efficiently!) supported by `pyarrow` itself.
   
   One workaround would be to allow columns to be renamed in the scanner, similarly to how I'm doing it above but doing it on the fly rather than reiterating and mutating the dataset after the fact.
   
   




[GitHub] [arrow] jwolosiuk commented on issue #34554: Best practice for adding a column for `__filename` without adding a duplicate field?

Posted by "jwolosiuk (via GitHub)" <gi...@apache.org>.
jwolosiuk commented on issue #34554:
URL: https://github.com/apache/arrow/issues/34554#issuecomment-1721058550

   Hi, is there any update on this, or any workaround? I got the same problem with version 13.0.0




Re: [I] Best practice for adding a column for `__filename` without adding a duplicate field? [arrow]

Posted by "dhirschfeld (via GitHub)" <gi...@apache.org>.
dhirschfeld commented on issue #34554:
URL: https://github.com/apache/arrow/issues/34554#issuecomment-1975401506

   Also running into this 😔 




Re: [I] Best practice for adding a column for `__filename` without adding a duplicate field? [arrow]

Posted by "zxexz (via GitHub)" <gi...@apache.org>.
zxexz commented on issue #34554:
URL: https://github.com/apache/arrow/issues/34554#issuecomment-1893149883

   I'm still running into this issue once every couple of weeks. I've not been able to find a workaround beyond renaming things upstream.




Re: [I] Best practice for adding a column for `__filename` without adding a duplicate field? [arrow]

Posted by "dhirschfeld (via GitHub)" <gi...@apache.org>.
dhirschfeld commented on issue #34554:
URL: https://github.com/apache/arrow/issues/34554#issuecomment-1996620231

   Seems to work, though I'm not sure how efficient it is :/
   ```python
   import pyarrow as pa
   import pyarrow.dataset  # makes the pa.dataset submodule available
   
   
   def rename_columns(schema: pa.Schema, **name_mapping) -> pa.Schema:
       # Return a copy of the schema with fields renamed per name_mapping.
       return pa.schema(
           field.with_name(name_mapping.get(field.name, field.name))
           for field in schema
       )
   
   
   def append_filename(ds: pa.dataset.Dataset) -> pa.dataset.Dataset:
       scanner = ds.scanner(columns=ds.schema.names + ['__filename'])
       schema = rename_columns(scanner.projected_schema, __filename='filename')
       # Rebuild each batch under the renamed schema; list() below loads everything into memory.
       batches = (
           pa.RecordBatch.from_arrays(batch.columns, schema=schema)
           for batch in scanner.to_batches()
       )
       return pa.dataset.dataset(list(batches))
   ```
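A lighter variant of the same renaming idea, sketched under the assumption that the scan result is already materialized as a `pyarrow.Table`: `Table.rename_columns` accepts the complete list of new names, so only the names need to be computed and the batches should not need to be rebuilt one by one. `map_names` is a hypothetical helper, and whether this avoids the duckdb error was not verified in this thread.

```python
def map_names(names, **name_mapping):
    """Return a new list of names, replacing any name found in name_mapping."""
    return [name_mapping.get(name, name) for name in names]


# Hypothetical usage with an already-scanned pyarrow Table:
#   new_names = map_names(my_table.schema.names, __filename="filename")
#   my_table = my_table.rename_columns(new_names)
print(map_names(["a", "b", "__filename"], __filename="filename"))
```

Keeping the name mapping separate from pyarrow makes it easy to check on its own before applying it to a large table.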

