Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/09/06 14:18:00 UTC

[jira] [Commented] (ARROW-8210) [C++][Dataset] Handling of duplicate columns in Dataset factory and scanning

    [ https://issues.apache.org/jira/browse/ARROW-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600788#comment-17600788 ] 

Joris Van den Bossche commented on ARROW-8210:
----------------------------------------------

There is some discussion in https://github.com/apache/arrow/pull/13938#issuecomment-1223657494 about how this could be handled.

> [C++][Dataset] Handling of duplicate columns in Dataset factory and scanning
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-8210
>                 URL: https://issues.apache.org/jira/browse/ARROW-8210
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> While testing duplicate column names, I ran into multiple issues:
> * The factory fails if there are duplicate columns, even for a single file.
> * In addition, we should also fix and/or test that the factory works for duplicate columns if the schemas are equal.
> * Once a Dataset with duplicated columns is created, scanning without any column projection fails.
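> The factory failure in the first bullet appears to come from schema unification. As a hedged sketch (assuming {{pa.unify_schemas}} rejects duplicate field names even for a single schema, which would explain the factory error shown below), it can be reproduced directly:
> {code:python}
> import pyarrow as pa
>
> # Schema with a duplicated field name, mirroring the reproducer below.
> schema = pa.schema([('a', pa.int64()), ('b', pa.int64()), ('a', pa.int64())])
>
> # Assumption: unify_schemas raises ArrowInvalid on duplicate field names.
> try:
>     pa.unify_schemas([schema])
>     error = None
> except pa.ArrowInvalid as exc:
>     error = str(exc)
> print(error)
> {code}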
> ---
> My python reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> import pyarrow.fs
> # create single parquet file with duplicated column names
> table = pa.table(
>     [pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
>     names=['a', 'b', 'a'])
> pq.write_table(table, "data_duplicate_columns.parquet")
> {code}
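> Note that the Table/Parquet layer itself round-trips duplicated names fine, so the problem is specific to dataset discovery and scanning. A sketch (the file is recreated so this is self-contained):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Same reproducer file as above.
> table = pa.table(
>     [pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
>     names=['a', 'b', 'a'])
> pq.write_table(table, "data_duplicate_columns.parquet")
>
> # Reading back through pyarrow.parquet (not the Dataset API) preserves
> # the duplicated column names.
> restored = pq.read_table("data_duplicate_columns.parquet")
> print(restored.column_names)  # ['a', 'b', 'a']
> {code}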
> Factory fails:
> {code}
> dataset = ds.dataset("data_duplicate_columns.parquet", format="parquet")
> ...
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, filesystem, partitioning, format)
>     346 
>     347     factories = [_ensure_factory(f, **kwargs) for f in paths_or_factories]
> --> 348     return UnionDatasetFactory(factories).finish()
>     349 
>     350 
> ArrowInvalid: Can't unify schema with duplicate field names.
> {code}
> And when creating a Dataset manually:
> {code:python}
> schema = pa.schema([('a', 'int64'), ('b', 'int64'), ('a', 'int64')])
> dataset = ds.FileSystemDataset(
>     schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
>     [str(basedir / "data_duplicate_columns.parquet")], [ds.ScalarExpression(True)])
> {code}
> then scanning fails:
> {code}
> >>> dataset.to_table()
> ...
> ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
> b: int64
> a: int64
> {code}
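> As a possible workaround until the ambiguous {{FieldRef.Name}} lookup is fixed, duplicated columns can be disambiguated by position at the Table level (a hedged sketch using plain pyarrow, not the Dataset scanner itself):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Recreate the reproducer file so this sketch is self-contained.
> table = pa.table(
>     [pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
>     names=['a', 'b', 'a'])
> pq.write_table(table, "data_duplicate_columns.parquet")
>
> # Integer indices are unambiguous where FieldRef.Name('a') is not.
> restored = pq.read_table("data_duplicate_columns.parquet")
> subset = restored.select([0, 2])  # both 'a' columns, by position
> print(subset.column_names)  # ['a', 'a']
> {code}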



--
This message was sent by Atlassian Jira
(v8.20.10#820010)