
Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/02/05 16:04:00 UTC

[jira] [Updated] (ARROW-11473) [Python] Needs a handling for missing columns while reading parquet file

     [ https://issues.apache.org/jira/browse/ARROW-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-11473:
------------------------------------
    Summary: [Python] Needs a handling for missing columns while reading parquet file   (was: Needs a handling for missing columns while reading parquet file )

> [Python] Needs a handling for missing columns while reading parquet file 
> -------------------------------------------------------------------------
>
>                 Key: ARROW-11473
>                 URL: https://issues.apache.org/jira/browse/ARROW-11473
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: jason khadka
>            Priority: Major
>
> Currently there is no way to handle the error raised when a requested column is missing from a parquet file.
> If a column passed in is missing, an ArrowInvalid error is raised:
> {code:java}
> columns = ['item1', 'item2', 'item3']  # 'item3' is not in the parquet file
> pd.read_parquet(file_name, columns=columns)
> > ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code}
> There is no way to handle this. The ArrowInvalid exception also does not carry any structured information that exposes the field name, so the field cannot easily be dropped on the next try.
> Example:
> {code:java}
> import pandas as pd
> from pyarrow.lib import ArrowInvalid
>
> read_columns = ['a', 'b', 'X']  # 'X' is not in the file
> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
> file_name = '/tmp/my_df.pq'
> df.to_parquet(file_name)
> try:
>     df = pd.read_parquet(file_name, columns=read_columns)
> except ArrowInvalid as e:
>     inval = e
> print(inval.args)
> > ("Field named 'X' not found or not unique in the schema.",){code}
>  
> You could parse the message above to get 'X', but that is a fragile workaround. It would be great if the exception exposed the field name as an attribute, so that you could do, for example:
>  
> {code:java}
> inval.field 
> > 'X'{code}
> Or a better feature would be error handling in pyarrow's read_table, where something like {{error='ignore'}} could be passed. This would then skip the missing column after checking the schema.
> For example, in the case above:
> {code:java}
> df = pd.read_parquet(file_name, columns=read_columns, error='ignore'){code}
> This would ignore the missing column 'X'.


