Posted to jira@arrow.apache.org by "jason khadka (Jira)" <ji...@apache.org> on 2021/02/02 16:39:00 UTC

[jira] [Created] (ARROW-11473) Needs a handling for missing columns while reading parquet file

jason khadka created ARROW-11473:
------------------------------------

             Summary: Needs a handling for missing columns while reading parquet file 
                 Key: ARROW-11473
                 URL: https://issues.apache.org/jira/browse/ARROW-11473
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
            Reporter: jason khadka


Currently there is no clean way to handle the error raised when a requested column is missing from a parquet file.

If a requested column is missing, the read simply raises an ArrowInvalid error:
{code:python}
columns = ['item1', 'item2', 'item3']  # 'item3' is not present in the parquet file

pd.read_parquet(file_name, columns=columns)

> ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code}
There is no good way to handle this. The ArrowInvalid exception does not carry the missing field name in a structured form, so the caller cannot simply drop that field on the next attempt.

Example:

{code:python}
import pandas as pd
from pyarrow.lib import ArrowInvalid

read_columns = ['a', 'b', 'X']  # 'X' does not exist in the file
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})

file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)

try:
    df = pd.read_parquet(file_name, columns=read_columns)
except ArrowInvalid as e:
    inval = e

print(inval.args)
> ("Field named 'X' not found or not unique in the schema.",){code}

You could parse the message above to get 'X', but that is a fragile workaround. It would be great if the exception exposed the field name as an attribute, so that you could do, for example:

{code:python}
inval.field
> 'X'{code}
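For reference, the message-parsing workaround mentioned above could look roughly like the sketch below. It assumes the message wording shown in this report stays unchanged, which is not guaranteed, and the helper name {{missing_field_name}} is made up for illustration.

```python
import re

# Assumption: the wording matches the ArrowInvalid message quoted in this report.
# The message text is not a stable interface, so this pattern may break between versions.
_MISSING_FIELD = re.compile(r"Field named '(.+)' not found or not unique in the schema")


def missing_field_name(exc):
    """Return the field name quoted in the error message, or None if it does not match."""
    match = _MISSING_FIELD.search(str(exc))
    return match.group(1) if match else None
```

It would be called as {{missing_field_name(e)}} inside the {{except ArrowInvalid as e}} branch; a {{None}} result means the message did not have the expected shape and the error should probably be re-raised.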
A better alternative would be error handling in pyarrow's read_table itself, where an option such as {{error='ignore'}} could be passed. The reader would then check the schema and skip any requested columns that are missing.
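In the meantime, similar behaviour can probably be approximated on the caller side by reading the file schema first and requesting only the columns that exist. The sketch below is an assumption about how that might look, not a pyarrow feature: {{split_columns}} and {{read_parquet_ignoring_missing}} are hypothetical helper names, and only top-level field names are compared.

```python
def split_columns(requested, available):
    """Split the requested column names into (present, missing), preserving order."""
    available = set(available)
    present = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    return present, missing


def read_parquet_ignoring_missing(path, columns):
    """Read only the requested columns that exist in the file.

    Returns the DataFrame and the list of requested columns that were skipped.
    """
    # Imported here so that split_columns stays usable without pandas/pyarrow installed.
    import pandas as pd
    import pyarrow.parquet as pq

    # read_schema reads only the file metadata, not the column data.
    schema = pq.read_schema(path)
    present, missing = split_columns(columns, schema.names)
    return pd.read_parquet(path, columns=present), missing
```

With the example file above, {{read_parquet_ignoring_missing(file_name, read_columns)}} should return the 'a' and 'b' columns and report 'X' as skipped. The cost is one extra metadata read per file, which a built-in option could avoid.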


