Posted to issues@arrow.apache.org by "jason khadka (Jira)" <ji...@apache.org> on 2021/02/02 16:39:00 UTC
[jira] [Created] (ARROW-11473) Needs a handling for missing columns while reading parquet file
jason khadka created ARROW-11473:
------------------------------------
Summary: Needs a handling for missing columns while reading parquet file
Key: ARROW-11473
URL: https://issues.apache.org/jira/browse/ARROW-11473
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Reporter: jason khadka
Currently there is no good way to handle the error raised when a requested column is missing from a parquet file.
If one of the requested columns is missing, the read simply raises an ArrowInvalid error:
{code:python}
columns = ['item1', 'item2', 'item3']  # 'item3' is not in the parquet file
pd.read_parquet(file_name, columns=columns)
> ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code}
There is no clean way to handle this. The ArrowInvalid exception does not expose the field name in any structured form, so the missing field cannot easily be dropped on a retry.
Example:
{code:python}
import pandas as pd
from pyarrow.lib import ArrowInvalid

read_columns = ['a', 'b', 'X']  # 'X' is not in the file
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)

try:
    df = pd.read_parquet(file_name, columns=read_columns)
except ArrowInvalid as e:
    inval = e

print(inval.args)
> ("Field named 'X' not found or not unique in the schema.",){code}
You could parse the message above to get 'X', but that is a fragile workaround because it depends on the exact wording of the message. It would be great if the exception exposed the field name as an attribute, so that you could do, for example:
{code:python}
inval.field
> 'X'{code}
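For comparison, the message-parsing workaround might look like the sketch below. The helper name missing_field_from_message is hypothetical, and the pattern assumes the message wording stays exactly as shown in the example output, which nothing guarantees across pyarrow versions.

```python
import re
from typing import Optional

# Pattern taken from the message in the example output; the wording is an
# assumption and may change between pyarrow versions.
_MISSING_FIELD = re.compile(r"Field named '(?P<name>.*?)' not found or not unique")


def missing_field_from_message(message: str) -> Optional[str]:
    """Return the field name quoted in the error message, or None if absent.

    Hypothetical helper, not part of pyarrow or pandas.
    """
    match = _MISSING_FIELD.search(message)
    return match.group("name") if match else None


print(missing_field_from_message(
    "Field named 'X' not found or not unique in the schema."
))
```

In the except block from the example, this would be called as missing_field_from_message(str(e)).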
An even better feature would be error handling in pyarrow's read_table, where something like {{error='ignore'}} could be passed. This would then skip the missing columns by checking the requested names against the schema.
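Until something like that exists, the same effect can probably be approximated on the caller's side by reading the file schema first and requesting only the columns that are present. This is a rough sketch, not a proposed implementation: existing_columns and read_parquet_ignore_missing are hypothetical names, while pyarrow.parquet.read_schema and pyarrow.parquet.read_table are the existing pyarrow functions it relies on.

```python
from typing import Iterable, List


def existing_columns(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Keep only the requested names that are available, in requested order."""
    available_set = set(available)
    return [name for name in requested if name in available_set]


def read_parquet_ignore_missing(path: str, columns: Iterable[str]):
    """Read only the requested columns that exist in the file (hypothetical helper)."""
    # Imported here so that existing_columns() has no pyarrow dependency.
    import pyarrow.parquet as pq

    schema = pq.read_schema(path)  # reads the file metadata, not the data
    present = existing_columns(columns, schema.names)
    return pq.read_table(path, columns=present).to_pandas()
```

With the file from the example, read_parquet_ignore_missing('/tmp/my_df.pq', ['a', 'b', 'X']) would request only 'a' and 'b'. The cost is one extra metadata read, and missing columns are dropped silently, so a caller that needs to know which ones were skipped has to compare the two lists itself.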