
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/13 15:49:37 UTC

[GitHub] [arrow] jasonkhadka opened a new issue #9194: Needs a handling for missing columns in parquet file

jasonkhadka opened a new issue #9194:
URL: https://github.com/apache/arrow/issues/9194


   Currently there is no way to handle the error raised when a requested column is missing from a parquet file. If a column passed in `columns=[item1, item2, item3]` is missing, it just raises:
   `ArrowInvalid: Field named 'item3' not found or not unique in the schema.`
   
   There is no way to handle this. The ArrowInvalid also does not carry any information that can give out the field name so that on the next try this field can be ignored.
   
   https://github.com/apache/arrow/blob/ec18db9dbde801781109095dc4c7198dc35bbc07/python/pyarrow/parquet.py#L1657


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jasonkhadka commented on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

jasonkhadka commented on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-760857808


   > Like a keyword to indicate that missing columns in this list can be ignored instead of raising an error?
   
   Yes, an `error='ignore'` keyword would be a perfect solution.
   
   
   > The name of the missing field is in the error message?
   
   The name of the missing field is there in the error message. But getting the field name out of the error, so that you can drop it from the list of columns and retry reading the parquet file, is difficult.
   The error only contains the message; it would be great if the error also exposed the field name as a property, so that error handling could be built on it.



[GitHub] [arrow] emkornfield commented on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

emkornfield commented on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-770164610


   Closing. I think opening a feature request/bug report through JIRA is the next step.



[GitHub] [arrow] jorisvandenbossche commented on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-760768514


   What kind of way to handle this are you looking for? 
   Like a keyword to indicate that missing columns in this list can be ignored instead of raising an error?
   
   > The ArrowInvalid also does not carry any information that can give out the field name so that on the next try this field can be ignored.
   
   The name of the missing field is in the error message?
   
   



[GitHub] [arrow] emkornfield closed issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

emkornfield closed issue #9194:
URL: https://github.com/apache/arrow/issues/9194


   



[GitHub] [arrow] jasonkhadka edited a comment on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

jasonkhadka edited a comment on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-760857808


   > Like a keyword to indicate that missing columns in this list can be ignored instead of raising an error?
   
   Yes an `error='ignore'`  keyword would be a perfect solution. 
   
   
   > The name of the missing field is in the error message?
   
   The name of the missing field is there in the error message. But getting the field name out of the error, so that you can drop it from the list of columns and retry reading the parquet file, is difficult.
   The error only contains the message; it would be great if the error also exposed the field name as a property, so that error handling could be built on it.
   
   
   Example : 
   
   
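   ```
   import pandas as pd
   from pyarrow.lib import ArrowInvalid
   
   # file_name and columns are assumed to be defined by the caller
   try:
       df = pd.read_parquet(file_name, columns=columns)
   except ArrowInvalid as e:
       inval = e  # keep the exception for inspection
   ```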
   ```
   inval.args
   >("Field named 'COLUMN_A' not found or not unique in the schema.",)
   ```
   
   You could parse the message above to get 'COLUMN_A', but that is a bit of a hacky solution. It would be great if the error object exposed the field name as an attribute, so you could do, for example:
   
   ```
   inval.field
   > 'COLUMN_A'
   ```
   And with this, one could remove 'COLUMN_A' from the list 'columns' and then retry reading the parquet file.



[GitHub] [arrow] jasonkhadka edited a comment on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

jasonkhadka edited a comment on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-760857808


   > Like a keyword to indicate that missing columns in this list can be ignored instead of raising an error?
   
   Yes an `error='ignore'`  keyword would be a perfect solution. 
   
   
   > The name of the missing field is in the error message?
   
   The name of the missing field is there in the error message. But getting the field name out of the error, so that you can drop it from the list of columns and retry reading the parquet file, is difficult.
   The error only contains the message; it would be great if the error also exposed the field name as a property, so that error handling could be built on it.



[GitHub] [arrow] westonpace commented on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-763045595


   That seems like a reasonable request. Could you please report this feature request on Arrow's [JIRA](https://issues.apache.org/jira/browse/ARROW)?



[GitHub] [arrow] jasonkhadka commented on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

jasonkhadka commented on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-762170184


   I guess this could be done, but it would be convenient if there were a simple argument to ignore the columns that are not present in the dataset.
   Also, it might not be possible to know all the dtypes before reading the parquet file.
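
One workaround that needs only column names, not dtypes, is to read the file's schema first and request just the columns that are actually present. The sketch below assumes pyarrow is installed; `select_present` and `read_parquet_ignoring_missing` are hypothetical helper names chosen for illustration, not part of pyarrow or pandas.

```python
def select_present(requested, available):
    """Split requested column names into (present, missing), keeping order."""
    available = set(available)
    present = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    return present, missing


def read_parquet_ignoring_missing(path, columns):
    """Read only the requested columns that exist in the file.

    Hypothetical helper: compares top-level field names only, so nested
    column paths are not handled.
    """
    # Imported here so that select_present stays dependency-free.
    import pyarrow.parquet as pq

    names = pq.read_schema(path).names  # reads metadata, not the data pages
    present, missing = select_present(columns, names)
    table = pq.read_table(path, columns=present)
    return table, missing
```

With the example file used in this thread, calling `read_parquet_ignoring_missing('/tmp/my_df.pq', ['a', 'b', 'X'])` should give back a table with columns `a` and `b` and report `['X']` as missing, which the caller can log or ignore.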



[GitHub] [arrow] jasonkhadka closed issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

jasonkhadka closed issue #9194:
URL: https://github.com/apache/arrow/issues/9194


   



[GitHub] [arrow] jasonkhadka edited a comment on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

jasonkhadka edited a comment on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-760857808


   > Like a keyword to indicate that missing columns in this list can be ignored instead of raising an error?
   
   Yes an `error='ignore'`  keyword would be a perfect solution. 
   
   
   > The name of the missing field is in the error message?
   
   The name of the missing field is there in the error message. But getting the field name out of the error, so that you can drop it from the list of columns and retry reading the parquet file, is difficult.
   The error only contains the message; it would be great if the error also exposed the field name as a property, so that error handling could be built on it.
   
   
   Example : 
   
   
   ```
   import pandas as pd
   from pyarrow.lib import ArrowInvalid
   
   read_columns = ['a', 'b', 'X']  # 'X' is not in the file
   
   df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
   file_name = '/tmp/my_df.pq'
   df.to_parquet(file_name)
   
   try:
       df = pd.read_parquet(file_name, columns=read_columns)
   except ArrowInvalid as e:
       inval = e  # keep the exception for inspection
   ```
   ```
   inval.args
   >("Field named 'X' not found or not unique in the schema.",)
   ```
   
   You could parse the message above to get 'X', but that is a bit of a hacky solution. It would be great if the error object exposed the field name as an attribute, so you could do, for example:
   
   ```
   inval.field
   > 'X'
   ```
   And with this, one could remove 'X' from the list 'read_columns' and then retry reading the parquet file.
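
Until the error carries the field name as an attribute, the parse-and-retry approach described above could be sketched roughly as follows. This is a minimal illustration, not an Arrow API: `missing_field` and `read_skipping_missing` are hypothetical names, and the regular expression assumes the exact message wording quoted in this thread, which is not guaranteed to stay stable across Arrow versions.

```python
import re

# Matches the message text quoted in this thread; the wording is an
# assumption and may differ in other Arrow versions.
_MISSING_FIELD = re.compile(
    r"Field named '(?P<name>.+)' not found or not unique in the schema"
)


def missing_field(exc):
    """Return the field name embedded in the error message, or None."""
    match = _MISSING_FIELD.search(str(exc))
    return match.group("name") if match else None


def read_skipping_missing(read, columns, error_type=Exception):
    """Call read(columns), dropping each column reported missing and retrying.

    Returns (result, kept_columns). Re-raises any error that does not name
    one of the remaining columns, so the loop always terminates.
    """
    remaining = list(columns)
    while True:
        try:
            return read(remaining), remaining
        except error_type as exc:
            name = missing_field(exc)
            if name is None or name not in remaining:
                raise
            remaining.remove(name)
```

With pandas this might be called as `read_skipping_missing(lambda cols: pd.read_parquet(file_name, columns=cols), read_columns, error_type=ArrowInvalid)`. Each retry re-opens the file, so checking the schema up front is likely cheaper when many columns may be missing.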



[GitHub] [arrow] westonpace commented on issue #9194: Needs a handling for missing columns in parquet file

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-761062704


   Do you know the data type of the missing column? If so, you can use the datasets API to read the table. The datasets API can take an expected schema that lists all columns that might be asked for. This allows for dataset evolution, where you have a master schema for a collection of files but individual files might not have all the columns.
   
   ```
   import pandas as pd
   import pyarrow as pa
   import pyarrow.dataset as pads
   
   read_columns = ['a','b','X']
   
   df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
   file_name = '/tmp/my_df.pq'
   df.to_parquet(file_name)
   
   schema = pa.schema([
       ('a', pa.int64()),
       ('b', pa.string()),
       ('X', pa.int32())
   ])
   
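   # df = pd.read_parquet(file_name, columns = read_columns)
   # Passing the full schema lets the dataset fill columns absent from the file with nulls
   ds = pads.dataset([file_name], schema=schema)
   table = ds.to_table()
   print(table)
   print(table.column('X').to_pylist())  # all None, since 'X' is not in the file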
   ```

