Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/15 17:03:44 UTC

[GitHub] [arrow] westonpace commented on issue #9194: Needs a handling for missing columns in parquet file

westonpace commented on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-761062704


   Do you know the data type of the missing column?  If so, you can use the datasets API to read the table.  The datasets API accepts an expected schema listing every column that might be requested; any column that a file lacks comes back filled with nulls.  This allows for dataset evolution, where you have a master schema for a collection of files but individual files might not have all the columns.
   
   ```
   import pandas as pd
   import pyarrow as pa
   import pyarrow.dataset as pads
   
   read_columns = ['a','b','X']
   
   df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
   file_name = '/tmp/my_df.pq'
   df.to_parquet(file_name)
   
   schema = pa.schema([
       ('a', pa.int64()),
       ('b', pa.string()),
       ('X', pa.int32())
   ])
   
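   # Original pandas call, which has no way to supply the missing column 'X':
   # df = pd.read_parquet(file_name, columns=read_columns)
   # Passing the full schema to the dataset lets it fill 'X' in instead.
   ds = pads.dataset([file_name], schema=schema)
   table = ds.to_table(columns=read_columns)
   print(table)
   # 'X' is absent from the file, so it is read as all nulls.
   print(table.column('X').to_pylist())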
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org