
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/23 01:51:11 UTC

[GitHub] [arrow] njzhl93 opened a new issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

njzhl93 opened a new issue #8741:
URL: https://github.com/apache/arrow/issues/8741


   I have to partition a DataFrame into several Parquet files and read them back using from_paths.
   
   For example, I have two DataFrames like this:
   ```
   MDDate=20201113/20201113.parquet
        0  1      2     3    MDDate
   0  1.1  1  [1.1]     a  20201113
   1  2.2  2  [2.2]  None  20201113
   2  3.3  3  [3.3]     c  20201113
   
   MDDate=20201114/20201114.parquet
         0  1      2     3    MDDate
   0  None  1  [1.1]     a  20201114
   1  None  2   None  None  20201114
   2  None  3  [3.3]     c  20201114 
   ```
   I try to read them with FileSystemDataset.from_paths:
   
   ```
   import pyarrow as pa
   import pyarrow.dataset as ds
   from pyarrow import fs  # needed for fs.SubTreeFileSystem and fs.LocalFileSystem below
   file_list = ['MDDate=20201113/20201113.parquet', 'MDDate=20201114/20201114.parquet']
   schemas = pa.schema([('0', pa.float64()), ('1', pa.int64()), ('2', pa.list_(pa.float64())), ('3', pa.string())])
   partitions = [ds.field("MDDate") == '20201113', ds.field("MDDate") == '20201114']
   dataset = ds.FileSystemDataset.from_paths(file_list,
                                             schema=schemas,
                                             format=ds.ParquetFileFormat(),
                                             filesystem=fs.SubTreeFileSystem(
                                                 'test_read',
                                                 fs.LocalFileSystem()),
                                             partitions=partitions)
   df = dataset.to_table().to_pandas()
   ```
   Then I get an error like this:
   
   ```
   Traceback (most recent call last):
     File "/app/mount/code/test_save_factor/test_read_from_path.py", line 36, in <module>
       df = dataset.to_table().to_pandas()
     File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
   pyarrow.lib.ArrowTypeError: fields had matching names but differing types. From: 0: null To: 0: double
   ```
   I think this error is about the schema. I set column 0 to pa.float64(), but in MDDate=20201114/20201114.parquet the type of column 0 is pa.null() because all of its rows are None, and it cannot be converted to pa.float64() automatically.
   
   Is it possible to read Parquet files using from_paths when some columns do not have the same type in every file?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on issue #8741:
URL: https://github.com/apache/arrow/issues/8741#issuecomment-732028275


   @njzhl93 What version of pyarrow are you using? I *think* that we added the ability to cast from null to any other type in pyarrow 2.0. 
   
   Sidenote: in general we prefer the user mailing list for such questions: https://arrow.apache.org/community/



[GitHub] [arrow] njzhl93 commented on issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

Posted by GitBox <gi...@apache.org>.

njzhl93 commented on issue #8741:
URL: https://github.com/apache/arrow/issues/8741#issuecomment-732031758


   @jorisvandenbossche I use 1.0.1



[GitHub] [arrow] jorisvandenbossche commented on issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on issue #8741:
URL: https://github.com/apache/arrow/issues/8741#issuecomment-732042716


   Good to hear it is working now!



[GitHub] [arrow] njzhl93 commented on issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

Posted by GitBox <gi...@apache.org>.

njzhl93 commented on issue #8741:
URL: https://github.com/apache/arrow/issues/8741#issuecomment-732038589


   @jorisvandenbossche I have tried 2.0.0 just now and it works fine.



[GitHub] [arrow] njzhl93 commented on issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

Posted by GitBox <gi...@apache.org>.

njzhl93 commented on issue #8741:
URL: https://github.com/apache/arrow/issues/8741#issuecomment-732033615


   @jorisvandenbossche I will try, thanks.



[GitHub] [arrow] njzhl93 closed issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

Posted by GitBox <gi...@apache.org>.

njzhl93 closed issue #8741:
URL: https://github.com/apache/arrow/issues/8741


   



[GitHub] [arrow] jorisvandenbossche commented on issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on issue #8741:
URL: https://github.com/apache/arrow/issues/8741#issuecomment-732032561


   Can you try with version 2.0.0?



[GitHub] [arrow] njzhl93 edited a comment on issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

Posted by GitBox <gi...@apache.org>.

njzhl93 edited a comment on issue #8741:
URL: https://github.com/apache/arrow/issues/8741#issuecomment-732031758


   @jorisvandenbossche I am using 1.0.1

