Posted to issues@arrow.apache.org by "jhwang7628 (via GitHub)" <gi...@apache.org> on 2023/11/03 22:13:53 UTC

[I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

jhwang7628 opened a new issue, #38577:
URL: https://github.com/apache/arrow/issues/38577

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi,
   
   We have a Parquet file that read fine with 13.0.0, but I now get an error when reading it via `pandas.read_parquet` with 14.0.0. The relevant error is:
   ```
     File "/opt/venv/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 3003, in read_table
       return dataset.read(columns=columns, use_threads=use_threads,   
     File "/opt/venv/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2631, in read
       table = self._dataset.to_table(  
     File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3713, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2148480400
   ```
   
   Is this intended behavior? I skimmed through the changelog but did not find anything about it. Thanks.
   
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "alexeyche (via GitHub)" <gi...@apache.org>.

alexeyche commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1798188028

   I also noticed this, though it is hard to reproduce without sharing the data for now; I might find time in the future. In short, it is just a big dataset with nested arrays of floats.



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1826032128

   After seeing the potential fix in https://github.com/apache/arrow/pull/38784, I managed to create a simple reproducer:
   
   A file created like this with pyarrow 13.0 reads fine with that version:
   ```python
   import string
   import numpy as np
   import pyarrow as pa
   
   # column with >2GB data
   data = ["".join(np.random.choice(list(string.ascii_letters), n)) for n in np.random.randint(10, 500, size=10_000)]
   table = pa.table({'a': pa.array(data*1000)})
   
   import pyarrow.parquet as pq
   pq.write_table(table, "test_capacity.parquet")
   ```
   
   but reading with pyarrow 14:
   
   ```
   import pyarrow.parquet as pq
   pf = pq.ParquetFile("test_capacity.parquet")
   
   In [6]: pf.read()
   ...
   ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2148282365
   /home/joris/scipy/repos/arrow/cpp/src/arrow/array/builder_binary.h:332  ValidateOverflow(elements)
   /home/joris/scipy/repos/arrow/cpp/src/parquet/encoding.cc:1202  acc_->builder->ReserveData( std::min<int64_t>(*estimated_data_length, ::arrow::kBinaryMemoryLimit))
   /home/joris/scipy/repos/arrow/cpp/src/parquet/encoding.cc:1407  helper.Prepare(len_)
   /home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
   /home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:1252  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
   /home/joris/scipy/repos/arrow/cpp/src/parquet/arrow/reader.cc:1233  fut.MoveResult()
   ```
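
For context on the number in the message: 2147483646 is one less than the largest signed 32-bit integer, which fits with Arrow's regular string and binary arrays using 32-bit offsets, and is presumably what `::arrow::kBinaryMemoryLimit` in the trace refers to. The following is only an arithmetic sketch. The helper name `bytes_over_limit` is made up for illustration, and the constant is copied from the error message rather than read from pyarrow.

```python
# Limit copied from the error message; it equals INT32_MAX - 1.
INT32_MAX = 2**31 - 1
BINARY_LIMIT = 2_147_483_646


def bytes_over_limit(requested, limit=BINARY_LIMIT):
    """Return by how many bytes a requested size exceeds the limit (0 if it fits)."""
    return max(0, requested - limit)


# Sizes reported in the two error messages in this thread.
print(bytes_over_limit(2_148_480_400))  # original report
print(bytes_over_limit(2_148_282_365))  # reproducer
```

Both reported sizes overshoot the limit by less than 1 MB.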
   



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "jhwang7628 (via GitHub)" <gi...@apache.org>.

jhwang7628 commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1830996664

   Thanks! I missed the above comments. Glad you were able to repro yourself. I'll wait until this PR gets incorporated to re-test it on my data. Thanks again!



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1793350830

   Hmm, would you mind providing the file? It is a bit hard to check the scanner change without the data or logging.



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "jhwang7628 (via GitHub)" <gi...@apache.org>.

jhwang7628 commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1799780808

   I cannot provide the data blob as it is company internal data. The issue is consistent and deterministic on my side, so I think maybe any large parquet will do?



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou closed issue #38577: Reading parquet file behavior change from 13.0.0 to 14.0.0
URL: https://github.com/apache/arrow/issues/38577



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1798195178

   https://github.com/apache/arrow/pull/38621
   
   A bug was found where reading Parquet via Python's `read_table` issues more requests than expected, but I don't know if it is related to this issue.
   
   @alexeyche Would you mind providing some info on how to reproduce the issue here?



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1823181409

   > I cannot provide the data blob as it is company internal data. The issue is consistent and deterministic on my side, so I think maybe any large parquet will do?
   
   Would you be able to create a file that has the same characteristics as your internal data file but with random data (e.g. approximately the same types, number of rows, and size)?
   That would help a lot in evaluating if we are actually fixing this issue with https://github.com/apache/arrow/pull/38784



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1801247211

   @jhwang7628 After going through the code, I think this might be related to https://github.com/apache/arrow/pull/38437
   
   I'll try to check this this week. It might cause trouble when you read a large binary column: after this patch, the reader might reserve more data than expected.



Re: [I] Reading parquet file behavior change from 13.0.0 to 14.0.0 [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #38577:
URL: https://github.com/apache/arrow/issues/38577#issuecomment-1817774617

   @jhwang7628 @alexeyche Would you mind checking whether you're using dictionary encoding?

