Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/18 09:46:30 UTC

[GitHub] [arrow] yurikoomiga opened a new issue, #13186: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.

yurikoomiga opened a new issue, #13186:
URL: https://github.com/apache/arrow/issues/13186

   Hi all,
   
   When I read a Parquet file with Arrow like this:
   
   ```
   auto st = parquet::arrow::FileReader::Make(
       arrow::default_memory_pool(),
       parquet::ParquetFileReader::Open(_parquet, _properties), &_reader);
   arrow::Status status = _reader->GetRecordBatchReader({_current_group},
                                                        _parquet_column_ids, &_rb_batch);
   _reader->set_batch_size(65536);
   _reader->set_use_threads(true);
   status = _rb_batch->ReadNext(&_batch);
   ```
   
   status is not ok, and the following error occurred:
   `IOError: Corrupt snappy compressed data.`
   
   When I comment out this statement,
   `_reader->set_use_threads(true);`
   the program runs normally and I can read the Parquet file without problems.
   The error only occurs when I read multiple columns with use_threads=true; reading a single column does not trigger it.
   
   The test Parquet file was created with pyarrow. It has a single row group with 3,000,000 records,
   and 20 columns of int and string types.
   
   I am reading the file with C++, Arrow 7.0.0, and snappy 1.1.8.
   
   
   
   Thank you!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yurikoomiga commented on issue #13186: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.

Posted by GitBox <gi...@apache.org>.
yurikoomiga commented on issue #13186:
URL: https://github.com/apache/arrow/issues/13186#issuecomment-1136033315

   > This seems like a bug. Can you create a JIRA ticket? Can you attach a sample file that fails to read?
   
   I have created a JIRA ticket here: https://issues.apache.org/jira/browse/ARROW-16642




[GitHub] [arrow] yurikoomiga commented on issue #13186: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.

Posted by GitBox <gi...@apache.org>.
yurikoomiga commented on issue #13186:
URL: https://github.com/apache/arrow/issues/13186#issuecomment-1130091938

   Thanks. By the way, I just upgraded Arrow to 8.0.0, and this error still occurs.




[GitHub] [arrow] westonpace commented on issue #13186: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #13186:
URL: https://github.com/apache/arrow/issues/13186#issuecomment-1130076106

   This seems like a bug.  Can you create a JIRA ticket?  Can you attach a sample file that fails to read?




[GitHub] [arrow] pitrou commented on issue #13186: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.

Posted by GitBox <gi...@apache.org>.
pitrou commented on issue #13186:
URL: https://github.com/apache/arrow/issues/13186#issuecomment-1131926933

   @yurikoomiga Can you post a sample file that fails somewhere? (or code to reproduce the generation of the file)




[GitHub] [arrow] voidbip commented on issue #13186: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.

Posted by GitBox <gi...@apache.org>.
voidbip commented on issue #13186:
URL: https://github.com/apache/arrow/issues/13186#issuecomment-1143751927

   FYI, I am getting the same error using the C++ SDK, with Apache Arrow 8.0.0 and snappy 1.1.8. I will try to provide more details as I dig in.




[GitHub] [arrow] yurikoomiga commented on issue #13186: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.

Posted by GitBox <gi...@apache.org>.
yurikoomiga commented on issue #13186:
URL: https://github.com/apache/arrow/issues/13186#issuecomment-1135743977

   > @yurikoomiga Can you post a sample file that fails somewhere? (or code to reproduce the generation of the file)
   
   Sorry for the late reply.
   The sample file is too large to attach, so here is the code that generates it:
   
   ```
   import random
   import string
   
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   def create_list(col_type):
       # Generate one random value for a column of the given type.
       if col_type == "VARCHAR":
           chars = string.ascii_letters + string.digits
           return "".join(random.sample(chars, random.randint(1, 20)))
       elif col_type == "INT":
           return random.randint(1, 65536)
   
   def change_column_type(column_type, data_frame_column):
       if "INT" in column_type:
           return data_frame_column.astype("int32")
       return data_frame_column
   
   def build_parquet_schema(column_name, column_type):
       table_list = list()
       for index, column in enumerate(column_name):
           if "VARCHAR" in column_type[index]:
               table_list.append((column, pa.string()))
           elif "INT" in column_type[index]:
               table_list.append((column, pa.int32()))
           else:
               table_list.append((column, pa.string()))
       return pa.schema(table_list)
   
   if __name__ == '__main__':
       parquet_file = "test.parquet"
   
       column_type, column_name, data_list = list(), list(), list()
       for i in range(0, 20):
           column_name.append("TEST%s" % i)
           column_type.append("VARCHAR" if i % 2 == 0 else "INT")
   
       table_schema = build_parquet_schema(column_name, column_type)
   
       for i in range(0, 3 * 1000 * 1000):
           data_list.append(list(map(create_list, column_type)))
   
       test_panda_frame = pd.DataFrame(data_list, columns=tuple(column_name))
       for index, column in enumerate(column_name):
           test_panda_frame[column] = change_column_type(column_type[index], test_panda_frame[column])
       table = pa.Table.from_pandas(test_panda_frame, schema=table_schema)
       # row_group_size exceeds the row count, so everything lands in one row group
       pq.write_table(table, parquet_file, row_group_size=300 * 1000 * 1000)
   ```
   I ran it on Ubuntu 9.4.0 with Python 3.8 and pyarrow 7.0.0.
   You can use it to generate a test.parquet file, then read any set of multiple columns with `_reader->set_use_threads(true);` to reproduce the error.
   @pitrou @westonpace
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org