Posted to issues@arrow.apache.org by "Joseph Gardi (Jira)" <ji...@apache.org> on 2022/08/18 23:45:00 UTC

[jira] [Created] (ARROW-17469) Failure to parse files that can be parsed on pyarrow. Also, failure to recover from crash

Joseph Gardi created ARROW-17469:
------------------------------------

             Summary: Failure to parse files that can be parsed on pyarrow. Also, failure to recover from crash
                 Key: ARROW-17469
                 URL: https://issues.apache.org/jira/browse/ARROW-17469
             Project: Apache Arrow
          Issue Type: Bug
         Environment: Mac OS 11.4
go 1.17.1
github.com/apache/arrow/go/arrow v0.0.0-20211112161151-bc219186db40
            Reporter: Joseph Gardi
         Attachments: part-00000-db343798-fcc1-4288-be39-3b00bed75c24.c000.snappy.parquet

I am using the following code to read Parquet files in Go:
{code:java}
import (
"github.com/apache/arrow/go/v10/arrow/memory"
"github.com/apache/arrow/go/v10/parquet/file"
"github.com/apache/arrow/go/v10/parquet/pqarrow"
...
)

pf, err := file.NewParquetReader(bytes.NewReader(data))
check(err)
preader, err := pqarrow.NewFileReader(pf, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
check(err)
fmt.Println("before read table")
result, err := preader.ReadTable(ctx)
check(err)
fmt.Println("result is", result.NumRows())
result.Release(){code}
It works on some Parquet files but not on other files that can be parsed by pyarrow's read_table function. However, even pyarrow fails to parse some Parquet files that I was able to parse with [https://github.com/xitongsys/parquet-go]. I've attached an example of a file that fails. When it fails I get this stack trace:

{code:java}
panic: runtime error: index out of range [0] with length 0

goroutine 595 [running]:
github.com/apache/arrow/go/v10/parquet/internal/utils.NewFirstTimeBitmapWriter(...)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/internal/utils/bitmap_writer.go:83
github.com/apache/arrow/go/v10/parquet/file.defLevelsToBitmapInternal({0xc001714500, 0x0, 0x1000}, {0x881b680, 0x0, 0x100, 0x2648}, 0xc001583880, 0x40)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/file/level_conversion.go:173 +0x23b
github.com/apache/arrow/go/v10/parquet/file.DefLevelsToBitmap(...)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/file/level_conversion.go:186
github.com/apache/arrow/go/v10/parquet/pqarrow.(*structReader).BuildArray(0xc00049dec0, 0x0)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/pqarrow/column_readers.go:279 +0x1b3
github.com/apache/arrow/go/v10/parquet/pqarrow.(*listReader).BuildArray(0xc000a31ac0, 0xbd3)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/pqarrow/column_readers.go:391 +0x4a2
github.com/apache/arrow/go/v10/parquet/pqarrow.(*structReader).BuildArray(0xc000418f60, 0xbd3)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/pqarrow/column_readers.go:289 +0x534
github.com/apache/arrow/go/v10/parquet/pqarrow.(*ColumnReader).NextBatch(0xc00051c330, 0x0)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/pqarrow/file_reader.go:134 +0x5c
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadColumn(0xc001583f88, {0xc0008f72b0, 0xc000bda3f0, 0x0}, 0xc000a316c0)
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/pqarrow/file_reader.go:247 +0x65
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadRowGroups.func1()
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/pqarrow/file_reader.go:341 +0xd2
created by github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileReader).ReadRowGroups
    /Users/josephgardi/.gvm/pkgsets/go1.17.1/global/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220818191625-a1c3d57af514/parquet/pqarrow/file_reader.go:332 +0x3d2{code}
 

There is always some chance that my application will encounter a bad Parquet file, so I'd like to be able to recover from this panic. However, a deferred recover in my own code cannot catch it, because the panic is raised in a different goroutine, which is created on line 332 of file_reader.go in ReadRowGroups.

So it seems that the solution is to do a recover within that goroutine and then try a different parser such as [xitongsys/parquet-go|https://github.com/xitongsys/parquet-go].
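
To illustrate the proposal, here is a minimal sketch of the recover-inside-the-goroutine pattern followed by a fallback parser. It is not the Arrow implementation: parseFunc, runInGoroutine and readWithFallback are hypothetical names, and the two parsers in main are stand-ins rather than real pqarrow or parquet-go calls. The point it shows is that the deferred recover has to live in the goroutine that panics, which for this bug means inside the goroutine started by ReadRowGroups; wrapping ReadTable from the outside would not catch it.

```go
package main

import "fmt"

// parseFunc is a hypothetical stand-in for a parser entry point; it is not
// part of the Arrow or parquet-go APIs. It returns the number of rows read.
type parseFunc func(data []byte) (int, error)

// runInGoroutine runs parse in a new goroutine and defers a recover inside
// that same goroutine, so a panic is handed back to the caller as an error
// instead of crashing the process.
func runInGoroutine(parse parseFunc, data []byte) (int, error) {
	type result struct {
		rows int
		err  error
	}
	done := make(chan result, 1) // buffered: exactly one send per call
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- result{err: fmt.Errorf("parser panicked: %v", r)}
			}
		}()
		rows, err := parse(data)
		done <- result{rows: rows, err: err}
	}()
	res := <-done
	return res.rows, res.err
}

// readWithFallback tries primary first and uses secondary only if primary
// returned an error or panicked.
func readWithFallback(primary, secondary parseFunc, data []byte) (int, error) {
	rows, err := runInGoroutine(primary, data)
	if err == nil {
		return rows, nil
	}
	rows, err2 := runInGoroutine(secondary, data)
	if err2 != nil {
		return 0, fmt.Errorf("primary: %v; fallback: %w", err, err2)
	}
	return rows, nil
}

func main() {
	// Stand-in that triggers the same kind of index-out-of-range panic
	// as the one in the stack trace.
	panicking := func(data []byte) (int, error) {
		var empty []byte
		return int(empty[0]), nil
	}
	// Stand-in for a second parser that succeeds.
	working := func(data []byte) (int, error) { return len(data), nil }

	rows, err := readWithFallback(panicking, working, []byte{1, 2, 3})
	fmt.Println(rows, err)
}
```

With a recover like this in the library's goroutine, ReadTable could return an error for a file it cannot handle, and the caller could then choose a fallback as readWithFallback does.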


