You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "candiduslynx (via GitHub)" <gi...@apache.org> on 2023/06/07 16:41:58 UTC

[GitHub] [arrow] candiduslynx opened a new pull request, #35972: GH-35015: [Go] Fix parquet memleak

candiduslynx opened a new pull request, #35972:
URL: https://github.com/apache/arrow/pull/35972

   ### Rationale for this change
   
   Some memory leaks resulted in partially skipped memory checks in pqarrow package.
   This PR brings the checks back.
   
   ### What changes are included in this PR?
   
   Releases in proper places.
   
   ### Are these changes tested?
   
   Yes, the tests from #35015 are fully enabled now.
   
   ### Are there any user-facing changes?
   
   No.
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35972:
URL: https://github.com/apache/arrow/pull/35972#issuecomment-1581181849

   :warning: GitHub issue #35015 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221903408


##########
go/parquet/pqarrow/column_readers.go:
##########
@@ -443,14 +454,16 @@ func chunksToSingle(chunked *arrow.Chunked) (arrow.ArrayData, error) {
 	case 0:
 		return array.NewData(chunked.DataType(), 0, []*memory.Buffer{nil, nil}, nil, 0, 0), nil
 	case 1:
-		return chunked.Chunk(0).Data(), nil
+		data := chunked.Chunk(0).Data()
+		data.Retain() // we pass control to the caller
+		return data, nil
 	default: // if an item reader yields a chunked array, this is not yet implemented
 		return nil, arrow.ErrNotImplemented
 	}
 }
 
 // create a chunked arrow array from the raw record data
-func transferColumnData(rdr file.RecordReader, valueType arrow.DataType, descr *schema.Column, mem memory.Allocator) (*arrow.Chunked, error) {
+func transferColumnData(rdr file.RecordReader, valueType arrow.DataType, descr *schema.Column) (*arrow.Chunked, error) {

Review Comment:
   mem param was unused



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221902212


##########
go/arrow/memory/checked_allocator.go:
##########
@@ -160,14 +160,24 @@ func (a *CheckedAllocator) AssertSize(t TestingT, sz int) {
 				break
 			}
 			callersMsg.WriteString("\t")
-			callersMsg.WriteString(frame.Function)
-			callersMsg.WriteString(fmt.Sprintf(" line %d", frame.Line))
+			if fn := frame.Func; fn != nil {
+				callersMsg.WriteString(fmt.Sprintf("%s+%x", fn.Name(), frame.PC-fn.Entry()))
+			} else {
+				callersMsg.WriteString(fmt.Sprintf("%s, line %d", frame.Function, frame.Line))
+			}
+			callersMsg.WriteString("\n\t\t")
+			callersMsg.WriteString(frame.File + ":" + strconv.Itoa(frame.Line))

Review Comment:
   this gives a proper fill filename, so the leak is easier to find (IntelliJ IDEA even allows clicking on a link to line)



##########
go/arrow/memory/checked_allocator.go:
##########
@@ -160,14 +160,24 @@ func (a *CheckedAllocator) AssertSize(t TestingT, sz int) {
 				break
 			}
 			callersMsg.WriteString("\t")
-			callersMsg.WriteString(frame.Function)
-			callersMsg.WriteString(fmt.Sprintf(" line %d", frame.Line))
+			if fn := frame.Func; fn != nil {
+				callersMsg.WriteString(fmt.Sprintf("%s+%x", fn.Name(), frame.PC-fn.Entry()))
+			} else {
+				callersMsg.WriteString(fmt.Sprintf("%s, line %d", frame.Function, frame.Line))
+			}
+			callersMsg.WriteString("\n\t\t")
+			callersMsg.WriteString(frame.File + ":" + strconv.Itoa(frame.Line))
 			callersMsg.WriteString("\n")
 			if !more {
 				break
 			}
 		}
-		t.Errorf("LEAK of %d bytes FROM %s line %d\n%v", info.sz, f.Name(), info.line, callersMsg.String())
+		file, line := f.FileLine(info.pc)
+		t.Errorf("LEAK of %d bytes FROM\n\t%s+%x\n\t\t%s:%d\n%v",

Review Comment:
   this gives a proper fill filename, so the leak is easier to find (IntelliJ IDEA even allows clicking on a link to line)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on PR #35972:
URL: https://github.com/apache/arrow/pull/35972#issuecomment-1581196509

   reopened as #35973


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221905986


##########
go/parquet/pqarrow/encode_arrow_test.go:
##########
@@ -338,6 +339,7 @@ func simpleRoundTrip(t *testing.T, tbl arrow.Table, rowGroupSize int64) {
 
 		chunked, err := crdr.NextBatch(tbl.NumRows())
 		require.NoError(t, err)
+		defer chunked.Release()

Review Comment:
   lazy, but only in test



##########
go/parquet/pqarrow/encode_arrow_test.go:
##########
@@ -1040,10 +1044,7 @@ func (ps *ParquetIOTestSuite) TestSingleColumnRequiredWrite() {
 	}
 }
 
-func (ps *ParquetIOTestSuite) roundTripTable(_ memory.Allocator, expected arrow.Table, storeSchema bool) {
-	mem := memory.NewCheckedAllocator(memory.DefaultAllocator) // FIXME: currently overriding allocator to isolate leaks between roundTripTable and caller

Review Comment:
   return of the original mem



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221903051


##########
go/parquet/pqarrow/column_readers.go:
##########
@@ -296,11 +306,11 @@ func (sr *structReader) BuildArray(lenBound int64) (*arrow.Chunked, error) {
 	buffers := make([]*memory.Buffer, 1)
 	if validityIO.NullCount > 0 {
 		buffers[0] = nullBitmap
+		defer nullBitmap.Release()
 	}
-
 	data := array.NewData(sr.filtered.Type, int(validityIO.Read), buffers, childArrData, int(validityIO.NullCount), 0)
 	defer data.Release()
-	arr := array.MakeFromData(data)
+	arr := array.NewStructData(data)

Review Comment:
   call directly



##########
go/parquet/pqarrow/column_readers.go:
##########
@@ -443,14 +454,16 @@ func chunksToSingle(chunked *arrow.Chunked) (arrow.ArrayData, error) {
 	case 0:
 		return array.NewData(chunked.DataType(), 0, []*memory.Buffer{nil, nil}, nil, 0, 0), nil
 	case 1:
-		return chunked.Chunk(0).Data(), nil
+		data := chunked.Chunk(0).Data()
+		data.Retain() // we pass control to the caller

Review Comment:
   we want the caller to have the full control of the data (so, releasing the array after the chunksToSingle should have no effect on the returned data.
   For 0 len it was already true (we constructed a new data), so this retain comes in naturally



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221906326


##########
go/parquet/pqarrow/encode_dictionary_test.go:
##########
@@ -400,7 +400,7 @@ func (ar *ArrowReadDictSuite) writeSimple() {
 		pqarrow.DefaultWriterProps()))
 }
 
-func (ArrowReadDictSuite) NullProbabilities() []float64 {
+func (*ArrowReadDictSuite) NullProbabilities() []float64 {

Review Comment:
   Go asks to stick to one of the receivers type: either value or pointer (no mixing)



##########
go/parquet/pqarrow/encode_dictionary_test.go:
##########
@@ -543,6 +543,7 @@ func (ar *ArrowReadDictSuite) TestIncrementalReads() {
 	for i := 0; i < numReads; i++ {
 		chunk, err := col.NextBatch(int64(batchSize))
 		ar.Require().NoError(err)
+		defer chunk.Release()

Review Comment:
   lazy, but only in tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221902509


##########
go/parquet/file/record_reader.go:
##########
@@ -783,19 +783,19 @@ func newByteArrayRecordReader(descr *schema.Column, info LevelInfo, dtype arrow.
 	}}
 }
 
-func (fr *byteArrayRecordReader) ReserveValues(extra int64, hasNullable bool) error {
-	fr.bldr.Reserve(int(extra))
-	return fr.primitiveRecordReader.ReserveValues(extra, hasNullable)
+func (br *byteArrayRecordReader) ReserveValues(extra int64, hasNullable bool) error {

Review Comment:
   just renamed the receiver to have the same name across all methods



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221905669


##########
go/parquet/pqarrow/encode_arrow_test.go:
##########
@@ -320,6 +320,7 @@ func writeTableToBuffer(t *testing.T, mem memory.Allocator, tbl arrow.Table, row
 }
 
 func simpleRoundTrip(t *testing.T, tbl arrow.Table, rowGroupSize int64) {
+	t.Helper()

Review Comment:
   so that we see the real failing test



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221901805


##########
go/arrow/memory/buffer.go:
##########
@@ -35,7 +35,7 @@ type Buffer struct {
 
 // NewBufferBytes creates a fixed-size buffer from the specified data.
 func NewBufferBytes(data []byte) *Buffer {
-	return &Buffer{refCount: 0, buf: data, length: len(data)}
+	return &Buffer{refCount: 1, buf: data, length: len(data)}

Review Comment:
   for housekeeping



##########
go/arrow/memory/checked_allocator.go:
##########
@@ -160,14 +160,24 @@ func (a *CheckedAllocator) AssertSize(t TestingT, sz int) {
 				break
 			}
 			callersMsg.WriteString("\t")
-			callersMsg.WriteString(frame.Function)
-			callersMsg.WriteString(fmt.Sprintf(" line %d", frame.Line))
+			if fn := frame.Func; fn != nil {
+				callersMsg.WriteString(fmt.Sprintf("%s+%x", fn.Name(), frame.PC-fn.Entry()))

Review Comment:
   offset in bytes from func start



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221903678


##########
go/parquet/pqarrow/column_readers.go:
##########
@@ -525,34 +538,29 @@ func transferZeroCopy(rdr file.RecordReader, dt arrow.DataType) arrow.ArrayData
 		}
 	}()
 
-	return array.NewData(dt, rdr.ValuesWritten(), []*memory.Buffer{
-		bitmap, values}, nil, int(rdr.NullCount()), 0)
+	return array.NewData(dt, rdr.ValuesWritten(),
+		[]*memory.Buffer{bitmap, values},

Review Comment:
   put `[]*memory.Buffer{bitmap, values},` on a separate line



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on PR #35972:
URL: https://github.com/apache/arrow/pull/35972#issuecomment-1581192908

   reopening because of force pushed cleanup


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221901607


##########
go/arrow/array/extension.go:
##########
@@ -74,28 +74,12 @@ func NewExtensionArrayWithStorage(dt arrow.ExtensionType, storage arrow.Array) a
 		panic(fmt.Errorf("arrow/array: storage type %s for extension type %s, does not match expected type %s", storage.DataType(), dt.ExtensionName(), dt.StorageType()))
 	}
 
-	base := ExtensionArrayBase{}

Review Comment:
   basically, it was a copied code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #35972:
URL: https://github.com/apache/arrow/pull/35972#discussion_r1221902799


##########
go/parquet/pqarrow/column_readers.go:
##########
@@ -117,12 +111,25 @@ func (lr *leafReader) LoadBatch(nrecords int64) (err error) {
 			}
 		}
 	}
-	lr.out, err = transferColumnData(lr.recordRdr, lr.field.Type, lr.descr, lr.rctx.mem)
+	lr.out, err = transferColumnData(lr.recordRdr, lr.field.Type, lr.descr)

Review Comment:
   mem param was unused



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx closed pull request #35972: GH-35015: [Go] Fix parquet memleak

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx closed pull request #35972: GH-35015: [Go] Fix parquet memleak
URL: https://github.com/apache/arrow/pull/35972


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org