You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/04 23:47:52 UTC

[GitHub] [arrow] minyoung opened a new pull request, #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

minyoung opened a new pull request, #13068:
URL: https://github.com/apache/arrow/pull/13068

   `parquet/file.serializedPageReader` has a [memory.Buffer](https://github.com/apache/arrow/blob/8bd5514f52bf9cc542a389edaf697cbc2c97b752/go/parquet/file/page_reader.go#L299) attribute (presumably to reuse across page reads). But at the end of `serializedPageReader.Next` (in the non-error case), a new `memory.Buffer` is [created](https://github.com/apache/arrow/blob/8bd5514f52bf9cc542a389edaf697cbc2c97b752/go/parquet/file/page_reader.go#L615) without releasing the pre-existing `p.buf`, thus resulting in a leak.
   
   Existing tests updated to test and catch this (`parquet/file` now uses `CheckedAllocator).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13068:
URL: https://github.com/apache/arrow/pull/13068#issuecomment-1118031945

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] zeroshade commented on pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

zeroshade commented on PR #13068:
URL: https://github.com/apache/arrow/pull/13068#issuecomment-1124230142

   Thanks @minyoung!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] zeroshade commented on a diff in pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

zeroshade commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r867062398


##########
go/parquet/file/page_reader.go:
##########
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
 			// we don't know this page type, we're allowed to skip non-data pages
 			continue
 		}
-
-		p.buf = memory.NewResizableBuffer(p.mem)

Review Comment:
   Ok, looking back at the code I notice something: In all cases *except* for a V2 DataPage, even if the data isn't compressed, we end up copying it out of `buf` into memory allocated by the `decompressBuffer`. This has two effects:
   
   1. It separates the connection between `p.buf` and the `data` variable we pass to the data page.
   2. It *ensures* that we can't possibly have a situation of leaking memory because we're not using the passed in memory allocator to create that copy. We know for a fact that Go allocated that memory for the page (Not necessarily a good thing, but sure, we can go with that for now. a larger refactor may be a future idea to improve this). The only time this isn't the case is line 586, where we do `io.ReadFull(p.r p.buf.Bytes())`. 
   
   My original idea here was that we'd utilize the same buffer where possible, creating a new buffer for `p.buf` each time. This is exemplified by the `Release` that exists on the `Page` interface. So the easy solution I can see is to keep your change to create a New buffer for each call into Next, and make a slight modification in the `DataPageV2` case:
   
   Currently it looks like this in your change.
   
   ```go
   var data []byte
   if compressed {
           if levelsBytelen > 0 {
                   io.ReadFull(p.r, buf.Bytes()[:levelsBytelen])
           }
   	if data, p.err = p.decompress(lenCompressed-levelsBytelen, buf.Bytes()[levelsBytelen:]); p.err != nil {
   	        return false
           }
   } else {
           io.ReadFull(p.r, buf.Bytes())
   	data = buf.Bytes()
   }
   ```
   
   And then we call `NewBufferBytes(data)` when creating the page. We can likely solve the issue by instead doing this:
   
   ```go
   var pagebuf *memory.Buffer
   if compressed {
           if levelsBytelen > 0 {
                   io.ReadFull(p.r, buf.Bytes()[:levelsBytelen])
           }
           var data []byte
   	if data, p.err = p.decompress(lenCompressed-levelsBytelen, buf.Bytes()[levelsBytelen:]); p.err != nil {
   	        return false
           }
           pagebuf = memory.NewBufferBytes(data)
   } else {
           io.ReadFull(p.r, buf.Bytes())
           pagebuf = buf
           pagebuf.Retain()
   }
   ```
   
   And then use pagebuf to construct the `DataPageV2`. That should be sufficient, it'll still make a new buffer for each call to `Next` and allow the defer to release the memory, but it allows us to avoid the copy in this case by just passing the buffer to the Page safely.
   
   Does all of that make sense? I kind of rambled a bit but I hope it all made sense and came across well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] minyoung commented on a diff in pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

minyoung commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r865467660


##########
go/parquet/file/page_reader.go:
##########
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
 			// we don't know this page type, we're allowed to skip non-data pages
 			continue
 		}
-
-		p.buf = memory.NewResizableBuffer(p.mem)

Review Comment:
   Since a new `memory.Buffer` was created at the end here, I opted to create (and release) a new buffer at the start of the function instead.
   
   Alternative fix would be:
   * remove this, so that the buffer is reused across calls to `serializedPageReader.Next` (is this safe?)
   * add a `Release` method to the `PageReader` interface
   * have the `primitiveRecordReader` release the `PageReader`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] zeroshade commented on a diff in pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

zeroshade commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r870693634


##########
go/parquet/file/page_reader.go:
##########
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
 			// we don't know this page type, we're allowed to skip non-data pages
 			continue
 		}
-
-		p.buf = memory.NewResizableBuffer(p.mem)

Review Comment:
   `NewBufferBytes(buf.Bytes())` just creates a `*memory.Buffer` which references the existing slice, it doesn't copy the contents of the slice. So there's no copying happening.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #13068:
URL: https://github.com/apache/arrow/pull/13068#issuecomment-1126115204

   Benchmark runs are scheduled for baseline = 21de30e881d941ceba5fbe8b8768e80dd0b27721 and contender = bb190bbc345547aefe4aec3e1bc878c645c2cc50. bb190bbc345547aefe4aec3e1bc878c645c2cc50 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/54fc771083084b1d85d53c826bd7d0f3...498eb302ebd4454bb77499a4afc78cca/)
   [Failed :arrow_down:0.08% :arrow_up:0.04%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5f435bb59ede488b9ffbf3fa4ae8d399...2a1ef32e05cd47748269cf63c99105d1/)
   [Failed :arrow_down:0.0% :arrow_up:0.36%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/5aa8f6e382bb463289829fd5cd85051a...a9a0a705d10544c481b3334217c24a66/)
   [Finished :arrow_down:1.34% :arrow_up:0.08%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a465389b462b4cceb1bfe297873357db...d6c31744607140f5910589b41147104e/)
   Buildkite builds:
   [Finished] [`bb190bbc` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/742)
   [Failed] [`bb190bbc` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/739)
   [Finished] [`bb190bbc` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/729)
   [Finished] [`bb190bbc` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/744)
   [Finished] [`21de30e8` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/741)
   [Finished] [`21de30e8` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/738)
   [Failed] [`21de30e8` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/728)
   [Finished] [`21de30e8` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/743)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13068:
URL: https://github.com/apache/arrow/pull/13068#issuecomment-1118031933

   https://issues.apache.org/jira/browse/ARROW-16473


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] minyoung commented on pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

minyoung commented on PR #13068:
URL: https://github.com/apache/arrow/pull/13068#issuecomment-1124286633

   Thanks for the thorough review and comments @zeroshade!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] minyoung commented on a diff in pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

minyoung commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r868254860


##########
go/parquet/file/page_reader.go:
##########
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
 			// we don't know this page type, we're allowed to skip non-data pages
 			continue
 		}
-
-		p.buf = memory.NewResizableBuffer(p.mem)

Review Comment:
   Yes, that makes sense. PR updated.
   
   I just wonder if `memory.NewBufferBytes(buf.Bytes())` would end up copying the `[]byte` or not? Since both `memory.NewBufferBytes` and `buf.Bytes` both use/pass references to the underlying `[]byte` and does not do any `copy`'s



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] minyoung commented on a diff in pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

minyoung commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r866177642


##########
go/parquet/file/page_reader.go:
##########
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
 			// we don't know this page type, we're allowed to skip non-data pages
 			continue
 		}
-
-		p.buf = memory.NewResizableBuffer(p.mem)

Review Comment:
   Well, we can't simply pass the buffer as is to the page because the buffer itself may or may not be compressed. If the page is compressed, this function decompresses the page, and creates a new buffer with the decompressed data. Or are you proposing that we overwrite `buf` with the decompressed data in that case?
   
   Also, if we pass the buffer as is, is there a risk that `io.ReadFull(p.r, p.buf.Bytes())` will modify the data for the 'previous' page? Or should we do what I do now and create a new buffer for each call to `Next` (as we were doing previously but at the end of the function instead).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] zeroshade commented on a diff in pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

zeroshade commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r865950160


##########
go/parquet/file/page_reader.go:
##########
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
 			// we don't know this page type, we're allowed to skip non-data pages
 			continue
 		}
-
-		p.buf = memory.NewResizableBuffer(p.mem)

Review Comment:
   Ah I see what was happening. I passed the raw bytes as a *new* buffer to the page itself. So technically even though the accounting in the memory allocator is now correct, if the allocator isn't a Go allocator this will actually end up using the buffer after it's been freed. if you look at line 596, it calls `memory.NewBufferBytes(data)` to pass the buffer to the datapage itself. `NewBufferBytes` constructs a buffer which wraps around a byte slice, but it doesn't attempt to *free* that slice when it is garbage collected and assumes that the Go garbage collector will handle it. 
   
   This is all fine for a byteslice that is allocated by Go itself, but since this is a buffer coming from the resizable buffer allocated using the `memory.Allocator` it's possible that this memory wasn't allocated by Go at all.
   
   Because data pages were always created using `NewBufferBytes` I don't believe I added a `Release` function to them. At this point I think the safest solution is to add a `Release` to the `DataPage` interface and corresponding structs, ensure it gets called by the page reader when it sets the current page to nil or otherwise, and just pass the buffer as is to the page rather than calling `NewBufferBytes`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] minyoung commented on a diff in pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

minyoung commented on code in PR #13068:
URL: https://github.com/apache/arrow/pull/13068#discussion_r868254860


##########
go/parquet/file/page_reader.go:
##########
@@ -611,8 +610,6 @@ func (p *serializedPageReader) Next() bool {
 			// we don't know this page type, we're allowed to skip non-data pages
 			continue
 		}
-
-		p.buf = memory.NewResizableBuffer(p.mem)

Review Comment:
   Yes, that makes sense. PR rebased and updated.
   
   I just wonder if `memory.NewBufferBytes(buf.Bytes())` would end up copying the `[]byte` or not? Since both `memory.NewBufferBytes` and `buf.Bytes` both use/pass references to the underlying `[]byte` and does not do any `copy`'s



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] zeroshade closed pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader

Posted by GitBox <gi...@apache.org>.

zeroshade closed pull request #13068: ARROW-16473: [Go] fixing memory leak in serializedPageReader
URL: https://github.com/apache/arrow/pull/13068


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org