You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/10 17:52:52 UTC

[GitHub] [beam] lostluck opened a new issue, #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash tandrecover Go worker heapdumps

lostluck opened a new issue, #21797:
URL: https://github.com/apache/beam/issues/21797

   ### What would you like to happen?
   
   Ideally the worker binary is resilient to crashes with bundles being independant. Some crashes are unavoidable though, like OOMs and similar. In such events, it's useful to have a heapdump or core dump to examine to see what could be fixed or improved.
   
   The Go SDK container boot launcher should set `GOTRACEBACK=crash` which will cause unix based systems to write out a core dump on unexpected exits (such as OOMs and Similar). Further, like with java, a restarting process should try to find any dump written, and move it to a target location (like the job's designated temp folder).
   
   See https://pkg.go.dev/runtime and https://pkg.go.dev/runtime/debug#SetTraceback for some details.
   
   Complications:
   1. Collecting the core dump.
   Based on the test for such crash dumps, it looks like it may by default get written to StdErr out. https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/runtime/crash_unix_test.go;l=163
   We can't simply write all std err to a file, because for long running tasks that would become extremely large from other debug printouts.
   This should be validated, and default std.err logging changed in this case.
   
   But this could be a misreading. Traceback information is printed (stack traces etc), but the heap dump might still go elsewhere. This needs to be determined, and in particular if docker containers are used, whether the file itself is accessible to the next container started by the VM.
   
   2. Moving the file
   Since the pipeline binary would have the facility to write to the job's temp directory, such copying code would need to live in the harness, (likely in harness/init.go), so it can write the file if available.
   
   Alternatively, there's possibly some configuration that could be done for the runner to manage this instead, but this isn't clear offhand.
   
   Java's approach is contained in [MemoryMonitor](https://github.com/apache/beam/blob/4ffeae4d2b800f2df36d2ea2eab549f2204d5691/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/status/MemoryMonitor.java)
   
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: sdk-go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
damccorm commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1180484317

   I think I found a good way to do this today. If we limit the memory size of the worker with a syscall to set the rlimit, instead of throwing an OOM the process will fail gracefully and return control back to boot.go. As long as we then immediately take a heap dump, we get a good picture of what the heap looks like on failure.
   
   This seems like its about as good as we can do with Go, but it should get us mostly on par with the other SDKs. FWIW, its not totally unlike what Java does [here](https://github.com/apache/beam/blob/262f2b7f91ac879cb8921a3e7d59d0315c9df9c4/sdks/java/container/boot.go#L238), though its definitely not exactly the same either


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
damccorm commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1180564069

   > That doesn't sound awful actually, though it does look like the "Runner Provided Limit" is code that no longer exists, so it's always 70% of machine size or 1GB.
   
   Yep, fortunately that just simplifies the equation for us though.
   
   > gah That root [syscall package](https://pkg.go.dev/syscall) is locked down and deprecated and things should migrate to using packages from [golang.org/x/sys](https://pkg.go.dev/golang.org/x/sys) instead.
   
   Good callout, I hadn't noticed that. Shouldn't be a big deal though. I have this mostly working locally, just need to clean it up and do some more robust testing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
damccorm commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179225711

   I was considering something along those lines, I'm not ready to full commit to that approach yet though. First, I'd like to explore if there's a generalizable way to do this in the boot container that reuses existing code. Basically, I want to validate that we have to make the GCS assumption you mentioned:
   
   > and that's not terribly open, since it makes a GCS assumption.
   
   
   If we can avoid having a separate worker binary execution path just for uploads I'd probably prefer it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] lostluck commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
lostluck commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179227998

   Agreed. The advantage is reducing future changes to the boot loader, since it's the fungible usercode that would write to <insert file system of choice here>, but I don't like the extra startup path. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
damccorm commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179355254

   One thing I should've called out here - taking a heap dump (or any similar operation) is a very disruptive thing that we don't want to be doing unless we really _need_ to. [WriteHeapDump](https://pkg.go.dev/runtime/debug#WriteHeapDump) "suspends the execution of all goroutines until the heap dump is completely written." It can also take a while.
   
   So doing this when we don't need it would be very bad.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] lostluck commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
lostluck commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1180557917

   *gah* That root [syscall package](https://pkg.go.dev/syscall) is locked down and deprecated and things should migrate to using packages from [golang.org/x/sys](https://pkg.go.dev/golang.org/x/sys) instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
damccorm commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179366645

   Oh yikes, when there truly is an OOM, the whole stack panics unrecoverably (including boot.go). I guess that makes sense because its subject to the same memory constraints as its subprocess? Regardless, that means that our only real options are:
   
   1) Try to do something predictive of memory issues
   2) Do nothing
   3) Do something like what I did in #22196 - I don't see a heap dump being super useful in the cases that would be caught by that though, so I'm not inclined to do it.
   4) Do something at the runner level. It looks like maybe python does something like that? I'll look into their approach more next week https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/sdks/python/apache_beam/runners/worker/worker_status.py#L87


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] lostluck commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
lostluck commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179190144

   It occurs to me that probably the "best" right place for this ultimately is in the Container boot.go technically, since that launcher will still be up and could check for /upload something, after worker binary terminates.  I wonder if the java boot.go already does this...?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] lostluck commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
lostluck commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1180553320

   That doesn't sound awful actually, though it does look like the "Runner Provided Limit" is code that no longer exists, so it's always 70% of machine size.
   
   Yeah: [Old code](https://github.com/apache/beam/commit/8370a8d91f2b86ddf2fefe5f755084a3973edc9f#diff-25164be214469c9b01c90b319d4a93e1035f7058d7dfec7ace3f9cea253ed7c7R125) and the fields from [that proto no longer exist](https://github.com/apache/beam/blob/262f2b7f91ac879cb8921a3e7d59d0315c9df9c4/sdks/go/pkg/beam/model/fnexecution_v1/beam_provision_api.pb.go#L135)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] lostluck closed issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
lostluck closed issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps
URL: https://github.com/apache/beam/issues/21797


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
damccorm commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179192961

   Yeah, I was just having that thought - Java doesn't I don't think, but I think that approach makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] lostluck commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
lostluck commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179219729

   Right. The annoying bit here is that with that approach it requires the GCS writing code to be added to the boot container :/, and that's not terribly open, since it makes a GCS assumption.
   
   Big Oof. So it would need to be in the SDK harness somehow...
   
   But that implies we could do something tricky in the harness Init code: an additional optional flag pointing to the local core dump if it exists. Then we re-start the binary with that flag, upload the core dump using the SDK available GCS writing code (or whatever alternative depending on where the job is running), and then return out.
   
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm commented on issue #21797: [Feature Request]: [Go SDK] Set GOTRACEBACK=crash and recover Go worker heapdumps

Posted by GitBox <gi...@apache.org>.
damccorm commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179337648

   As I currently understand it, using the `GOTRACEBACK=crash` is going to be technically impossible. The location of the core dump is dependent on the underlying kernel and we don't have the ability to change that (since the container is mounted read only, and that type of manipulation is almost definitely a no-go in pretty much every context).
   
   A good example of this being hard is Dataflow - on Dataflow, the contents of `/proc/sys/kernel/core_pattern` currently are `/sbin/crash_reporter --user=%P:%s:%u:%g:%f` - this means that any core dumps will be written to the crash reporter (a chrome concept) if one exists. I'm not even sure if that will happen (e.g. can we write to that from the container? Probably not), but regardless we won't be able to read it. Even if we were able to modify the core_pattern file while building our container, I think a solution that is extensible to arbitrary runners is not possible with that approach since it relies on the behavior of that particular operating system. We'd need to have intimate knowledge of the OS/kernel/configuration that we're running on top of _and_ resilient to any changes there (ideally without requiring an SDK upgrade - we shouldn't be coupled to the version of hardware we're running on).
   
   ---------------------------------------------------------------------------------
   
   That leaves a few options as I see it:
   
   1) We could take a heap dump after the process crashes. I'm not sure if that has utility or not, I'd have to examine if any useful info is contained on the heap or if its totally wiped on process completion.
   2) We could upload a heap dump every few minutes (this is probably bad, its expensive and won't necessarily catch the problem)
   3) We could try to build out something like Java has and monitor _something_ and upload a heap dump when that happens.
   4) We can not do this.
   
   In theory (1) or (3) is the best option, but I have no clue what we would actually monitor for (3). Java monitors GC thrashing which doesn't really make sense for Go. I don't know of any compelling signals to use here.
   
   My next steps here:
   
   1. Investigate whether a post-process completion heap dump offers any value.
   2. If that doesn't uncover _significant_ value, consider ways we could do option (3). I'm not super hopeful here.
   3. Based on the results of the investigation, do any of options (1), (3), or (4)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org