Posted to user@beam.apache.org by Rion Williams <ri...@gmail.com> on 2020/10/27 18:22:04 UTC

Transform Logging Issues with Spark/Dataproc in GCP

+user for visibility

Begin forwarded message:

> From: Rion Williams <ri...@gmail.com>
> Date: October 27, 2020 at 12:28:37 PM CDT
> To: dev@beam.apache.org
> Subject: Transform Logging Issues with Spark/Dataproc in GCP
> 
> Hi all,
> 
> Recently, I deployed a very simple Apache Beam pipeline to get some insight into how it behaves when running on Dataproc as opposed to my local machine. I quickly realized that after running the job, none of the DoFn- or transform-level logging appeared in the job logs in the Google Cloud Console as I would have expected, and I'm not entirely sure what might be missing.
> 
> All of the high-level logging messages are emitted as expected:
> 
> // This works: the message shows up in the Dataproc job log as expected
> log.info("Testing logging operations")
> 
> pipeline
>     .apply(Create.of(...))
>     .apply(ParDo.of(LoggingDoFn))
> 
> The LoggingDoFn here is a very basic DoFn that logs each of the values it encounters, as seen below:
> 
> object LoggingDoFn : DoFn<String, ...>() {
>     private val log = LoggerFactory.getLogger(LoggingDoFn::class.java)
>     
>     @ProcessElement
>     fun processElement(c: ProcessContext) {
>         // This is never emitted within the logs
>         log.info("Attempting to parse ${c.element()}")
>     }
> }
> 
> As detailed in the comments, I can see log messages from outside of the processElement() calls, but nothing from inside them (presumably because those calls are executed by the Spark runner's workers rather than the driver). Is there a way to easily surface those messages from within the transform as well?
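> 
> As a sanity check, I'm also considering adding a Beam metrics counter inside the DoFn so I can confirm elements are actually being processed even if the log output never surfaces. This is just a rough sketch that I haven't verified on Dataproc, and the String output type and counter name are purely for illustration:
> 
> import org.apache.beam.sdk.metrics.Metrics
> import org.apache.beam.sdk.transforms.DoFn
> import org.apache.beam.sdk.transforms.DoFn.ProcessElement
> import org.slf4j.LoggerFactory
> 
> object CountingLoggingDoFn : DoFn<String, String>() {
>     private val log = LoggerFactory.getLogger(CountingLoggingDoFn::class.java)
> 
>     // Counter values are reported back through the runner, so they should be
>     // queryable from the PipelineResult even if worker-side log output isn't visible.
>     private val elementsSeen = Metrics.counter(CountingLoggingDoFn::class.java, "elementsSeen")
> 
>     @ProcessElement
>     fun processElement(c: ProcessContext) {
>         elementsSeen.inc()
>         log.info("Attempting to parse ${c.element()}")
>         c.output(c.element())
>     }
> }
> 
> After pipeline.run().waitUntilFinish(), the counter should be queryable through the PipelineResult's metrics(), which would at least tell me whether elements are flowing through the transform.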
> 
> When viewing the logs related to this job, we can see the higher-level logging present, but no mention of an "Attempting to parse ..." message from the DoFn:
> 
> [screenshot of the Dataproc job logs omitted from the plain-text archive]
> 
> The job itself is being executed by the following gcloud command, which has the driver log levels explicitly defined, but perhaps there's another level of logging or configuration that needs to be added:
> 
> gcloud dataproc jobs submit spark --jar=gs://some_bucket/deployment/example.jar --project example-project --cluster example-cluster --region us-example --driver-log-levels com.example=DEBUG -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
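> 
> My working assumption (which may well be wrong) is that --driver-log-levels only affects the driver process, not the Spark executors where processElement() actually runs, so I've been wondering whether shipping a log4j configuration to the executors would help. A rough, untested variant of the command above might look like the following, where the log4j.properties file and the property value are just placeholders:
> 
> gcloud dataproc jobs submit spark \
>     --jar=gs://some_bucket/deployment/example.jar \
>     --project example-project \
>     --cluster example-cluster \
>     --region us-example \
>     --driver-log-levels com.example=DEBUG \
>     --files gs://some_bucket/deployment/log4j.properties \
>     --properties spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
>     -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
> 
> The idea would be for log4j.properties to set something like log4j.logger.com.example=DEBUG for the worker-side classes, but I haven't confirmed whether those executor logs would then show up in the Cloud Console or only in the YARN container logs.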
> 
> To summarize, log messages are not being emitted to the Google Cloud Console for work that would generally be handled by the Spark runner itself (e.g. processElement()). I'm unsure whether it's a configuration-related issue or something else entirely.
> 
> Any advice would be appreciated, and I'd be glad to provide additional details however I can.
> 
> Thanks,
> 
> Rion