Posted to user@spark.apache.org by Andrew Davidson <ae...@ucsc.edu.INVALID> on 2022/01/20 22:32:39 UTC

How to configure log4j in pyspark to get log level, file name, and line number

Hi

When I use Python logging for my unit tests, I am able to control the output format. I get the log level, the file name and line number, then the message:

[INFO testEstimatedScalingFactors.py:166 - test_B_convertCountsToInts()] BEGIN
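For reference, the Python side is configured along these lines (a minimal sketch; the format string is reconstructed from the example above rather than copied from my test code):

    import logging

    # produces entries like "[INFO <file>.py:<line> - <function>()] <message>"
    logging.basicConfig(
        level=logging.INFO,
        format="[%(levelname)s %(filename)s:%(lineno)d - %(funcName)s()] %(message)s" )

    logger = logging.getLogger( __name__ )
    logger.info( "BEGIN" )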

In my Spark driver I am able to get the log4j logger:

        spark = SparkSession\
                    .builder\
                    .appName("estimatedScalingFactors")\
                    .getOrCreate()

        #
        # https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e
        # initialize  logger for yarn cluster logs
        #
        log4jLogger = spark.sparkContext._jvm.org.apache.log4j
        logger = log4jLogger.LogManager.getLogger(__name__)

However, it only outputs the message. As a hack I have been adding the function names to the message.
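As far as I can tell, the format this logger emits is controlled by the PatternLayout of the appender in log4j.properties on the cluster, not by anything I can set from Python. A sketch along the lines of the log4j 1.x template Spark ships with ($SPARK_HOME/conf/log4j.properties.template; the pattern below is illustrative, not my actual cluster config):

    # console appender; %d = timestamp, %p = level, %c = logger name, %m = message
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

Note that the %F (file) and %L (line) conversion characters would report the Java call site, i.e. the Py4J gateway, not my Python file and line, so a pattern change alone cannot recover the Python location.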



I wonder if this is because of the way I make my Python code available. When I submit my job using

    $ gcloud dataproc jobs submit pyspark

I pass my Python code in a zip file:

    --py-files ${extraPkg}

I use level WARN because the driver INFO logs are very verbose.


###############################################################################

from functools import reduce
from operator import add
from pyspark.sql.functions import col

def rowSums( self, countsSparkDF, columnNames ):
    self.logger.warn( "rowSums BEGIN" )

    # https://stackoverflow.com/a/54283997/4586180
    # fill nulls with 0, then sum the named columns into a new "rowSum" column
    retDF = countsSparkDF.na.fill( 0 ).withColumn( "rowSum", reduce( add, [col( x ) for x in columnNames] ) )

    # note: count() triggers a full Spark job just to log the shape
    self.logger.warn( "rowSums retDF numRows:{} numCols:{}"
                         .format( retDF.count(), len( retDF.columns ) ) )

    self.logger.warn( "rowSums END\n" )
    return retDF
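For what it's worth, the reduce( add, ... ) idiom from the Stack Overflow answer just builds a single Column expression. A toy example (made-up data, existing spark session assumed):

    from functools import reduce
    from operator import add
    from pyspark.sql.functions import col

    df = spark.createDataFrame( [(1, 2, None), (4, 5, 6)], ["c1", "c2", "c3"] )
    rowSum = reduce( add, [col( x ) for x in df.columns] )  # col("c1") + col("c2") + col("c3")
    df.na.fill( 0 ).withColumn( "rowSum", rowSum ).show()
    # rowSum is 3 for the first row (the null filled with 0) and 15 for the second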

kind regards

Andy

Re: How to configure log4j in pyspark to get log level, file name, and line number

Posted by Andrew Davidson <ae...@ucsc.edu.INVALID>.
Interesting. I noticed that my driver log messages have a timestamp, level, and logger name, but a "?" where the line number should be. However, log messages from my other Python files contain only the message. All of my Python code is in a single zip file; the zip file is a job submit argument.

2022-01-21 19:45:02 WARN  __main__:? - sparkConfig: ('spark.sql.cbo.enabled', 'true')
2022-01-21 19:48:34 WARN  __main__:? - readsSparkDF.rdd.getNumPartitions():1698
__init__ BEGIN
__init__ END
run BEGIN
run rawCountsSparkDF numRows:5387495 numCols:10409

My guess is that somehow I need to change the way log4j is configured on the workers? (I believe log4j prints "?" when it cannot work out the call site, which would make sense if the call arrives through the Py4J gateway.)
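If so, I believe the usual mechanism with Spark's log4j 1.x is to ship a custom log4j.properties and point the driver and executor JVMs at it with -Dlog4j.configuration. Roughly (myJob.py is a placeholder, and the exact --properties escaping is from memory, so it may need adjustment):

    gcloud dataproc jobs submit pyspark myJob.py \
        --py-files ${extraPkg} \
        --files log4j.properties \
        --properties spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties,spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties

That would change the layout everywhere, though per the earlier caveat it still cannot recover Python line numbers.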

Kind regards

Andy
