Posted to user@spark.apache.org by johnzengspark <jo...@hotmail.com> on 2017/07/29 18:09:21 UTC

Logging in RDD mapToPair of Java Spark application

Hi, All,

Although there are lots of discussions related to logging on this mailing
list, I did not find an answer to my specific question, so I am posting it
here in the hope that it is not a duplicate.

Here is my simplified Java Spark test app:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class SparkJobEntry {
	public static void main(String[] args) {
		// Following line is in stdout from JobTracker UI
		System.out.println("argc=" + args.length);
		
		SparkConf conf = new SparkConf().setAppName("TestSparkApp"); 
		JavaSparkContext sc = new JavaSparkContext(conf);
		JavaRDD<String> fileRDD = sc.textFile(args[0]);
		
		fileRDD.mapToPair(new PairFunction<String, String, String>() {

			private static final long serialVersionUID = 1L;
			
			@Override
			public Tuple2<String, String> call(String input) throws Exception {
				// Following line is not in stdout from JobTracker UI
				System.out.println("This line should be printed in stdout");
				// Other code removed from here to make things simple
				return new Tuple2<String, String>("1", "Testing data");
			}}).saveAsTextFile(args[0] + ".results");
	}
}

What I expected from the JobTracker UI is to see both stdout lines: the first
line, "argc=2", and the second line, "This line should be printed in stdout".
But I only see the first line, which is outside of the 'mapToPair'.  I have
actually verified that my 'mapToPair' is called and that the statements after
the second logging line were executed.  The only question for me is why the
second log line does not show up in the JobTracker UI.

Appreciate your help.

Thanks

John



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Logging-in-RDD-mapToPair-of-Java-Spark-application-tp29007.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Logging in RDD mapToPair of Java Spark application

Posted by ayan guha <gu...@gmail.com>.
Not that I can think of. If you have the Spark History Server running, then it
may be another place to look.


-- 
Best Regards,
Ayan Guha

Re: Logging in RDD mapToPair of Java Spark application

Posted by John Zeng <jo...@hotmail.com>.
Hi, Ayan,


Thanks for the suggestion.  I did that and got the following weird message, even though I enabled log aggregation:


[root@john1 conf]# yarn logs -applicationId application_1501197841826_0013
17/07/30 16:45:06 INFO client.RMProxy: Connecting to ResourceManager at john1.dg/192.168.6.90:8032
/tmp/logs/root/logs/application_1501197841826_0013does not exist.
Log aggregation has not completed or is not enabled.

Any other way to see my logs?

Thanks

John





Re: Logging in RDD mapToPair of Java Spark application

Posted by ayan guha <gu...@gmail.com>.
Hi

As you are using YARN log aggregation, YARN moves all the logs to HDFS
after the application completes.

You can use the following command to get the logs:
yarn logs -applicationId <your application id>



--
Best Regards,
Ayan Guha

Re: Logging in RDD mapToPair of Java Spark application

Posted by John Zeng <jo...@hotmail.com>.
Thanks Riccardo for the valuable info.


Following your guidance, I looked at the Spark UI and figured out that the default log location for executors is 'yarn/container-logs'.  I ran my Spark app again and I can see that a new folder was created for it:


[root@john2 application_1501197841826_0013]# ls -l
total 24
drwx--x--- 2 yarn yarn 4096 Jul 30 10:07 container_1501197841826_0013_01_000001
drwx--x--- 2 yarn yarn 4096 Jul 30 10:08 container_1501197841826_0013_01_000002
drwx--x--- 2 yarn yarn 4096 Jul 30 10:08 container_1501197841826_0013_01_000003
drwx--x--- 2 yarn yarn 4096 Jul 30 10:08 container_1501197841826_0013_02_000001
drwx--x--- 2 yarn yarn 4096 Jul 30 10:08 container_1501197841826_0013_02_000002
drwx--x--- 2 yarn yarn 4096 Jul 30 10:08 container_1501197841826_0013_02_000003

But when I tried to look into their contents, they were gone and there were no files at all in the same place:

[root@john2 application_1501197841826_0013]# vi container_1501197841826_0013_*
[root@john2 application_1501197841826_0013]# ls -l
total 0
[root@john2 application_1501197841826_0013]# pwd
/yarn/container-logs/application_1501197841826_0013

I believe Spark moves these logs to a different place.  But where are they?

Thanks

John






Re: Logging in RDD mapToPair of Java Spark application

Posted by Riccardo Ferrari <fe...@gmail.com>.
Hi John,

The reason you don't see the second sysout line is that it is executed in a
different JVM (i.e. driver vs. executor). The second sysout line should be
available through the executor logs. Check the Executors tab in the Spark UI.
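
For example, one common pattern is to use a log4j Logger inside the function
instead of System.out, so the messages land in the executor's log with a
timestamp and level. A minimal sketch (the class name and message are made up,
and it assumes the log4j 1.x that Spark bundles by default):

import org.apache.log4j.Logger;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Hypothetical top-level replacement for the anonymous PairFunction.
// Looking up the Logger inside call() avoids serializing it with the closure;
// getLogger returns a cached instance, so the lookup is cheap.
public class LoggingPairFunction implements PairFunction<String, String, String> {
	private static final long serialVersionUID = 1L;

	@Override
	public Tuple2<String, String> call(String input) throws Exception {
		Logger log = Logger.getLogger(LoggingPairFunction.class);
		// Written to the executor's log files (stderr by default), not the driver's stdout.
		log.info("Processing record: " + input);
		return new Tuple2<String, String>("1", "Testing data");
	}
}

You would then pass new LoggingPairFunction() to mapToPair instead of the
anonymous class, and read the output under the Executors tab or from the
per-container log files.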

There are alternative approaches to centralizing logs; however, it really
depends on what your requirements are.
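
As one concrete example (a sketch only, assuming Spark 2.x, where named
LongAccumulators are available; the class name is made up), you can surface
per-record information back at the driver with an accumulator instead of
logging on the executors. Named accumulators also show up on the stage detail
pages of the Spark UI:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.util.LongAccumulator;
import scala.Tuple2;

public class AccumulatorExample {
	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("TestSparkApp");
		JavaSparkContext sc = new JavaSparkContext(conf);

		// Named accumulators are visible on the stage detail pages of the Spark UI.
		final LongAccumulator processed = sc.sc().longAccumulator("records processed");

		JavaPairRDD<String, String> pairs = sc.textFile(args[0])
			.mapToPair(new PairFunction<String, String, String>() {
				private static final long serialVersionUID = 1L;

				@Override
				public Tuple2<String, String> call(String input) throws Exception {
					processed.add(1);  // updated on the executors
					return new Tuple2<String, String>("1", "Testing data");
				}
			});

		pairs.saveAsTextFile(args[0] + ".results");

		// Runs on the driver, so this line appears in the same stdout as "argc=2".
		System.out.println("records processed = " + processed.value());

		sc.stop();
	}
}

This does not replace per-record logging, but it is often enough when all you
want to confirm is that the function ran and how many records it processed.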

Hope it helps,
