Posted to user@spark.apache.org by sp...@seznam.cz on 2014/12/05 21:25:04 UTC

Including data nucleus tools

Hi all,

  I have created an assembly jar from the 1.2 snapshot source by running [1], 
which sets the correct version of hadoop for our cluster and uses the hive 
profile. I have also written a relatively simple test program which starts by 
reading data from parquet using a hive context. I compiled the code against 
the assembly jar created and then submitted it on the cluster using [2]. The 
job fails in its early stage, on creating the HiveContext itself. The 
important part of the stack trace is [3].

  Could some of you please explain what is wrong and how it should be fixed? 
I have found only SPARK-4532 
(https://issues.apache.org/jira/browse/SPARK-4532) when looking for 
something related. The fix for that bug is already merged in the source I 
used, so it is ruled out...

  Thanks for the help

  Jakub

[1] ./sbt/sbt -Dhadoop.version=2.3.0-cdh5.1.3 -Pyarn -Phive assembly/assembly

[2] ./bin/spark-submit --num-executors 200 --master yarn-cluster --conf spark.yarn.jar=assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop2.3.0-cdh5.1.3.jar --class org.apache.spark.mllib.CreateGuidDomainDictionary root-0.1.jar ...some-args-here

[3]
14/12/05 20:28:15 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient)
Exception in thread "Driver" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate
...
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
...
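
For reference, a minimal sketch of a test program of this shape (the class 
name is the one from [2]; the input path and query here are illustrative, not 
the real ones):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CreateGuidDomainDictionary {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CreateGuidDomainDictionary")
    val sc = new SparkContext(conf)
    // The job dies right here, while the HiveContext itself is constructed;
    // no parquet data has been read yet.
    val hive = new HiveContext(sc)
    // Illustrative input path; the real one is cluster-specific.
    val records = hive.parquetFile("/data/input.parquet")
    records.registerTempTable("records")
    hive.sql("SELECT COUNT(*) FROM records").collect().foreach(println)
    sc.stop()
  }
}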

Re: Including data nucleus tools

Posted by Jakub Dubovsky <sp...@seznam.cz>.
Hi DB,

  I cherry-picked the commit into branch-1.2 and it solved the problem. The 
patch works, but it had some unfinished bits and pieces around it and was 
therefore reverted, being late in the release process.

  Jakub

------
"Just out of my curiosity. Do you manually apply this patch and see if
this can actually resolve the issue? It seems that it was merged at
some point, but reverted due to that it causes some stability issue.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sat, Dec 13, 2014 at 7:11 AM, <sp...@seznam.cz> wrote:
> So to answer my own question. It is a bug and there is an unmerged PR for that
> already.
>
> https://issues.apache.org/jira/browse/SPARK-2624
> https://github.com/apache/spark/pull/3238
>
> Jakub
>
> ---------- Original message ----------
> From: spark.dubovsky.jakub@seznam.cz
> To: spark.dubovsky.jakub@seznam.cz
> Date: 12. 12. 2014 15:26:35
>
>
> Subject: Re: Including data nucleus tools
>
>
> Hi,
>
> I had time to try it again. I submitted my app by the same command with
> these additional options:
>
> --jars
> lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,lib/datanucleus-rdbms-3.2.9.jar
>
> Now the app successfully creates the hive context. So my question remains: Is
> the "classpath entries" list from the sparkUI the same classpath as mentioned
> in the submit script message?
>
> "Spark assembly has been built with Hive, including Datanucleus jars on
> classpath"
>
> If so, then why does the script fail to really include the datanucleus jars
> on the classpath? I found no bug about this in jira. Or is there a way that
> particular yarn/os settings on our cluster override this?
>
> Thanks in advance
>
> Jakub
>
> ---------- Original message ----------
> From: spark.dubovsky.jakub@seznam.cz
> To: Michael Armbrust <mi...@databricks.com>
> Date: 7. 12. 2014 3:02:33
> Subject: Re: Including data nucleus tools
>
>
> Next try. I copied the whole dist directory created by the make-distribution
> script to the cluster, not just the assembly jar. Then I used
>
> ./bin/spark-submit --num-executors 200 --master yarn-cluster --class
> org.apache.spark.mllib.CreateGuidDomainDictionary ../spark/root-0.1.jar
> ${args}
>
> ...to run the app again. The startup scripts printed this message:
>
> "Spark assembly has been built with Hive, including Datanucleus jars on
> classpath"
>
> ...so I thought I was finally there. But the job started and failed on the
> same ClassNotFound exception as before. Is the "classpath" from the script
> message just the classpath of the driver? Or is it the same classpath which
> is affected by the --jars option? I was trying to find out from the scripts
> but I was not able to find where the --jars option is processed.
>
> thanks
>
> ---------- Original message ----------
> From: Michael Armbrust <mi...@databricks.com>
> To: spark.dubovsky.jakub@seznam.cz
> Date: 6. 12. 2014 20:39:13
> Subject: Re: Including data nucleus tools
>
>
> On Sat, Dec 6, 2014 at 5:53 AM, <sp...@seznam.cz> wrote:
>
> Bonus question: Should the class
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of the assembly?
> Because it is not in the jar now.
>
>
> No, these jars cannot be put into the assembly because they have extra
> metadata files that live in the same location (so if you put them all in an
> assembly they overwrite each other). This metadata is used in discovery.
> Instead they must be manually put on the classpath in their original form
> (usually using --jars).

Re: Including data nucleus tools

Posted by DB Tsai <db...@dbtsai.com>.
Just out of my curiosity: could you manually apply this patch and see if
it actually resolves the issue? It seems that it was merged at
some point, but reverted because it caused some stability issues.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sat, Dec 13, 2014 at 7:11 AM,  <sp...@seznam.cz> wrote:
> So to answer my own question. It is a bug and there is an unmerged PR for that
> already.
>
> https://issues.apache.org/jira/browse/SPARK-2624
> https://github.com/apache/spark/pull/3238
>
> Jakub
>
> ---------- Original message ----------
> From: spark.dubovsky.jakub@seznam.cz
> To: spark.dubovsky.jakub@seznam.cz
> Date: 12. 12. 2014 15:26:35
>
>
> Subject: Re: Including data nucleus tools
>
>
> Hi,
>
>   I had time to try it again. I submitted my app by the same command with
> these additional options:
>
>   --jars
> lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,lib/datanucleus-rdbms-3.2.9.jar
>
>   Now the app successfully creates the hive context. So my question remains: Is
> the "classpath entries" list from the sparkUI the same classpath as mentioned
> in the submit script message?
>
> "Spark assembly has been built with Hive, including Datanucleus jars on
> classpath"
>
>   If so, then why does the script fail to really include the datanucleus jars
> on the classpath? I found no bug about this in jira. Or is there a way that
> particular yarn/os settings on our cluster override this?
>
>   Thanks in advance
>
>   Jakub
>
> ---------- Original message ----------
> From: spark.dubovsky.jakub@seznam.cz
> To: Michael Armbrust <mi...@databricks.com>
> Date: 7. 12. 2014 3:02:33
> Subject: Re: Including data nucleus tools
>
>
> Next try. I copied the whole dist directory created by the make-distribution
> script to the cluster, not just the assembly jar. Then I used
>
> ./bin/spark-submit --num-executors 200 --master yarn-cluster --class
> org.apache.spark.mllib.CreateGuidDomainDictionary ../spark/root-0.1.jar
> ${args}
>
>  ...to run the app again. The startup scripts printed this message:
>
> "Spark assembly has been built with Hive, including Datanucleus jars on
> classpath"
>
>   ...so I thought I was finally there. But the job started and failed on the
> same ClassNotFound exception as before. Is the "classpath" from the script
> message just the classpath of the driver? Or is it the same classpath which
> is affected by the --jars option? I was trying to find out from the scripts
> but I was not able to find where the --jars option is processed.
>
>   thanks
>
> ---------- Original message ----------
> From: Michael Armbrust <mi...@databricks.com>
> To: spark.dubovsky.jakub@seznam.cz
> Date: 6. 12. 2014 20:39:13
> Subject: Re: Including data nucleus tools
>
>
> On Sat, Dec 6, 2014 at 5:53 AM, <sp...@seznam.cz> wrote:
>
> Bonus question: Should the class
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of the assembly?
> Because it is not in the jar now.
>
>
> No, these jars cannot be put into the assembly because they have extra
> metadata files that live in the same location (so if you put them all in an
> assembly they overwrite each other). This metadata is used in discovery.
> Instead they must be manually put on the classpath in their original form
> (usually using --jars).



Re: Including data nucleus tools

Posted by sp...@seznam.cz.
So to answer my own question. It is a bug and there is an unmerged PR for that 
already.

https://issues.apache.org/jira/browse/SPARK-2624
https://github.com/apache/spark/pull/3238

Jakub


---------- Original message ----------
From: spark.dubovsky.jakub@seznam.cz
To: spark.dubovsky.jakub@seznam.cz
Date: 12. 12. 2014 15:26:35
Subject: Re: Including data nucleus tools

"
Hi,

  I had time to try it again. I submitted my app by the same command with 
these additional options:

  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,lib/datanucleus-rdbms-3.2.9.jar

  Now the app successfully creates the hive context. So my question remains: Is 
the "classpath entries" list from the sparkUI the same classpath as mentioned 
in the submit script message?

"Spark assembly has been built with Hive, including Datanucleus jars on 
classpath"

  If so, then why does the script fail to really include the datanucleus jars 
on the classpath? I found no bug about this in jira. Or is there a way that 
particular yarn/os settings on our cluster override this?

  Thanks in advance

  Jakub


---------- Original message ----------
From: spark.dubovsky.jakub@seznam.cz
To: Michael Armbrust <mi...@databricks.com>
Date: 7. 12. 2014 3:02:33
Subject: Re: Including data nucleus tools

"
Next try. I copied the whole dist directory created by the make-distribution 
script to the cluster, not just the assembly jar. Then I used

./bin/spark-submit --num-executors 200 --master yarn-cluster --class org.apache.spark.mllib.CreateGuidDomainDictionary ../spark/root-0.1.jar ${args}

 ...to run the app again. The startup scripts printed this message:

"Spark assembly has been built with Hive, including Datanucleus jars on 
classpath"

  ...so I thought I was finally there. But the job started and failed on the 
same ClassNotFound exception as before. Is the "classpath" from the script 
message just the classpath of the driver? Or is it the same classpath which is 
affected by the --jars option? I was trying to find out from the scripts but I 
was not able to find where the --jars option is processed.

  thanks


---------- Original message ----------
From: Michael Armbrust <mi...@databricks.com>
To: spark.dubovsky.jakub@seznam.cz
Date: 6. 12. 2014 20:39:13
Subject: Re: Including data nucleus tools

"



On Sat, Dec 6, 2014 at 5:53 AM, <spark.dubovsky.jakub@seznam.cz
(mailto:/skin/default/img/empty.gif)> wrote:"
Bonus question: Should the class org.datanucleus.api.jdo.JDOPersistenceManagerFactory 
be part of the assembly? Because it is not in the jar now.

"



No these jars cannot be put into the assembly because they have extra 
metadata files that live in the same location (so if you put them all in an 
assembly they overrwrite each other).  This metadata is used in discovery.  
Instead they must be manually put on the classpath in their original form 
(usually using --jars). 



 
"
"
"

Re: Including data nucleus tools

Posted by sp...@seznam.cz.
Hi,

  I had time to try it again. I submitted my app by the same command with 
these additional options:

  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,lib/datanucleus-rdbms-3.2.9.jar
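
For completeness, the full submit command is then the previous one with those 
jars added (a sketch; the lib/ paths assume the dist directory layout copied 
to the cluster):

./bin/spark-submit --num-executors 200 --master yarn-cluster --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,lib/datanucleus-rdbms-3.2.9.jar --class org.apache.spark.mllib.CreateGuidDomainDictionary ../spark/root-0.1.jar ${args}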

  Now the app successfully creates the hive context. So my question remains: Is 
the "classpath entries" list from the sparkUI the same classpath as mentioned 
in the submit script message?

"Spark assembly has been built with Hive, including Datanucleus jars on 
classpath"

  If so, then why does the script fail to really include the datanucleus jars 
on the classpath? I found no bug about this in jira. Or is there a way that 
particular yarn/os settings on our cluster override this?

  Thanks in advance

  Jakub


---------- Original message ----------
From: spark.dubovsky.jakub@seznam.cz
To: Michael Armbrust <mi...@databricks.com>
Date: 7. 12. 2014 3:02:33
Subject: Re: Including data nucleus tools

"
Next try. I copied the whole dist directory created by the make-distribution 
script to the cluster, not just the assembly jar. Then I used

./bin/spark-submit --num-executors 200 --master yarn-cluster --class org.apache.spark.mllib.CreateGuidDomainDictionary ../spark/root-0.1.jar ${args}

 ...to run the app again. The startup scripts printed this message:

"Spark assembly has been built with Hive, including Datanucleus jars on 
classpath"

  ...so I thought I was finally there. But the job started and failed on the 
same ClassNotFound exception as before. Is the "classpath" from the script 
message just the classpath of the driver? Or is it the same classpath which is 
affected by the --jars option? I was trying to find out from the scripts but I 
was not able to find where the --jars option is processed.

  thanks


---------- Original message ----------
From: Michael Armbrust <mi...@databricks.com>
To: spark.dubovsky.jakub@seznam.cz
Date: 6. 12. 2014 20:39:13
Subject: Re: Including data nucleus tools

"



On Sat, Dec 6, 2014 at 5:53 AM, <spark.dubovsky.jakub@seznam.cz
(mailto:/skin/default/img/empty.gif)> wrote:"
Bonus question: Should the class org.datanucleus.api.jdo.JDOPersistenceManagerFactory 
be part of the assembly? Because it is not in the jar now.

"



No these jars cannot be put into the assembly because they have extra 
metadata files that live in the same location (so if you put them all in an 
assembly they overrwrite each other).  This metadata is used in discovery.  
Instead they must be manually put on the classpath in their original form 
(usually using --jars). 



 
"
"

Re: Including data nucleus tools

Posted by sp...@seznam.cz.
Next try. I copied the whole dist directory created by the make-distribution 
script to the cluster, not just the assembly jar. Then I used

./bin/spark-submit --num-executors 200 --master yarn-cluster --class org.apache.spark.mllib.CreateGuidDomainDictionary ../spark/root-0.1.jar ${args}

 ...to run the app again. The startup scripts printed this message:

"Spark assembly has been built with Hive, including Datanucleus jars on 
classpath"

  ...so I thought I was finally there. But the job started and failed on the 
same ClassNotFound exception as before. Is the "classpath" from the script 
message just the classpath of the driver? Or is it the same classpath which is 
affected by the --jars option? I was trying to find out from the scripts but I 
was not able to find where the --jars option is processed.

  thanks


---------- Original message ----------
From: Michael Armbrust <mi...@databricks.com>
To: spark.dubovsky.jakub@seznam.cz
Date: 6. 12. 2014 20:39:13
Subject: Re: Including data nucleus tools

"



On Sat, Dec 6, 2014 at 5:53 AM, <spark.dubovsky.jakub@seznam.cz
(mailto:/skin/default/img/empty.gif)> wrote:"
Bonus question: Should the class org.datanucleus.api.jdo.JDOPersistenceManagerFactory 
be part of the assembly? Because it is not in the jar now.

"



No these jars cannot be put into the assembly because they have extra 
metadata files that live in the same location (so if you put them all in an 
assembly they overrwrite each other).  This metadata is used in discovery.  
Instead they must be manually put on the classpath in their original form 
(usually using --jars). 



 
"

Re: Including data nucleus tools

Posted by Michael Armbrust <mi...@databricks.com>.
On Sat, Dec 6, 2014 at 5:53 AM, <sp...@seznam.cz> wrote:
>
> Bonus question: Should the class
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of the assembly?
> Because it is not in the jar now.
>

No, these jars cannot be put into the assembly because they have extra
metadata files that live in the same location (so if you put them all in an
assembly they overwrite each other). This metadata is used in discovery.
Instead they must be manually put on the classpath in their original form
(usually using --jars).
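
To make the clash concrete: each DataNucleus jar ships its own plugin.xml at 
the jar root, which is the discovery metadata meant above, so a merged 
assembly can keep at most one of them. A minimal sketch that shows this, 
assuming the three jars sit in lib/ of the distribution:

import java.util.jar.JarFile

object CheckDataNucleusMetadata {
  def main(args: Array[String]): Unit = {
    // Paths are assumptions; adjust to wherever the jars actually live.
    val jars = Seq(
      "lib/datanucleus-core-3.2.10.jar",
      "lib/datanucleus-api-jdo-3.2.6.jar",
      "lib/datanucleus-rdbms-3.2.9.jar")
    for (path <- jars) {
      val jar = new JarFile(path)
      try {
        // Every one of these jars carries plugin.xml at the same location,
        // so only one copy would survive a naive assembly merge.
        val present = jar.getEntry("plugin.xml") != null
        println(path + " -> plugin.xml present: " + present)
      } finally jar.close()
    }
  }
}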

Re: Including data nucleus tools

Posted by sp...@seznam.cz.
Hi again,

I have tried to recompile and run this again with a new assembly created by

./make-distribution.sh -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.3 -Pyarn -Phive -DskipTests

It results in exactly the same error. Any other hints?
Bonus question: Should the class org.datanucleus.api.jdo.JDOPersistenceManagerFactory 
be part of the assembly? Because it is not in the jar now.
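
(One way to check, assuming make-distribution puts the assembly under dist/lib:

jar tf dist/lib/spark-assembly-*.jar | grep JDOPersistenceManagerFactory

finds no match, which is what I mean by "not in the jar".)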

  thanks in advance
  Jakub



---------- Original message ----------
From: DB Tsai <db...@dbtsai.com>
To: spark.dubovsky.jakub@seznam.cz
Date: 5. 12. 2014 22:53:32
Subject: Re: Including data nucleus tools

"

Can you try to run the same job using the assembly packaged by make-
distribution as we discussed in the other thread.





Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai




On Fri, Dec 5, 2014 at 12:25 PM, <spark.dubovsky.jakub@seznam.cz> wrote:
Hi all,

  I have created an assembly jar from the 1.2 snapshot source by running [1], 
which sets the correct version of hadoop for our cluster and uses the hive 
profile. I have also written a relatively simple test program which starts by 
reading data from parquet using a hive context. I compiled the code against 
the assembly jar created and then submitted it on the cluster using [2]. The 
job fails in its early stage, on creating the HiveContext itself. The 
important part of the stack trace is [3].

  Could some of you please explain what is wrong and how it should be fixed? 
I have found only SPARK-4532 
(https://issues.apache.org/jira/browse/SPARK-4532) when looking for 
something related. The fix for that bug is already merged in the source I 
used, so it is ruled out...

  Thanks for the help

  Jakub

[1] ./sbt/sbt -Dhadoop.version=2.3.0-cdh5.1.3 -Pyarn -Phive assembly/assembly

[2] ./bin/spark-submit --num-executors 200 --master yarn-cluster --conf spark.yarn.jar=assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop2.3.0-cdh5.1.3.jar --class org.apache.spark.mllib.CreateGuidDomainDictionary root-0.1.jar ...some-args-here

[3]
14/12/05 20:28:15 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient)
Exception in thread "Driver" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate
...
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
...

"



"

Re: Including data nucleus tools

Posted by DB Tsai <db...@dbtsai.com>.
Can you try to run the same job using the assembly packaged by
make-distribution, as we discussed in the other thread?


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Fri, Dec 5, 2014 at 12:25 PM, <sp...@seznam.cz> wrote:

> Hi all,
>
>   I have created an assembly jar from the 1.2 snapshot source by running [1],
> which sets the correct version of hadoop for our cluster and uses the hive
> profile. I have also written a relatively simple test program which starts by
> reading data from parquet using a hive context. I compiled the code against
> the assembly jar created and then submitted it on the cluster using [2]. The
> job fails in its early stage, on creating the HiveContext itself. The
> important part of the stack trace is [3].
>
>   Could some of you please explain what is wrong and how it should be
> fixed? I have found only SPARK-4532
> <https://issues.apache.org/jira/browse/SPARK-4532> when looking for
> something related. The fix for that bug is already merged in the source I
> used, so it is ruled out...
>
>   Thanks for the help
>
>   Jakub
>
> [1] ./sbt/sbt -Dhadoop.version=2.3.0-cdh5.1.3 -Pyarn -Phive
> assembly/assembly
>
> [2] ./bin/spark-submit --num-executors 200 --master yarn-cluster --conf
> spark.yarn.jar=assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop2.3.0-cdh5.1.3.jar
> --class org.apache.spark.mllib.CreateGuidDomainDictionary root-0.1.jar
> ...some-args-here
>
> [3]
> 14/12/05 20:28:15 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User class threw exception:
> java.lang.RuntimeException: Unable to instantiate
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient)
> Exception in thread "Driver" java.lang.RuntimeException:
> java.lang.RuntimeException: Unable to instantiate
> ...
> Caused by: java.lang.ClassNotFoundException:
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> ...
>