Posted to user@spark.apache.org by Egor Pahomov <pa...@gmail.com> on 2016/01/13 02:01:12 UTC

1.6.0: Standalone application: Getting ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

Hi, I'm moving my infrastructure from 1.5.2 to 1.6.0 and experiencing a
serious issue. I successfully updated the Spark Thrift Server from 1.5.2 to
1.6.0, but I have a standalone application which worked fine with 1.5.2 and
is failing on 1.6.0 with:

NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
  at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
  at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
  at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)

Inside this application I work with a Hive table whose data is in JSON
format.
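
For context, the failing code is roughly the following (a minimal sketch,
not the real application; my_json_table is a made-up table name, and
sparkConf is the config shown at the end of this mail). It is the
HiveContext creation and the first query against the metastore-backed table
that load the JDO/DataNucleus classes:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(sparkConf)
// Initializing HiveContext and touching the metastore-backed table is
// what triggers loading of JDOPersistenceManagerFactory and friends.
val hiveContext = new HiveContext(sc)
val df = hiveContext.sql("SELECT * FROM my_json_table")
df.show()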

When I add

<dependency>
    <groupId>org.datanucleus</groupId>
    <artifactId>datanucleus-core</artifactId>
    <version>4.0.0-release</version>
</dependency>

<dependency>
    <groupId>org.datanucleus</groupId>
    <artifactId>datanucleus-api-jdo</artifactId>
    <version>4.0.0-release</version>
</dependency>

<dependency>
    <groupId>org.datanucleus</groupId>
    <artifactId>datanucleus-rdbms</artifactId>
    <version>3.2.9</version>
</dependency>

I'm getting:

Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence
process has been specified to use a ClassLoaderResolver of name
"datanucleus" yet this has not been found by the DataNucleus plugin
mechanism. Please check your CLASSPATH and plugin specification.
  at org.datanucleus.AbstractNucleusContext.<init>(AbstractNucleusContext.java:102)
  at org.datanucleus.PersistenceNucleusContextImpl.<init>(PersistenceNucleusContextImpl.java:162)

I have CDH 5.5. I build Spark with:

./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.0 -Phive -DskipTests

Then I publish the fat jar locally:

mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
  -Dfile=./spark-assembly.jar -DgroupId=org.spark-project \
  -DartifactId=my-spark-assembly -Dversion=1.6.0-SNAPSHOT -Dpackaging=jar

Then I include a dependency on this fat jar:

<dependency>
    <groupId>org.spark-project</groupId>
    <artifactId>my-spark-assembly</artifactId>
    <version>1.6.0-SNAPSHOT</version>
</dependency>

Then I build my application with the maven-shade-plugin:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <configuration>
        <artifactSet>
            <includes>
                <include>*:*</include>
            </includes>
        </artifactSet>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/services/org.apache.hadoop.fs.FileSystem</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>reference.conf</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                        <resource>log4j.properties</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer"/>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheNoticeResourceTransformer"/>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>

The shade plugin configuration is copy-pasted from the Spark assembly POM.

This workflow worked for 1.5.2 and broke for 1.6.0. If my approach to
creating this standalone application is not good, please recommend
another approach, but spark-submit does not work for me - it is hard for me
to connect it to Oozie.

Any suggestion would be appreciated - I'm stuck.

My Spark config:

lazy val sparkConf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName(appName)
  .set("spark.yarn.queue", "jenkins")
  .set("spark.executor.memory", "10g")
  .set("spark.yarn.executor.memoryOverhead", "2000")
  .set("spark.executor.cores", "3")
  .set("spark.driver.memory", "4g")
  .set("spark.shuffle.io.numConnectionsPerPeer", "5")
  .set("spark.sql.autoBroadcastJoinThreshold", "200483647")
  .set("spark.network.timeout", "1000s")
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=2g")
  .set("spark.driver.maxResultSize", "2g")
  .set("spark.rpc.lookupTimeout", "1000s")
  .set("spark.sql.hive.convertMetastoreParquet", "false")
  .set("spark.kryoserializer.buffer.max", "200m")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.yarn.driver.memoryOverhead", "1000")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.sql.tungsten.enabled", "false")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "100s")
  .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath()))

-- 
Sincerely yours,
Egor Pakhomov

Re: 1.6.0: Standalone application: Getting ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

Posted by Egor Pahomov <pa...@gmail.com>.
My fault, I should have read the documentation more carefully -
http://spark.apache.org/docs/latest/sql-programming-guide.html says
explicitly that I need to add these 3 jars to the classpath if I need them.
They cannot be included in the fat jar, because they are OSGi bundles and
require plugin.xml and META-INF/MANIFEST.MF in the root of the jar. The
problem is that there are 3 of them and each one has its own plugin.xml. You
could include them all in the fat jar if you were able to merge the
plugin.xml files, but currently there is no tool to do so. The
maven-assembly-plugin simply has no such merger, and the maven-shade-plugin
has an XmlAppendingTransformer, but for some reason it doesn't work. And
that is it - you just have to live with the fact that you have a fat jar
with all dependencies except these 3. The good news is that in yarn-client
mode you only need to add them to the classpath of your driver; you do not
have to do addJar(). That's really good news, since it's hard to do
addJar() properly in an Oozie job.
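
Since the driver just needs the three jars on its plain JVM classpath, a
quick sanity check at startup makes the missing-jar case fail fast. This is
only a sketch, not my actual code: the class names for datanucleus-core and
datanucleus-api-jdo come from the stack traces above, while the
datanucleus-rdbms one is the usual store manager class, so treat it as an
assumption:

// Sketch only: verify that one representative class from each DataNucleus
// jar is loadable from the driver classpath before creating the HiveContext.
val representatives = Map(
  "datanucleus-core"    -> "org.datanucleus.exceptions.NucleusUserException",
  "datanucleus-api-jdo" -> "org.datanucleus.api.jdo.JDOPersistenceManagerFactory",
  "datanucleus-rdbms"   -> "org.datanucleus.store.rdbms.RDBMSStoreManager" // assumed class name
)
representatives.foreach { case (jar, className) =>
  try Class.forName(className)
  catch {
    case _: ClassNotFoundException =>
      sys.error(s"$jar is not on the driver classpath (could not load $className)")
  }
}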



-- 


Sincerely yours,
Egor Pakhomov