You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Alonso Isidoro Roman <al...@gmail.com> on 2014/06/23 20:15:40 UTC

about a JavaWordCount example with spark-core_2.10-1.0.0.jar

Hi all,

I am new to Spark, so this is probably a basic question. i want to explore
the possibilities of this fw, concretely using it in conjunction with 3
party libs, like mongodb, for example.

I have been keeping instructions from
http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in order to
connect spark with mongodb. This example is made with
spark-core_2.1.0-0.9.0.jar so first thing i did is to update pom.xml with
latest versions.

This is my pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">

<groupId>com.aironman.spark</groupId>

<artifactId>simple-project</artifactId>

<modelVersion>4.0.0</modelVersion>

<name>Simple Project</name>

<packaging>jar</packaging>

<version>1.0</version>

<repositories>

<repository>

<id>Akka repository</id>

<url>http://repo.akka.io/releases</url>

</repository>

</repositories>

<dependencies>

<dependency> <!-- Spark dependency -->

<groupId>org.apache.spark</groupId>

<artifactId>spark-core_2.10</artifactId>

<version>1.0.0</version>

</dependency>


 <dependency>

<groupId>org.mongodb</groupId>

<artifactId>mongo-hadoop-core</artifactId>

<version>1.0.0</version>

</dependency>


 </dependencies>
</project>

As you can see, super simple pom.xml

And this is the JavaWordCount.java

import java.util.Arrays;

import java.util.Collections;


import org.apache.hadoop.conf.Configuration;

import org.apache.spark.api.java.JavaPairRDD;

import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.api.java.function.FlatMapFunction;

import org.apache.spark.api.java.function.Function2;

import org.apache.spark.api.java.function.PairFunction;

import org.bson.BSONObject;

import org.bson.BasicBSONObject;


import scala.Tuple2;


import com.mongodb.hadoop.MongoOutputFormat;


/***

 * Esta clase se supone que se conecta a un cluster mongodb para ejecutar
una tarea word count por cada palabra almacenada en la bd.

 * el problema es que esta api esta rota, creo. Estoy usando la ultima
version del fw, la 1.0.0. Esto no es spark-streaming, lo suyo seria usar un
ejemplo

 * sobre spark-streaming conectandose a un base mongodb, o usar spark-streaming
junto con spring integration, es decir, conectar spark con un servicio web
que

 * periodicamente alimentaria spark...

 * @author aironman

 *

 */

public class JavaWordCount {



    public static void main(String[] args) {



        JavaSparkContext sc = new JavaSparkContext("local", "Java Word
Count");



        Configuration config = new Configuration();

        config.set("mongo.input.uri", "mongodb:127.0.0.1:27017/beowulf.input
");

        config.set("mongo.output.uri", "mongodb:
127.0.0.1:27017/beowulf.output");





        JavaPairRDD<Object, BSONObject> mongoRDD =
sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class,
Object.class, BSONObject.class);



//         Input contains tuples of (ObjectId, BSONObject)

        JavaRDD<String> words = mongoRDD.flatMap(new
FlatMapFunction<Tuple2<Object,
BSONObject>, String>() {

            @Override

            public Iterable<String> call(Tuple2<Object, BSONObject> arg) {

                Object o = arg._2.get("text");

                if (o instanceof String) {

                    String str = (String) o;

                    str = str.toLowerCase().replaceAll("[.,!?\n]", " ");

                    return Arrays.asList(str.split(" "));

                } else {

                    return Collections.emptyList();

                }

            }

        });

        *//here is an error, The method map(Function<String,R>) in the type
JavaRDD<String> is not applicable for the arguments (new
PairFunction<String,String,Integer>(){})*

        JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String,
String, Integer>() {

            public Tuple2<String, Integer> call(String s) {

                return new Tuple2<>(s, 1);

            }

        });

        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new
Function2<Integer,
Integer, Integer>() {

            public Integer call(Integer i1, Integer i2) {

                return i1 + i2;

            }

        });



        *//another error, The method
map(Function<Tuple2<String,Integer>,R>) in the type
JavaPairRDD<String,Integer> is not applicable for the arguments (new
 //PairFunction<Tuple2<String,Integer>,Object,BSONObject>(){})*


//         Output contains tuples of (null, BSONObject) - ObjectId will be
generated by Mongo driver if null

        JavaPairRDD<Object, BSONObject> save = counts.map(new
PairFunction<Tuple2<String,
Integer>, Object, BSONObject>() {

            @Override

            public Tuple2<Object, BSONObject> call(Tuple2<String, Integer>
tuple) {

                BSONObject bson = new BasicBSONObject();

                bson.put("word", tuple._1);

                bson.put("count", tuple._2);

                return new Tuple2<>(null, bson);

            }

        });



//         Only MongoOutputFormat and config are relevant

        save.saveAsNewAPIHadoopFile("file:/bogus", Object.class, Object.
class, MongoOutputFormat.class, config);

    }



}

It looks like jar hell dependency, isn't it?  can anyone guide or help me?

Another thing, i don t like closures, is it possible to use this fw without
using it?
Another question, are this objects, JavaSparkContext sc, JavaPairRDD<Object,
BSONObject> mongoRDD ThreadSafe? Can i use them as singleton?

Thank you very much and apologizes if the questions are not trending topic
:)

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"

Re: about a JavaWordCount example with spark-core_2.10-1.0.0.jar

Posted by Alonso Isidoro Roman <al...@gmail.com>.

Hi Sheryl, thank you for your patience with me. The problem with that
example is the author is using spark-core_2.10-0.9.0-incubating and i am
trying to do the same with latest version of spark, 1.0.0, so i think that
project it is not a good starting point to develop.

I am going to start from spark-examples_2.10, let s see what i can do.

Thank you very much

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"



2014-06-26 20:38 GMT+02:00 Sheryl John <sh...@gmail.com>:

> Make sure you have the right versions for 'mongo-hadoop-core' and
> 'hadoop-client'. I noticed you've used hadoop-client version 2.2.0
>
> The maven repositories does not have the mongo-hadoop connector for Hadoop
> version 2.2.0, so you have to include the mongo-hadoop connector as an
> unmanaged library (as mentioned in the link
> http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ and example
> code).
> I'm not sure if this will fix the error.
>
>
> On Thu, Jun 26, 2014 at 3:23 AM, Alonso Isidoro Roman <al...@gmail.com>
> wrote:
>
>> Hi Sheryl,
>>
>> first at all, thanks for answering. spark-core_2.10 artifact has as
>> dependency hadoop-client-1.0.4.jar and mongo-hadoop-core-1.0.0.jar has
>> mongo-java-driver-2.7.3.jar.
>>
>> This is my actual pom.xml
>>
>> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
>> http://www.w3.org/2001/XMLSchema-instance"
>>
>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>
>> <groupId>com.aironman.spark</groupId>
>>
>> <artifactId>simple-project</artifactId>
>>
>> <modelVersion>4.0.0</modelVersion>
>>
>> <name>Simple Project</name>
>>
>> <packaging>jar</packaging>
>>
>> <version>1.0</version>
>>
>> <properties>
>>
>> <maven.compiler.source>1.7</maven.compiler.source>
>>
>> <maven.compiler.target>1.7</maven.compiler.target>
>>
>> <hadoop.version>1.0.2</hadoop.version>
>>
>> </properties>
>>
>>
>>  <repositories>
>>
>> <repository>
>>
>> <id>Akka repository</id>
>>
>> <url>http://repo.akka.io/releases</url>
>>
>> </repository>
>>
>> </repositories>
>>
>> <dependencies>
>>
>> <dependency> <!-- Spark dependency -->
>>
>> <groupId>org.apache.spark</groupId>
>>
>> <artifactId>spark-core_2.10</artifactId>
>>
>> <version>1.0.0</version>
>>
>> </dependency>
>>
>> <dependency> <!-- Mongodb dependency -->
>>
>> <groupId>org.mongodb</groupId>
>>
>> <artifactId>mongo-hadoop-core</artifactId>
>>
>> <version>1.0.0</version>
>>
>> </dependency>
>>
>>
>>  <dependency>
>>
>> <groupId>org.apache.hadoop</groupId>
>>
>> <artifactId>hadoop-client</artifactId>
>>
>> <version>2.2.0</version>
>>
>> </dependency>
>>
>> <dependency>
>>
>> <groupId>org.mongodb</groupId>
>>
>> <artifactId>mongo-java-driver</artifactId>
>>
>> <version>2.9.3</version>
>>
>> </dependency>
>>
>> </dependencies>
>>
>> <build>
>>
>> <outputDirectory>target/java-${maven.compiler.source}/classes</
>> outputDirectory>
>>
>> <testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
>> testOutputDirectory>
>>
>> <plugins>
>>
>> <plugin>
>>
>> <groupId>org.apache.maven.plugins</groupId>
>>
>> <artifactId>maven-shade-plugin</artifactId>
>>
>> <version>2.3</version>
>>
>> <configuration>
>>
>>
>>  <shadedArtifactAttached>false</shadedArtifactAttached>
>>
>>
>>  <outputFile>
>> ${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
>> </outputFile>
>>
>> <artifactSet>
>>
>> <includes>
>>
>> <include>*:*</include>
>>
>> </includes>
>>
>> </artifactSet>
>>
>>
>>  <filters>
>>
>> <filter>
>>
>> <artifact>*:*</artifact>
>>
>> <!-- <excludes> <exclude>META-INF/*.SF</exclude>
>> <exclude>META-INF/*.DSA</exclude>
>>
>> <exclude>META-INF/*.RSA</exclude> </excludes> -->
>>
>> </filter>
>>
>> </filters>
>>
>>
>>  </configuration>
>>
>> <executions>
>>
>> <execution>
>>
>> <phase>package</phase>
>>
>> <goals>
>>
>> <goal>shade</goal>
>>
>> </goals>
>>
>> <configuration>
>>
>> <transformers>
>>
>> <transformer
>>
>> implementation=
>> "org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
>>
>> <transformer
>>
>> implementation=
>> "org.apache.maven.plugins.shade.resource.AppendingTransformer">
>>
>> <resource>reference.conf</resource>
>>
>> </transformer>
>>
>> <transformer
>>
>> implementation=
>> "org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
>>
>> <resource>log4j.properties</resource>
>>
>> </transformer>
>>
>> </transformers>
>>
>> </configuration>
>>
>> </execution>
>>
>> </executions>
>>
>> </plugin>
>>
>> </plugins>
>>
>> </build>
>>
>> </project>
>>
>> and the error persists, this is the  spark-submit command`s output
>>
>> MacBook-Pro-Retina-de-Alonso:ConnectorSparkMongo aironman$
>> /Users/aironman/spark-1.0.0/bin/spark-submit --class
>> "com.aironman.spark.utils.JavaWordCount" --master local[4]
>> target/simple-project-1.0.jar
>>
>> 14/06/26 12:05:21 INFO SecurityManager: Using Spark's default log4j
>> profile: org/apache/spark/log4j-defaults.properties
>>
>> 14/06/26 12:05:21 INFO SecurityManager: Changing view acls to: aironman
>>
>> 14/06/26 12:05:21 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(aironman)
>>
>> 14/06/26 12:05:21 INFO Slf4jLogger: Slf4jLogger started
>>
>> 14/06/26 12:05:21 INFO Remoting: Starting remoting
>>
>> 14/06/26 12:05:21 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://spark@192.168.1.35:49681]
>>
>> 14/06/26 12:05:21 INFO Remoting: Remoting now listens on addresses:
>> [akka.tcp://spark@192.168.1.35:49681]
>>
>> 14/06/26 12:05:21 INFO SparkEnv: Registering MapOutputTracker
>>
>> 14/06/26 12:05:21 INFO SparkEnv: Registering BlockManagerMaster
>>
>> 14/06/26 12:05:21 INFO DiskBlockManager: Created local directory at
>> /var/folders/gn/pzkybyfd2g5bpyh47q0pp5nc0000gn/T/spark-local-20140626120521-09d2
>>
>> 14/06/26 12:05:21 INFO MemoryStore: MemoryStore started with capacity
>> 294.9 MB.
>>
>> 14/06/26 12:05:21 INFO ConnectionManager: Bound socket to port 49682 with
>> id = ConnectionManagerId(192.168.1.35,49682)
>>
>> 14/06/26 12:05:21 INFO BlockManagerMaster: Trying to register BlockManager
>>
>> 14/06/26 12:05:21 INFO BlockManagerInfo: Registering block manager
>> 192.168.1.35:49682 with 294.9 MB RAM
>>
>> 14/06/26 12:05:21 INFO BlockManagerMaster: Registered BlockManager
>>
>> 14/06/26 12:05:21 INFO HttpServer: Starting HTTP Server
>>
>> 14/06/26 12:05:21 INFO HttpBroadcast: Broadcast server started at
>> http://192.168.1.35:49683
>>
>> 14/06/26 12:05:21 INFO HttpFileServer: HTTP File server directory is
>> /var/folders/gn/pzkybyfd2g5bpyh47q0pp5nc0000gn/T/spark-e0cff70b-9597-452f-b5db-99aef3245e04
>>
>> 14/06/26 12:05:21 INFO HttpServer: Starting HTTP Server
>>
>> 14/06/26 12:05:21 INFO SparkUI: Started SparkUI at
>> http://192.168.1.35:4040
>>
>> 2014-06-26 12:05:21.915 java[880:1903] Unable to load realm info from
>> SCDynamicStore
>>
>> 14/06/26 12:05:22 INFO SparkContext: Added JAR
>> file:/Users/aironman/Documents/ws-spark/ConnectorSparkMongo/target/simple-project-1.0.jar
>> at http://192.168.1.35:49684/jars/simple-project-1.0.jar with timestamp
>> 1403777122023
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> com/mongodb/hadoop/MongoInputFormat
>>
>> at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:50)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>>
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>>
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> Caused by: java.lang.ClassNotFoundException:
>> com.mongodb.hadoop.MongoInputFormat
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>
>> ... 8 more
>>
>>
>> com.mongodb.hadoop.MongoInputFormat is located in mongo-hadoop-core-1.0.0.jar
>> and theorycally the jar is included in uber jar because i can see in
>> the mvm clean package command output
>>
>>
>> [INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded
>> jar.
>>
>>
>> It looks like the dependency jar is not included in the uber jar, despite
>> the line shown above...
>>
>>
>> Thanks in advance, i really appreciate it
>>
>> Alonso Isidoro Roman.
>>
>> Mis citas preferidas (de hoy) :
>> "Si depurar es el proceso de quitar los errores de software, entonces
>> programar debe ser el proceso de introducirlos..."
>>  -  Edsger Dijkstra
>>
>> My favorite quotes (today):
>> "If debugging is the process of removing software bugs, then programming
>> must be the process of putting ..."
>>   - Edsger Dijkstra
>>
>> "If you pay peanuts you get monkeys"
>>
>>
>>
>> 2014-06-25 18:47 GMT+02:00 Sheryl John <sh...@gmail.com>:
>>
>> Hi Alonso,
>>>
>>> I was able to get Spark working with MongoDB following the above
>>> instructions and used SBT to manage dependencies.
>>> Try adding 'hadoop-client' and 'mongo-java-driver' dependencies in your
>>> pom.xml.
>>>
>>>
>>>
>>> On Tue, Jun 24, 2014 at 8:06 AM, Alonso Isidoro Roman <
>>> alonsoir@gmail.com> wrote:
>>>
>>>> Hi Yana, thanks for the answer, of course you are right and it works!
>>>> no errors, now i am trying to execute it and i am getting an error
>>>>
>>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>>> com/mongodb/hadoop/MongoInputFormat
>>>>
>>>> at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:49)
>>>>
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>
>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>
>>>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>>>>
>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>>>>
>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>
>>>> Caused by: java.lang.ClassNotFoundException:
>>>> com.mongodb.hadoop.MongoInputFormat
>>>>
>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>>
>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>>
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>
>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>>
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>>
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>>
>>>> ... 8 more
>>>>
>>>>
>>>> That class is located in mongo-hadoop-core-1.0.0.jar which is located
>>>> and declared in pom.xml and i am using maven-shade-plugin, this is the
>>>> actual pom.xml:
>>>>
>>>>
>>>>
>>>> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
>>>> http://www.w3.org/2001/XMLSchema-instance"
>>>>
>>>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>>> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>>>
>>>> <groupId>com.aironman.spark</groupId>
>>>>
>>>> <artifactId>simple-project</artifactId>
>>>>
>>>> <modelVersion>4.0.0</modelVersion>
>>>>
>>>> <name>Simple Project</name>
>>>>
>>>> <packaging>jar</packaging>
>>>>
>>>> <version>1.0</version>
>>>>
>>>> <properties>
>>>>
>>>> <maven.compiler.source>1.7</maven.compiler.source>
>>>>
>>>> <maven.compiler.target>1.7</maven.compiler.target>
>>>>
>>>> <hadoop.version>1.0.2</hadoop.version>
>>>>
>>>> </properties>
>>>>
>>>>
>>>> <repositories>
>>>>
>>>> <repository>
>>>>
>>>> <id>Akka repository</id>
>>>>
>>>> <url>http://repo.akka.io/releases</url>
>>>>
>>>> </repository>
>>>>
>>>> </repositories>
>>>>
>>>> <dependencies>
>>>>
>>>> <dependency> <!-- Spark dependency -->
>>>>
>>>> <groupId>org.apache.spark</groupId>
>>>>
>>>> <artifactId>spark-core_2.10</artifactId>
>>>>
>>>> <version>1.0.0</version>
>>>>
>>>> </dependency>
>>>>
>>>>
>>>> <dependency>
>>>>
>>>> <groupId>org.mongodb</groupId>
>>>>
>>>> <artifactId>mongo-hadoop-core</artifactId>
>>>>
>>>> <version>1.0.0</version>
>>>>
>>>> </dependency>
>>>>
>>>>
>>>> </dependencies>
>>>>
>>>> <build>
>>>>
>>>> <outputDirectory>target/java-${maven.compiler.source}/classes</
>>>> outputDirectory>
>>>>
>>>> <testOutputDirectory>target/java-${maven.compiler.source}/test-classes
>>>> </testOutputDirectory>
>>>>
>>>> <plugins>
>>>>
>>>> <plugin>
>>>>
>>>> <groupId>org.apache.maven.plugins</groupId>
>>>>
>>>> <artifactId>maven-shade-plugin</artifactId>
>>>>
>>>> <version>2.3</version>
>>>>
>>>> <configuration>
>>>>
>>>>  <shadedArtifactAttached>false</shadedArtifactAttached>
>>>>
>>>>  <outputFile>
>>>> ${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
>>>> </outputFile>
>>>>
>>>> <artifactSet>
>>>>
>>>> <includes>
>>>>
>>>> <include>*:*</include>
>>>>
>>>> </includes>
>>>>
>>>> </artifactSet>
>>>>
>>>> <!--
>>>>
>>>>  <filters>
>>>>
>>>> <filter>
>>>>
>>>>  <artifact>*:*</artifact>
>>>>
>>>> <excludes>
>>>>
>>>> <exclude>META-INF/*.SF</exclude>
>>>>
>>>>  <exclude>META-INF/*.DSA</exclude>
>>>>
>>>> <exclude>META-INF/*.RSA</exclude>
>>>>
>>>> </excludes>
>>>>
>>>>  </filter>
>>>>
>>>> </filters>
>>>>
>>>>  -->
>>>>
>>>> </configuration>
>>>>
>>>> <executions>
>>>>
>>>> <execution>
>>>>
>>>> <phase>package</phase>
>>>>
>>>> <goals>
>>>>
>>>> <goal>shade</goal>
>>>>
>>>> </goals>
>>>>
>>>> <configuration>
>>>>
>>>> <transformers>
>>>>
>>>> <transformer
>>>>
>>>> implementation=
>>>> "org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"
>>>> />
>>>>
>>>> <transformer
>>>>
>>>> implementation=
>>>> "org.apache.maven.plugins.shade.resource.AppendingTransformer">
>>>>
>>>> <resource>reference.conf</resource>
>>>>
>>>> </transformer>
>>>>
>>>> <transformer
>>>>
>>>> implementation=
>>>> "org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer"
>>>> >
>>>>
>>>> <resource>log4j.properties</resource>
>>>>
>>>> </transformer>
>>>>
>>>> </transformers>
>>>>
>>>> </configuration>
>>>>
>>>> </execution>
>>>>
>>>> </executions>
>>>>
>>>> </plugin>
>>>>
>>>> </plugins>
>>>>
>>>> </build>
>>>>
>>>> </project>
>>>>
>>>> It is confusing me because i can see the output of mvm package:
>>>>
>>>> ...
>>>>
>>>> [INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded
>>>> jar.
>>>>
>>>> ...
>>>> in theory, the uber jar is created with all dependencies but if i open
>>>> the jar i don't see them.
>>>>
>>>> Thanks in advance
>>>>
>>>>
>>>> Alonso Isidoro Roman.
>>>>
>>>> Mis citas preferidas (de hoy) :
>>>> "Si depurar es el proceso de quitar los errores de software, entonces
>>>> programar debe ser el proceso de introducirlos..."
>>>>  -  Edsger Dijkstra
>>>>
>>>> My favorite quotes (today):
>>>> "If debugging is the process of removing software bugs, then
>>>> programming must be the process of putting ..."
>>>>   - Edsger Dijkstra
>>>>
>>>> "If you pay peanuts you get monkeys"
>>>>
>>>>
>>>>
>>>> 2014-06-23 21:27 GMT+02:00 Yana Kadiyska <ya...@gmail.com>:
>>>>
>>>>  One thing I noticed around the place where you get the first error --
>>>>> you are calling words.map instead of words.mapToPair. map produces
>>>>> JavaRDD<R> whereas mapToPair gives you a JavaPairRDD. I don't use the
>>>>> Java APIs myself but it looks to me like you need to check the types
>>>>> more carefully.
>>>>>
>>>>> On Mon, Jun 23, 2014 at 2:15 PM, Alonso Isidoro Roman
>>>>> <al...@gmail.com> wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > I am new to Spark, so this is probably a basic question. i want to
>>>>> explore
>>>>> > the possibilities of this fw, concretely using it in conjunction
>>>>> with 3
>>>>> > party libs, like mongodb, for example.
>>>>> >
>>>>> > I have been keeping instructions from
>>>>> > http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in
>>>>> order to
>>>>> > connect spark with mongodb. This example is made with
>>>>> > spark-core_2.1.0-0.9.0.jar so first thing i did is to update pom.xml
>>>>> with
>>>>> > latest versions.
>>>>> >
>>>>> > This is my pom.xml
>>>>> >
>>>>> > <project xmlns="http://maven.apache.org/POM/4.0.0"
>>>>> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>>> >
>>>>> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>>>> > http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>>>> >
>>>>> > <groupId>com.aironman.spark</groupId>
>>>>> >
>>>>> > <artifactId>simple-project</artifactId>
>>>>> >
>>>>> > <modelVersion>4.0.0</modelVersion>
>>>>> >
>>>>> > <name>Simple Project</name>
>>>>> >
>>>>> > <packaging>jar</packaging>
>>>>> >
>>>>> > <version>1.0</version>
>>>>> >
>>>>> > <repositories>
>>>>> >
>>>>> > <repository>
>>>>> >
>>>>> > <id>Akka repository</id>
>>>>> >
>>>>> > <url>http://repo.akka.io/releases</url>
>>>>> >
>>>>> > </repository>
>>>>> >
>>>>> > </repositories>
>>>>> >
>>>>> > <dependencies>
>>>>> >
>>>>> > <dependency> <!-- Spark dependency -->
>>>>> >
>>>>> > <groupId>org.apache.spark</groupId>
>>>>> >
>>>>> > <artifactId>spark-core_2.10</artifactId>
>>>>> >
>>>>> > <version>1.0.0</version>
>>>>> >
>>>>> > </dependency>
>>>>> >
>>>>> >
>>>>> > <dependency>
>>>>> >
>>>>> > <groupId>org.mongodb</groupId>
>>>>> >
>>>>> > <artifactId>mongo-hadoop-core</artifactId>
>>>>> >
>>>>> > <version>1.0.0</version>
>>>>> >
>>>>> > </dependency>
>>>>> >
>>>>> >
>>>>> > </dependencies>
>>>>> >
>>>>> > </project>
>>>>> >
>>>>> > As you can see, super simple pom.xml
>>>>> >
>>>>> > And this is the JavaWordCount.java
>>>>> >
>>>>> > import java.util.Arrays;
>>>>> >
>>>>> > import java.util.Collections;
>>>>> >
>>>>> >
>>>>> > import org.apache.hadoop.conf.Configuration;
>>>>> >
>>>>> > import org.apache.spark.api.java.JavaPairRDD;
>>>>> >
>>>>> > import org.apache.spark.api.java.JavaRDD;
>>>>> >
>>>>> > import org.apache.spark.api.java.JavaSparkContext;
>>>>> >
>>>>> > import org.apache.spark.api.java.function.FlatMapFunction;
>>>>> >
>>>>> > import org.apache.spark.api.java.function.Function2;
>>>>> >
>>>>> > import org.apache.spark.api.java.function.PairFunction;
>>>>> >
>>>>> > import org.bson.BSONObject;
>>>>> >
>>>>> > import org.bson.BasicBSONObject;
>>>>> >
>>>>> >
>>>>> > import scala.Tuple2;
>>>>> >
>>>>> >
>>>>> > import com.mongodb.hadoop.MongoOutputFormat;
>>>>> >
>>>>> >
>>>>> > /***
>>>>> >
>>>>> >  * Esta clase se supone que se conecta a un cluster mongodb para
>>>>> ejecutar
>>>>> > una tarea word count por cada palabra almacenada en la bd.
>>>>> >
>>>>> >  * el problema es que esta api esta rota, creo. Estoy usando la
>>>>> ultima
>>>>> > version del fw, la 1.0.0. Esto no es spark-streaming, lo suyo seria
>>>>> usar un
>>>>> > ejemplo
>>>>> >
>>>>> >  * sobre spark-streaming conectandose a un base mongodb, o usar
>>>>> > spark-streaming junto con spring integration, es decir, conectar
>>>>> spark con
>>>>> > un servicio web que
>>>>> >
>>>>> >  * periodicamente alimentaria spark...
>>>>> >
>>>>> >  * @author aironman
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >  */
>>>>> >
>>>>> > public class JavaWordCount {
>>>>> >
>>>>> >
>>>>> >
>>>>> >     public static void main(String[] args) {
>>>>> >
>>>>> >
>>>>> >
>>>>> >         JavaSparkContext sc = new JavaSparkContext("local", "Java
>>>>> Word
>>>>> > Count");
>>>>> >
>>>>> >
>>>>> >
>>>>> >         Configuration config = new Configuration();
>>>>> >
>>>>> >         config.set("mongo.input.uri",
>>>>> > "mongodb:127.0.0.1:27017/beowulf.input");
>>>>> >
>>>>> >         config.set("mongo.output.uri",
>>>>> > "mongodb:127.0.0.1:27017/beowulf.output");
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >         JavaPairRDD<Object, BSONObject> mongoRDD =
>>>>> > sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class,
>>>>> > Object.class, BSONObject.class);
>>>>> >
>>>>> >
>>>>> >
>>>>> > //         Input contains tuples of (ObjectId, BSONObject)
>>>>> >
>>>>> >         JavaRDD<String> words = mongoRDD.flatMap(new
>>>>> > FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>>>>> >
>>>>> >             @Override
>>>>> >
>>>>> >             public Iterable<String> call(Tuple2<Object, BSONObject>
>>>>> arg) {
>>>>> >
>>>>> >                 Object o = arg._2.get("text");
>>>>> >
>>>>> >                 if (o instanceof String) {
>>>>> >
>>>>> >                     String str = (String) o;
>>>>> >
>>>>> >                     str = str.toLowerCase().replaceAll("[.,!?\n]", "
>>>>> ");
>>>>> >
>>>>> >                     return Arrays.asList(str.split(" "));
>>>>> >
>>>>> >                 } else {
>>>>> >
>>>>> >                     return Collections.emptyList();
>>>>> >
>>>>> >                 }
>>>>> >
>>>>> >             }
>>>>> >
>>>>> >         });
>>>>> >
>>>>> >         //here is an error, The method map(Function<String,R>) in
>>>>> the type
>>>>> > JavaRDD<String> is not applicable for the arguments (new
>>>>> > PairFunction<String,String,Integer>(){})
>>>>> >
>>>>> >         JavaPairRDD<String, Integer> ones = words.map(new
>>>>> > PairFunction<String, String, Integer>() {
>>>>> >
>>>>> >             public Tuple2<String, Integer> call(String s) {
>>>>> >
>>>>> >                 return new Tuple2<>(s, 1);
>>>>> >
>>>>> >             }
>>>>> >
>>>>> >         });
>>>>> >
>>>>> >         JavaPairRDD<String, Integer> counts = ones.reduceByKey(new
>>>>> > Function2<Integer, Integer, Integer>() {
>>>>> >
>>>>> >             public Integer call(Integer i1, Integer i2) {
>>>>> >
>>>>> >                 return i1 + i2;
>>>>> >
>>>>> >             }
>>>>> >
>>>>> >         });
>>>>> >
>>>>> >
>>>>> >
>>>>> >         //another error, The method
>>>>> map(Function<Tuple2<String,Integer>,R>)
>>>>> > in the type JavaPairRDD<String,Integer> is not applicable for the
>>>>> arguments
>>>>> > (new    //PairFunction<Tuple2<String,Integer>,Object,BSONObject>(){})
>>>>> >
>>>>> >
>>>>> > //         Output contains tuples of (null, BSONObject) - ObjectId
>>>>> will be
>>>>> > generated by Mongo driver if null
>>>>> >
>>>>> >         JavaPairRDD<Object, BSONObject> save = counts.map(new
>>>>> > PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
>>>>> >
>>>>> >             @Override
>>>>> >
>>>>> >             public Tuple2<Object, BSONObject> call(Tuple2<String,
>>>>> Integer>
>>>>> > tuple) {
>>>>> >
>>>>> >                 BSONObject bson = new BasicBSONObject();
>>>>> >
>>>>> >                 bson.put("word", tuple._1);
>>>>> >
>>>>> >                 bson.put("count", tuple._2);
>>>>> >
>>>>> >                 return new Tuple2<>(null, bson);
>>>>> >
>>>>> >             }
>>>>> >
>>>>> >         });
>>>>> >
>>>>> >
>>>>> >
>>>>> > //         Only MongoOutputFormat and config are relevant
>>>>> >
>>>>> >         save.saveAsNewAPIHadoopFile("file:/bogus", Object.class,
>>>>> > Object.class, MongoOutputFormat.class, config);
>>>>> >
>>>>> >     }
>>>>> >
>>>>> >
>>>>> >
>>>>> > }
>>>>> >
>>>>> >
>>>>> > It looks like jar hell dependency, isn't it?  can anyone guide or
>>>>> help me?
>>>>> >
>>>>> > Another thing, i don t like closures, is it possible to use this fw
>>>>> without
>>>>> > using it?
>>>>> > Another question, are this objects, JavaSparkContext sc,
>>>>> JavaPairRDD<Object,
>>>>> > BSONObject> mongoRDD ThreadSafe? Can i use them as singleton?
>>>>> >
>>>>> > Thank you very much and apologizes if the questions are not trending
>>>>> topic
>>>>> > :)
>>>>> >
>>>>> > Alonso Isidoro Roman.
>>>>> >
>>>>> > Mis citas preferidas (de hoy) :
>>>>> > "Si depurar es el proceso de quitar los errores de software, entonces
>>>>> > programar debe ser el proceso de introducirlos..."
>>>>> >  -  Edsger Dijkstra
>>>>> >
>>>>> > My favorite quotes (today):
>>>>> > "If debugging is the process of removing software bugs, then
>>>>> programming
>>>>> > must be the process of putting ..."
>>>>> >   - Edsger Dijkstra
>>>>> >
>>>>> > "If you pay peanuts you get monkeys"
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> -Sheryl
>>>
>>
>>
>
>
> --
> -Sheryl
>

Re: about a JavaWordCount example with spark-core_2.10-1.0.0.jar

Posted by Sheryl John <sh...@gmail.com>.

Make sure you have the right versions for 'mongo-hadoop-core' and
'hadoop-client'. I noticed you've used hadoop-client version 2.2.0

The maven repositories does not have the mongo-hadoop connector for Hadoop
version 2.2.0, so you have to include the mongo-hadoop connector as an
unmanaged library (as mentioned in the link
http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ and example
code).
I'm not sure if this will fix the error.


On Thu, Jun 26, 2014 at 3:23 AM, Alonso Isidoro Roman <al...@gmail.com>
wrote:

> Hi Sheryl,
>
> first at all, thanks for answering. spark-core_2.10 artifact has as
> dependency hadoop-client-1.0.4.jar and mongo-hadoop-core-1.0.0.jar has
> mongo-java-driver-2.7.3.jar.
>
> This is my actual pom.xml
>
> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
> http://www.w3.org/2001/XMLSchema-instance"
>
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>
> <groupId>com.aironman.spark</groupId>
>
> <artifactId>simple-project</artifactId>
>
> <modelVersion>4.0.0</modelVersion>
>
> <name>Simple Project</name>
>
> <packaging>jar</packaging>
>
> <version>1.0</version>
>
> <properties>
>
> <maven.compiler.source>1.7</maven.compiler.source>
>
> <maven.compiler.target>1.7</maven.compiler.target>
>
> <hadoop.version>1.0.2</hadoop.version>
>
> </properties>
>
>
>  <repositories>
>
> <repository>
>
> <id>Akka repository</id>
>
> <url>http://repo.akka.io/releases</url>
>
> </repository>
>
> </repositories>
>
> <dependencies>
>
> <dependency> <!-- Spark dependency -->
>
> <groupId>org.apache.spark</groupId>
>
> <artifactId>spark-core_2.10</artifactId>
>
> <version>1.0.0</version>
>
> </dependency>
>
> <dependency> <!-- Mongodb dependency -->
>
> <groupId>org.mongodb</groupId>
>
> <artifactId>mongo-hadoop-core</artifactId>
>
> <version>1.0.0</version>
>
> </dependency>
>
>
>  <dependency>
>
> <groupId>org.apache.hadoop</groupId>
>
> <artifactId>hadoop-client</artifactId>
>
> <version>2.2.0</version>
>
> </dependency>
>
> <dependency>
>
> <groupId>org.mongodb</groupId>
>
> <artifactId>mongo-java-driver</artifactId>
>
> <version>2.9.3</version>
>
> </dependency>
>
> </dependencies>
>
> <build>
>
> <outputDirectory>target/java-${maven.compiler.source}/classes</
> outputDirectory>
>
> <testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
> testOutputDirectory>
>
> <plugins>
>
> <plugin>
>
> <groupId>org.apache.maven.plugins</groupId>
>
> <artifactId>maven-shade-plugin</artifactId>
>
> <version>2.3</version>
>
> <configuration>
>
>
>  <shadedArtifactAttached>false</shadedArtifactAttached>
>
>
>  <outputFile>
> ${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
> </outputFile>
>
> <artifactSet>
>
> <includes>
>
> <include>*:*</include>
>
> </includes>
>
> </artifactSet>
>
>
>  <filters>
>
> <filter>
>
> <artifact>*:*</artifact>
>
> <!-- <excludes> <exclude>META-INF/*.SF</exclude>
> <exclude>META-INF/*.DSA</exclude>
>
> <exclude>META-INF/*.RSA</exclude> </excludes> -->
>
> </filter>
>
> </filters>
>
>
>  </configuration>
>
> <executions>
>
> <execution>
>
> <phase>package</phase>
>
> <goals>
>
> <goal>shade</goal>
>
> </goals>
>
> <configuration>
>
> <transformers>
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.AppendingTransformer">
>
> <resource>reference.conf</resource>
>
> </transformer>
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
>
> <resource>log4j.properties</resource>
>
> </transformer>
>
> </transformers>
>
> </configuration>
>
> </execution>
>
> </executions>
>
> </plugin>
>
> </plugins>
>
> </build>
>
> </project>
>
> and the error persists, this is the  spark-submit command`s output
>
> MacBook-Pro-Retina-de-Alonso:ConnectorSparkMongo aironman$
> /Users/aironman/spark-1.0.0/bin/spark-submit --class
> "com.aironman.spark.utils.JavaWordCount" --master local[4]
> target/simple-project-1.0.jar
>
> 14/06/26 12:05:21 INFO SecurityManager: Using Spark's default log4j
> profile: org/apache/spark/log4j-defaults.properties
>
> 14/06/26 12:05:21 INFO SecurityManager: Changing view acls to: aironman
>
> 14/06/26 12:05:21 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(aironman)
>
> 14/06/26 12:05:21 INFO Slf4jLogger: Slf4jLogger started
>
> 14/06/26 12:05:21 INFO Remoting: Starting remoting
>
> 14/06/26 12:05:21 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@192.168.1.35:49681]
>
> 14/06/26 12:05:21 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@192.168.1.35:49681]
>
> 14/06/26 12:05:21 INFO SparkEnv: Registering MapOutputTracker
>
> 14/06/26 12:05:21 INFO SparkEnv: Registering BlockManagerMaster
>
> 14/06/26 12:05:21 INFO DiskBlockManager: Created local directory at
> /var/folders/gn/pzkybyfd2g5bpyh47q0pp5nc0000gn/T/spark-local-20140626120521-09d2
>
> 14/06/26 12:05:21 INFO MemoryStore: MemoryStore started with capacity
> 294.9 MB.
>
> 14/06/26 12:05:21 INFO ConnectionManager: Bound socket to port 49682 with
> id = ConnectionManagerId(192.168.1.35,49682)
>
> 14/06/26 12:05:21 INFO BlockManagerMaster: Trying to register BlockManager
>
> 14/06/26 12:05:21 INFO BlockManagerInfo: Registering block manager
> 192.168.1.35:49682 with 294.9 MB RAM
>
> 14/06/26 12:05:21 INFO BlockManagerMaster: Registered BlockManager
>
> 14/06/26 12:05:21 INFO HttpServer: Starting HTTP Server
>
> 14/06/26 12:05:21 INFO HttpBroadcast: Broadcast server started at
> http://192.168.1.35:49683
>
> 14/06/26 12:05:21 INFO HttpFileServer: HTTP File server directory is
> /var/folders/gn/pzkybyfd2g5bpyh47q0pp5nc0000gn/T/spark-e0cff70b-9597-452f-b5db-99aef3245e04
>
> 14/06/26 12:05:21 INFO HttpServer: Starting HTTP Server
>
> 14/06/26 12:05:21 INFO SparkUI: Started SparkUI at
> http://192.168.1.35:4040
>
> 2014-06-26 12:05:21.915 java[880:1903] Unable to load realm info from
> SCDynamicStore
>
> 14/06/26 12:05:22 INFO SparkContext: Added JAR
> file:/Users/aironman/Documents/ws-spark/ConnectorSparkMongo/target/simple-project-1.0.jar
> at http://192.168.1.35:49684/jars/simple-project-1.0.jar with timestamp
> 1403777122023
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> com/mongodb/hadoop/MongoInputFormat
>
> at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:50)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: java.lang.ClassNotFoundException:
> com.mongodb.hadoop.MongoInputFormat
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
> ... 8 more
>
>
> com.mongodb.hadoop.MongoInputFormat is located in mongo-hadoop-core-1.0.0.jar
> and theorycally the jar is included in uber jar because i can see in
> the mvm clean package command output
>
>
> [INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded jar.
>
>
> It looks like the dependency jar is not included in the uber jar, despite
> the line shown above...
>
>
> Thanks in advance, i really appreciate it
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
>
> 2014-06-25 18:47 GMT+02:00 Sheryl John <sh...@gmail.com>:
>
> Hi Alonso,
>>
>> I was able to get Spark working with MongoDB following the above
>> instructions and used SBT to manage dependencies.
>> Try adding 'hadoop-client' and 'mongo-java-driver' dependencies in your
>> pom.xml.
>>
>>
>>
>> On Tue, Jun 24, 2014 at 8:06 AM, Alonso Isidoro Roman <alonsoir@gmail.com
>> > wrote:
>>
>>> Hi Yana, thanks for the answer, of course you are right and it works! no
>>> errors, now i am trying to execute it and i am getting an error
>>>
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> com/mongodb/hadoop/MongoInputFormat
>>>
>>> at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:49)
>>>
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>>>
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> com.mongodb.hadoop.MongoInputFormat
>>>
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>
>>> ... 8 more
>>>
>>>
>>> That class is located in mongo-hadoop-core-1.0.0.jar which is located
>>> and declared in pom.xml and i am using maven-shade-plugin, this is the
>>> actual pom.xml:
>>>
>>>
>>>
>>> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
>>> http://www.w3.org/2001/XMLSchema-instance"
>>>
>>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>>
>>> <groupId>com.aironman.spark</groupId>
>>>
>>> <artifactId>simple-project</artifactId>
>>>
>>> <modelVersion>4.0.0</modelVersion>
>>>
>>> <name>Simple Project</name>
>>>
>>> <packaging>jar</packaging>
>>>
>>> <version>1.0</version>
>>>
>>> <properties>
>>>
>>> <maven.compiler.source>1.7</maven.compiler.source>
>>>
>>> <maven.compiler.target>1.7</maven.compiler.target>
>>>
>>> <hadoop.version>1.0.2</hadoop.version>
>>>
>>> </properties>
>>>
>>>
>>> <repositories>
>>>
>>> <repository>
>>>
>>> <id>Akka repository</id>
>>>
>>> <url>http://repo.akka.io/releases</url>
>>>
>>> </repository>
>>>
>>> </repositories>
>>>
>>> <dependencies>
>>>
>>> <dependency> <!-- Spark dependency -->
>>>
>>> <groupId>org.apache.spark</groupId>
>>>
>>> <artifactId>spark-core_2.10</artifactId>
>>>
>>> <version>1.0.0</version>
>>>
>>> </dependency>
>>>
>>>
>>> <dependency>
>>>
>>> <groupId>org.mongodb</groupId>
>>>
>>> <artifactId>mongo-hadoop-core</artifactId>
>>>
>>> <version>1.0.0</version>
>>>
>>> </dependency>
>>>
>>>
>>> </dependencies>
>>>
>>> <build>
>>>
>>> <outputDirectory>target/java-${maven.compiler.source}/classes</
>>> outputDirectory>
>>>
>>> <testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
>>> testOutputDirectory>
>>>
>>> <plugins>
>>>
>>> <plugin>
>>>
>>> <groupId>org.apache.maven.plugins</groupId>
>>>
>>> <artifactId>maven-shade-plugin</artifactId>
>>>
>>> <version>2.3</version>
>>>
>>> <configuration>
>>>
>>>  <shadedArtifactAttached>false</shadedArtifactAttached>
>>>
>>>  <outputFile>
>>> ${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
>>> </outputFile>
>>>
>>> <artifactSet>
>>>
>>> <includes>
>>>
>>> <include>*:*</include>
>>>
>>> </includes>
>>>
>>> </artifactSet>
>>>
>>> <!--
>>>
>>>  <filters>
>>>
>>> <filter>
>>>
>>>  <artifact>*:*</artifact>
>>>
>>> <excludes>
>>>
>>> <exclude>META-INF/*.SF</exclude>
>>>
>>>  <exclude>META-INF/*.DSA</exclude>
>>>
>>> <exclude>META-INF/*.RSA</exclude>
>>>
>>> </excludes>
>>>
>>>  </filter>
>>>
>>> </filters>
>>>
>>>  -->
>>>
>>> </configuration>
>>>
>>> <executions>
>>>
>>> <execution>
>>>
>>> <phase>package</phase>
>>>
>>> <goals>
>>>
>>> <goal>shade</goal>
>>>
>>> </goals>
>>>
>>> <configuration>
>>>
>>> <transformers>
>>>
>>> <transformer
>>>
>>> implementation=
>>> "org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
>>>
>>> <transformer
>>>
>>> implementation=
>>> "org.apache.maven.plugins.shade.resource.AppendingTransformer">
>>>
>>> <resource>reference.conf</resource>
>>>
>>> </transformer>
>>>
>>> <transformer
>>>
>>> implementation=
>>> "org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer"
>>> >
>>>
>>> <resource>log4j.properties</resource>
>>>
>>> </transformer>
>>>
>>> </transformers>
>>>
>>> </configuration>
>>>
>>> </execution>
>>>
>>> </executions>
>>>
>>> </plugin>
>>>
>>> </plugins>
>>>
>>> </build>
>>>
>>> </project>
>>>
>>> It is confusing me because i can see the output of mvm package:
>>>
>>> ...
>>>
>>> [INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded
>>> jar.
>>>
>>> ...
>>> in theory, the uber jar is created with all dependencies but if i open
>>> the jar i don't see them.
>>>
>>> Thanks in advance
>>>
>>>
>>> Alonso Isidoro Roman.
>>>
>>> Mis citas preferidas (de hoy) :
>>> "Si depurar es el proceso de quitar los errores de software, entonces
>>> programar debe ser el proceso de introducirlos..."
>>>  -  Edsger Dijkstra
>>>
>>> My favorite quotes (today):
>>> "If debugging is the process of removing software bugs, then programming
>>> must be the process of putting ..."
>>>   - Edsger Dijkstra
>>>
>>> "If you pay peanuts you get monkeys"
>>>
>>>
>>>
>>> 2014-06-23 21:27 GMT+02:00 Yana Kadiyska <ya...@gmail.com>:
>>>
>>>  One thing I noticed around the place where you get the first error --
>>>> you are calling words.map instead of words.mapToPair. map produces
>>>> JavaRDD<R> whereas mapToPair gives you a JavaPairRDD. I don't use the
>>>> Java APIs myself but it looks to me like you need to check the types
>>>> more carefully.
>>>>
>>>> On Mon, Jun 23, 2014 at 2:15 PM, Alonso Isidoro Roman
>>>> <al...@gmail.com> wrote:
>>>> > Hi all,
>>>> >
>>>> > I am new to Spark, so this is probably a basic question. i want to
>>>> explore
>>>> > the possibilities of this fw, concretely using it in conjunction with
>>>> 3
>>>> > party libs, like mongodb, for example.
>>>> >
>>>> > I have been keeping instructions from
>>>> > http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in order
>>>> to
>>>> > connect spark with mongodb. This example is made with
>>>> > spark-core_2.1.0-0.9.0.jar so first thing i did is to update pom.xml
>>>> with
>>>> > latest versions.
>>>> >
>>>> > This is my pom.xml
>>>> >
>>>> > <project xmlns="http://maven.apache.org/POM/4.0.0"
>>>> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>> >
>>>> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>>> > http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>>> >
>>>> > <groupId>com.aironman.spark</groupId>
>>>> >
>>>> > <artifactId>simple-project</artifactId>
>>>> >
>>>> > <modelVersion>4.0.0</modelVersion>
>>>> >
>>>> > <name>Simple Project</name>
>>>> >
>>>> > <packaging>jar</packaging>
>>>> >
>>>> > <version>1.0</version>
>>>> >
>>>> > <repositories>
>>>> >
>>>> > <repository>
>>>> >
>>>> > <id>Akka repository</id>
>>>> >
>>>> > <url>http://repo.akka.io/releases</url>
>>>> >
>>>> > </repository>
>>>> >
>>>> > </repositories>
>>>> >
>>>> > <dependencies>
>>>> >
>>>> > <dependency> <!-- Spark dependency -->
>>>> >
>>>> > <groupId>org.apache.spark</groupId>
>>>> >
>>>> > <artifactId>spark-core_2.10</artifactId>
>>>> >
>>>> > <version>1.0.0</version>
>>>> >
>>>> > </dependency>
>>>> >
>>>> >
>>>> > <dependency>
>>>> >
>>>> > <groupId>org.mongodb</groupId>
>>>> >
>>>> > <artifactId>mongo-hadoop-core</artifactId>
>>>> >
>>>> > <version>1.0.0</version>
>>>> >
>>>> > </dependency>
>>>> >
>>>> >
>>>> > </dependencies>
>>>> >
>>>> > </project>
>>>> >
>>>> > As you can see, super simple pom.xml
>>>> >
>>>> > And this is the JavaWordCount.java
>>>> >
>>>> > import java.util.Arrays;
>>>> >
>>>> > import java.util.Collections;
>>>> >
>>>> >
>>>> > import org.apache.hadoop.conf.Configuration;
>>>> >
>>>> > import org.apache.spark.api.java.JavaPairRDD;
>>>> >
>>>> > import org.apache.spark.api.java.JavaRDD;
>>>> >
>>>> > import org.apache.spark.api.java.JavaSparkContext;
>>>> >
>>>> > import org.apache.spark.api.java.function.FlatMapFunction;
>>>> >
>>>> > import org.apache.spark.api.java.function.Function2;
>>>> >
>>>> > import org.apache.spark.api.java.function.PairFunction;
>>>> >
>>>> > import org.bson.BSONObject;
>>>> >
>>>> > import org.bson.BasicBSONObject;
>>>> >
>>>> >
>>>> > import scala.Tuple2;
>>>> >
>>>> >
>>>> > import com.mongodb.hadoop.MongoOutputFormat;
>>>> >
>>>> >
>>>> > /***
>>>> >
>>>> >  * Esta clase se supone que se conecta a un cluster mongodb para
>>>> ejecutar
>>>> > una tarea word count por cada palabra almacenada en la bd.
>>>> >
>>>> >  * el problema es que esta api esta rota, creo. Estoy usando la ultima
>>>> > version del fw, la 1.0.0. Esto no es spark-streaming, lo suyo seria
>>>> usar un
>>>> > ejemplo
>>>> >
>>>> >  * sobre spark-streaming conectandose a un base mongodb, o usar
>>>> > spark-streaming junto con spring integration, es decir, conectar
>>>> spark con
>>>> > un servicio web que
>>>> >
>>>> >  * periodicamente alimentaria spark...
>>>> >
>>>> >  * @author aironman
>>>> >
>>>> >  *
>>>> >
>>>> >  */
>>>> >
>>>> > public class JavaWordCount {
>>>> >
>>>> >
>>>> >
>>>> >     public static void main(String[] args) {
>>>> >
>>>> >
>>>> >
>>>> >         JavaSparkContext sc = new JavaSparkContext("local", "Java Word
>>>> > Count");
>>>> >
>>>> >
>>>> >
>>>> >         Configuration config = new Configuration();
>>>> >
>>>> >         config.set("mongo.input.uri",
>>>> > "mongodb:127.0.0.1:27017/beowulf.input");
>>>> >
>>>> >         config.set("mongo.output.uri",
>>>> > "mongodb:127.0.0.1:27017/beowulf.output");
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >         JavaPairRDD<Object, BSONObject> mongoRDD =
>>>> > sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class,
>>>> > Object.class, BSONObject.class);
>>>> >
>>>> >
>>>> >
>>>> > //         Input contains tuples of (ObjectId, BSONObject)
>>>> >
>>>> >         JavaRDD<String> words = mongoRDD.flatMap(new
>>>> > FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>>>> >
>>>> >             @Override
>>>> >
>>>> >             public Iterable<String> call(Tuple2<Object, BSONObject>
>>>> arg) {
>>>> >
>>>> >                 Object o = arg._2.get("text");
>>>> >
>>>> >                 if (o instanceof String) {
>>>> >
>>>> >                     String str = (String) o;
>>>> >
>>>> >                     str = str.toLowerCase().replaceAll("[.,!?\n]", "
>>>> ");
>>>> >
>>>> >                     return Arrays.asList(str.split(" "));
>>>> >
>>>> >                 } else {
>>>> >
>>>> >                     return Collections.emptyList();
>>>> >
>>>> >                 }
>>>> >
>>>> >             }
>>>> >
>>>> >         });
>>>> >
>>>> >         //here is an error, The method map(Function<String,R>) in the
>>>> type
>>>> > JavaRDD<String> is not applicable for the arguments (new
>>>> > PairFunction<String,String,Integer>(){})
>>>> >
>>>> >         JavaPairRDD<String, Integer> ones = words.map(new
>>>> > PairFunction<String, String, Integer>() {
>>>> >
>>>> >             public Tuple2<String, Integer> call(String s) {
>>>> >
>>>> >                 return new Tuple2<>(s, 1);
>>>> >
>>>> >             }
>>>> >
>>>> >         });
>>>> >
>>>> >         JavaPairRDD<String, Integer> counts = ones.reduceByKey(new
>>>> > Function2<Integer, Integer, Integer>() {
>>>> >
>>>> >             public Integer call(Integer i1, Integer i2) {
>>>> >
>>>> >                 return i1 + i2;
>>>> >
>>>> >             }
>>>> >
>>>> >         });
>>>> >
>>>> >
>>>> >
>>>> >         //another error, The method
>>>> map(Function<Tuple2<String,Integer>,R>)
>>>> > in the type JavaPairRDD<String,Integer> is not applicable for the
>>>> arguments
>>>> > (new    //PairFunction<Tuple2<String,Integer>,Object,BSONObject>(){})
>>>> >
>>>> >
>>>> > //         Output contains tuples of (null, BSONObject) - ObjectId
>>>> will be
>>>> > generated by Mongo driver if null
>>>> >
>>>> >         JavaPairRDD<Object, BSONObject> save = counts.map(new
>>>> > PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
>>>> >
>>>> >             @Override
>>>> >
>>>> >             public Tuple2<Object, BSONObject> call(Tuple2<String,
>>>> Integer>
>>>> > tuple) {
>>>> >
>>>> >                 BSONObject bson = new BasicBSONObject();
>>>> >
>>>> >                 bson.put("word", tuple._1);
>>>> >
>>>> >                 bson.put("count", tuple._2);
>>>> >
>>>> >                 return new Tuple2<>(null, bson);
>>>> >
>>>> >             }
>>>> >
>>>> >         });
>>>> >
>>>> >
>>>> >
>>>> > //         Only MongoOutputFormat and config are relevant
>>>> >
>>>> >         save.saveAsNewAPIHadoopFile("file:/bogus", Object.class,
>>>> > Object.class, MongoOutputFormat.class, config);
>>>> >
>>>> >     }
>>>> >
>>>> >
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > It looks like jar hell dependency, isn't it?  can anyone guide or
>>>> help me?
>>>> >
>>>> > Another thing, i don t like closures, is it possible to use this fw
>>>> without
>>>> > using it?
>>>> > Another question, are this objects, JavaSparkContext sc,
>>>> JavaPairRDD<Object,
>>>> > BSONObject> mongoRDD ThreadSafe? Can i use them as singleton?
>>>> >
>>>> > Thank you very much and apologizes if the questions are not trending
>>>> topic
>>>> > :)
>>>> >
>>>> > Alonso Isidoro Roman.
>>>> >
>>>> > Mis citas preferidas (de hoy) :
>>>> > "Si depurar es el proceso de quitar los errores de software, entonces
>>>> > programar debe ser el proceso de introducirlos..."
>>>> >  -  Edsger Dijkstra
>>>> >
>>>> > My favorite quotes (today):
>>>> > "If debugging is the process of removing software bugs, then
>>>> programming
>>>> > must be the process of putting ..."
>>>> >   - Edsger Dijkstra
>>>> >
>>>> > "If you pay peanuts you get monkeys"
>>>> >
>>>>
>>>
>>>
>>
>>
>> --
>> -Sheryl
>>
>
>


-- 
-Sheryl

Re: about a JavaWordCount example with spark-core_2.10-1.0.0.jar

Posted by Alonso Isidoro Roman <al...@gmail.com>.

Hi Sheryl,

first at all, thanks for answering. spark-core_2.10 artifact has as
dependency hadoop-client-1.0.4.jar and mongo-hadoop-core-1.0.0.jar has
mongo-java-driver-2.7.3.jar.

This is my actual pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">

<groupId>com.aironman.spark</groupId>

<artifactId>simple-project</artifactId>

<modelVersion>4.0.0</modelVersion>

<name>Simple Project</name>

<packaging>jar</packaging>

<version>1.0</version>

<properties>

<maven.compiler.source>1.7</maven.compiler.source>

<maven.compiler.target>1.7</maven.compiler.target>

<hadoop.version>1.0.2</hadoop.version>

</properties>


 <repositories>

<repository>

<id>Akka repository</id>

<url>http://repo.akka.io/releases</url>

</repository>

</repositories>

<dependencies>

<dependency> <!-- Spark dependency -->

<groupId>org.apache.spark</groupId>

<artifactId>spark-core_2.10</artifactId>

<version>1.0.0</version>

</dependency>

<dependency> <!-- Mongodb dependency -->

<groupId>org.mongodb</groupId>

<artifactId>mongo-hadoop-core</artifactId>

<version>1.0.0</version>

</dependency>


 <dependency>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-client</artifactId>

<version>2.2.0</version>

</dependency>

<dependency>

<groupId>org.mongodb</groupId>

<artifactId>mongo-java-driver</artifactId>

<version>2.9.3</version>

</dependency>

</dependencies>

<build>

<outputDirectory>target/java-${maven.compiler.source}/classes</
outputDirectory>

<testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
testOutputDirectory>

<plugins>

<plugin>

<groupId>org.apache.maven.plugins</groupId>

<artifactId>maven-shade-plugin</artifactId>

<version>2.3</version>

<configuration>


 <shadedArtifactAttached>false</shadedArtifactAttached>


 <outputFile>
${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
</outputFile>

<artifactSet>

<includes>

<include>*:*</include>

</includes>

</artifactSet>


 <filters>

<filter>

<artifact>*:*</artifact>

<!-- <excludes> <exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>

<exclude>META-INF/*.RSA</exclude> </excludes> -->

</filter>

</filters>


 </configuration>

<executions>

<execution>

<phase>package</phase>

<goals>

<goal>shade</goal>

</goals>

<configuration>

<transformers>

<transformer

implementation=
"org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />

<transformer

implementation=
"org.apache.maven.plugins.shade.resource.AppendingTransformer">

<resource>reference.conf</resource>

</transformer>

<transformer

implementation=
"org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">

<resource>log4j.properties</resource>

</transformer>

</transformers>

</configuration>

</execution>

</executions>

</plugin>

</plugins>

</build>

</project>

and the error persists, this is the  spark-submit command`s output

MacBook-Pro-Retina-de-Alonso:ConnectorSparkMongo aironman$
/Users/aironman/spark-1.0.0/bin/spark-submit --class
"com.aironman.spark.utils.JavaWordCount" --master local[4]
target/simple-project-1.0.jar

14/06/26 12:05:21 INFO SecurityManager: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties

14/06/26 12:05:21 INFO SecurityManager: Changing view acls to: aironman

14/06/26 12:05:21 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(aironman)

14/06/26 12:05:21 INFO Slf4jLogger: Slf4jLogger started

14/06/26 12:05:21 INFO Remoting: Starting remoting

14/06/26 12:05:21 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@192.168.1.35:49681]

14/06/26 12:05:21 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@192.168.1.35:49681]

14/06/26 12:05:21 INFO SparkEnv: Registering MapOutputTracker

14/06/26 12:05:21 INFO SparkEnv: Registering BlockManagerMaster

14/06/26 12:05:21 INFO DiskBlockManager: Created local directory at
/var/folders/gn/pzkybyfd2g5bpyh47q0pp5nc0000gn/T/spark-local-20140626120521-09d2

14/06/26 12:05:21 INFO MemoryStore: MemoryStore started with capacity 294.9
MB.

14/06/26 12:05:21 INFO ConnectionManager: Bound socket to port 49682 with
id = ConnectionManagerId(192.168.1.35,49682)

14/06/26 12:05:21 INFO BlockManagerMaster: Trying to register BlockManager

14/06/26 12:05:21 INFO BlockManagerInfo: Registering block manager
192.168.1.35:49682 with 294.9 MB RAM

14/06/26 12:05:21 INFO BlockManagerMaster: Registered BlockManager

14/06/26 12:05:21 INFO HttpServer: Starting HTTP Server

14/06/26 12:05:21 INFO HttpBroadcast: Broadcast server started at
http://192.168.1.35:49683

14/06/26 12:05:21 INFO HttpFileServer: HTTP File server directory is
/var/folders/gn/pzkybyfd2g5bpyh47q0pp5nc0000gn/T/spark-e0cff70b-9597-452f-b5db-99aef3245e04

14/06/26 12:05:21 INFO HttpServer: Starting HTTP Server

14/06/26 12:05:21 INFO SparkUI: Started SparkUI at http://192.168.1.35:4040

2014-06-26 12:05:21.915 java[880:1903] Unable to load realm info from
SCDynamicStore

14/06/26 12:05:22 INFO SparkContext: Added JAR
file:/Users/aironman/Documents/ws-spark/ConnectorSparkMongo/target/simple-project-1.0.jar
at http://192.168.1.35:49684/jars/simple-project-1.0.jar with timestamp
1403777122023

Exception in thread "main" java.lang.NoClassDefFoundError:
com/mongodb/hadoop/MongoInputFormat

at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:50)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.ClassNotFoundException:
com.mongodb.hadoop.MongoInputFormat

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:425)

at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

... 8 more


com.mongodb.hadoop.MongoInputFormat is located in mongo-hadoop-core-1.0.0.jar
and theorycally the jar is included in uber jar because i can see in
the mvm clean package command output


[INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded jar.


It looks like the dependency jar is not included in the uber jar, despite
the line shown above...


Thanks in advance, i really appreciate it

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"



2014-06-25 18:47 GMT+02:00 Sheryl John <sh...@gmail.com>:

> Hi Alonso,
>
> I was able to get Spark working with MongoDB following the above
> instructions and used SBT to manage dependencies.
> Try adding 'hadoop-client' and 'mongo-java-driver' dependencies in your
> pom.xml.
>
>
>
> On Tue, Jun 24, 2014 at 8:06 AM, Alonso Isidoro Roman <al...@gmail.com>
> wrote:
>
>> Hi Yana, thanks for the answer, of course you are right and it works! no
>> errors, now i am trying to execute it and i am getting an error
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> com/mongodb/hadoop/MongoInputFormat
>>
>> at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:49)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>>
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>>
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> Caused by: java.lang.ClassNotFoundException:
>> com.mongodb.hadoop.MongoInputFormat
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>
>> ... 8 more
>>
>>
>> That class is located in mongo-hadoop-core-1.0.0.jar which is located
>> and declared in pom.xml and i am using maven-shade-plugin, this is the
>> actual pom.xml:
>>
>>
>>
>> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
>> http://www.w3.org/2001/XMLSchema-instance"
>>
>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>
>> <groupId>com.aironman.spark</groupId>
>>
>> <artifactId>simple-project</artifactId>
>>
>> <modelVersion>4.0.0</modelVersion>
>>
>> <name>Simple Project</name>
>>
>> <packaging>jar</packaging>
>>
>> <version>1.0</version>
>>
>> <properties>
>>
>> <maven.compiler.source>1.7</maven.compiler.source>
>>
>> <maven.compiler.target>1.7</maven.compiler.target>
>>
>> <hadoop.version>1.0.2</hadoop.version>
>>
>> </properties>
>>
>>
>> <repositories>
>>
>> <repository>
>>
>> <id>Akka repository</id>
>>
>> <url>http://repo.akka.io/releases</url>
>>
>> </repository>
>>
>> </repositories>
>>
>> <dependencies>
>>
>> <dependency> <!-- Spark dependency -->
>>
>> <groupId>org.apache.spark</groupId>
>>
>> <artifactId>spark-core_2.10</artifactId>
>>
>> <version>1.0.0</version>
>>
>> </dependency>
>>
>>
>> <dependency>
>>
>> <groupId>org.mongodb</groupId>
>>
>> <artifactId>mongo-hadoop-core</artifactId>
>>
>> <version>1.0.0</version>
>>
>> </dependency>
>>
>>
>> </dependencies>
>>
>> <build>
>>
>> <outputDirectory>target/java-${maven.compiler.source}/classes</
>> outputDirectory>
>>
>> <testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
>> testOutputDirectory>
>>
>> <plugins>
>>
>> <plugin>
>>
>> <groupId>org.apache.maven.plugins</groupId>
>>
>> <artifactId>maven-shade-plugin</artifactId>
>>
>> <version>2.3</version>
>>
>> <configuration>
>>
>>  <shadedArtifactAttached>false</shadedArtifactAttached>
>>
>>  <outputFile>
>> ${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
>> </outputFile>
>>
>> <artifactSet>
>>
>> <includes>
>>
>> <include>*:*</include>
>>
>> </includes>
>>
>> </artifactSet>
>>
>> <!--
>>
>>  <filters>
>>
>> <filter>
>>
>>  <artifact>*:*</artifact>
>>
>> <excludes>
>>
>> <exclude>META-INF/*.SF</exclude>
>>
>>  <exclude>META-INF/*.DSA</exclude>
>>
>> <exclude>META-INF/*.RSA</exclude>
>>
>> </excludes>
>>
>>  </filter>
>>
>> </filters>
>>
>>  -->
>>
>> </configuration>
>>
>> <executions>
>>
>> <execution>
>>
>> <phase>package</phase>
>>
>> <goals>
>>
>> <goal>shade</goal>
>>
>> </goals>
>>
>> <configuration>
>>
>> <transformers>
>>
>> <transformer
>>
>> implementation=
>> "org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
>>
>> <transformer
>>
>> implementation=
>> "org.apache.maven.plugins.shade.resource.AppendingTransformer">
>>
>> <resource>reference.conf</resource>
>>
>> </transformer>
>>
>> <transformer
>>
>> implementation=
>> "org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
>>
>> <resource>log4j.properties</resource>
>>
>> </transformer>
>>
>> </transformers>
>>
>> </configuration>
>>
>> </execution>
>>
>> </executions>
>>
>> </plugin>
>>
>> </plugins>
>>
>> </build>
>>
>> </project>
>>
>> It is confusing me because i can see the output of mvm package:
>>
>> ...
>>
>> [INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded
>> jar.
>>
>> ...
>> in theory, the uber jar is created with all dependencies but if i open
>> the jar i don't see them.
>>
>> Thanks in advance
>>
>>
>> Alonso Isidoro Roman.
>>
>> Mis citas preferidas (de hoy) :
>> "Si depurar es el proceso de quitar los errores de software, entonces
>> programar debe ser el proceso de introducirlos..."
>>  -  Edsger Dijkstra
>>
>> My favorite quotes (today):
>> "If debugging is the process of removing software bugs, then programming
>> must be the process of putting ..."
>>   - Edsger Dijkstra
>>
>> "If you pay peanuts you get monkeys"
>>
>>
>>
>> 2014-06-23 21:27 GMT+02:00 Yana Kadiyska <ya...@gmail.com>:
>>
>>  One thing I noticed around the place where you get the first error --
>>> you are calling words.map instead of words.mapToPair. map produces
>>> JavaRDD<R> whereas mapToPair gives you a JavaPairRDD. I don't use the
>>> Java APIs myself but it looks to me like you need to check the types
>>> more carefully.
>>>
>>> On Mon, Jun 23, 2014 at 2:15 PM, Alonso Isidoro Roman
>>> <al...@gmail.com> wrote:
>>> > Hi all,
>>> >
>>> > I am new to Spark, so this is probably a basic question. i want to
>>> explore
>>> > the possibilities of this fw, concretely using it in conjunction with 3
>>> > party libs, like mongodb, for example.
>>> >
>>> > I have been keeping instructions from
>>> > http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in order
>>> to
>>> > connect spark with mongodb. This example is made with
>>> > spark-core_2.1.0-0.9.0.jar so first thing i did is to update pom.xml
>>> with
>>> > latest versions.
>>> >
>>> > This is my pom.xml
>>> >
>>> > <project xmlns="http://maven.apache.org/POM/4.0.0"
>>> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>> >
>>> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>> > http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>> >
>>> > <groupId>com.aironman.spark</groupId>
>>> >
>>> > <artifactId>simple-project</artifactId>
>>> >
>>> > <modelVersion>4.0.0</modelVersion>
>>> >
>>> > <name>Simple Project</name>
>>> >
>>> > <packaging>jar</packaging>
>>> >
>>> > <version>1.0</version>
>>> >
>>> > <repositories>
>>> >
>>> > <repository>
>>> >
>>> > <id>Akka repository</id>
>>> >
>>> > <url>http://repo.akka.io/releases</url>
>>> >
>>> > </repository>
>>> >
>>> > </repositories>
>>> >
>>> > <dependencies>
>>> >
>>> > <dependency> <!-- Spark dependency -->
>>> >
>>> > <groupId>org.apache.spark</groupId>
>>> >
>>> > <artifactId>spark-core_2.10</artifactId>
>>> >
>>> > <version>1.0.0</version>
>>> >
>>> > </dependency>
>>> >
>>> >
>>> > <dependency>
>>> >
>>> > <groupId>org.mongodb</groupId>
>>> >
>>> > <artifactId>mongo-hadoop-core</artifactId>
>>> >
>>> > <version>1.0.0</version>
>>> >
>>> > </dependency>
>>> >
>>> >
>>> > </dependencies>
>>> >
>>> > </project>
>>> >
>>> > As you can see, super simple pom.xml
>>> >
>>> > And this is the JavaWordCount.java
>>> >
>>> > import java.util.Arrays;
>>> >
>>> > import java.util.Collections;
>>> >
>>> >
>>> > import org.apache.hadoop.conf.Configuration;
>>> >
>>> > import org.apache.spark.api.java.JavaPairRDD;
>>> >
>>> > import org.apache.spark.api.java.JavaRDD;
>>> >
>>> > import org.apache.spark.api.java.JavaSparkContext;
>>> >
>>> > import org.apache.spark.api.java.function.FlatMapFunction;
>>> >
>>> > import org.apache.spark.api.java.function.Function2;
>>> >
>>> > import org.apache.spark.api.java.function.PairFunction;
>>> >
>>> > import org.bson.BSONObject;
>>> >
>>> > import org.bson.BasicBSONObject;
>>> >
>>> >
>>> > import scala.Tuple2;
>>> >
>>> >
>>> > import com.mongodb.hadoop.MongoOutputFormat;
>>> >
>>> >
>>> > /***
>>> >
>>> >  * Esta clase se supone que se conecta a un cluster mongodb para
>>> ejecutar
>>> > una tarea word count por cada palabra almacenada en la bd.
>>> >
>>> >  * el problema es que esta api esta rota, creo. Estoy usando la ultima
>>> > version del fw, la 1.0.0. Esto no es spark-streaming, lo suyo seria
>>> usar un
>>> > ejemplo
>>> >
>>> >  * sobre spark-streaming conectandose a un base mongodb, o usar
>>> > spark-streaming junto con spring integration, es decir, conectar spark
>>> con
>>> > un servicio web que
>>> >
>>> >  * periodicamente alimentaria spark...
>>> >
>>> >  * @author aironman
>>> >
>>> >  *
>>> >
>>> >  */
>>> >
>>> > public class JavaWordCount {
>>> >
>>> >
>>> >
>>> >     public static void main(String[] args) {
>>> >
>>> >
>>> >
>>> >         JavaSparkContext sc = new JavaSparkContext("local", "Java Word
>>> > Count");
>>> >
>>> >
>>> >
>>> >         Configuration config = new Configuration();
>>> >
>>> >         config.set("mongo.input.uri",
>>> > "mongodb:127.0.0.1:27017/beowulf.input");
>>> >
>>> >         config.set("mongo.output.uri",
>>> > "mongodb:127.0.0.1:27017/beowulf.output");
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >         JavaPairRDD<Object, BSONObject> mongoRDD =
>>> > sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class,
>>> > Object.class, BSONObject.class);
>>> >
>>> >
>>> >
>>> > //         Input contains tuples of (ObjectId, BSONObject)
>>> >
>>> >         JavaRDD<String> words = mongoRDD.flatMap(new
>>> > FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>>> >
>>> >             @Override
>>> >
>>> >             public Iterable<String> call(Tuple2<Object, BSONObject>
>>> arg) {
>>> >
>>> >                 Object o = arg._2.get("text");
>>> >
>>> >                 if (o instanceof String) {
>>> >
>>> >                     String str = (String) o;
>>> >
>>> >                     str = str.toLowerCase().replaceAll("[.,!?\n]", "
>>> ");
>>> >
>>> >                     return Arrays.asList(str.split(" "));
>>> >
>>> >                 } else {
>>> >
>>> >                     return Collections.emptyList();
>>> >
>>> >                 }
>>> >
>>> >             }
>>> >
>>> >         });
>>> >
>>> >         //here is an error, The method map(Function<String,R>) in the
>>> type
>>> > JavaRDD<String> is not applicable for the arguments (new
>>> > PairFunction<String,String,Integer>(){})
>>> >
>>> >         JavaPairRDD<String, Integer> ones = words.map(new
>>> > PairFunction<String, String, Integer>() {
>>> >
>>> >             public Tuple2<String, Integer> call(String s) {
>>> >
>>> >                 return new Tuple2<>(s, 1);
>>> >
>>> >             }
>>> >
>>> >         });
>>> >
>>> >         JavaPairRDD<String, Integer> counts = ones.reduceByKey(new
>>> > Function2<Integer, Integer, Integer>() {
>>> >
>>> >             public Integer call(Integer i1, Integer i2) {
>>> >
>>> >                 return i1 + i2;
>>> >
>>> >             }
>>> >
>>> >         });
>>> >
>>> >
>>> >
>>> >         //another error, The method
>>> map(Function<Tuple2<String,Integer>,R>)
>>> > in the type JavaPairRDD<String,Integer> is not applicable for the
>>> arguments
>>> > (new    //PairFunction<Tuple2<String,Integer>,Object,BSONObject>(){})
>>> >
>>> >
>>> > //         Output contains tuples of (null, BSONObject) - ObjectId
>>> will be
>>> > generated by Mongo driver if null
>>> >
>>> >         JavaPairRDD<Object, BSONObject> save = counts.map(new
>>> > PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
>>> >
>>> >             @Override
>>> >
>>> >             public Tuple2<Object, BSONObject> call(Tuple2<String,
>>> Integer>
>>> > tuple) {
>>> >
>>> >                 BSONObject bson = new BasicBSONObject();
>>> >
>>> >                 bson.put("word", tuple._1);
>>> >
>>> >                 bson.put("count", tuple._2);
>>> >
>>> >                 return new Tuple2<>(null, bson);
>>> >
>>> >             }
>>> >
>>> >         });
>>> >
>>> >
>>> >
>>> > //         Only MongoOutputFormat and config are relevant
>>> >
>>> >         save.saveAsNewAPIHadoopFile("file:/bogus", Object.class,
>>> > Object.class, MongoOutputFormat.class, config);
>>> >
>>> >     }
>>> >
>>> >
>>> >
>>> > }
>>> >
>>> >
>>> > It looks like jar hell dependency, isn't it?  can anyone guide or help
>>> me?
>>> >
>>> > Another thing, i don t like closures, is it possible to use this fw
>>> without
>>> > using it?
>>> > Another question, are this objects, JavaSparkContext sc,
>>> JavaPairRDD<Object,
>>> > BSONObject> mongoRDD ThreadSafe? Can i use them as singleton?
>>> >
>>> > Thank you very much and apologizes if the questions are not trending
>>> topic
>>> > :)
>>> >
>>> > Alonso Isidoro Roman.
>>> >
>>> > Mis citas preferidas (de hoy) :
>>> > "Si depurar es el proceso de quitar los errores de software, entonces
>>> > programar debe ser el proceso de introducirlos..."
>>> >  -  Edsger Dijkstra
>>> >
>>> > My favorite quotes (today):
>>> > "If debugging is the process of removing software bugs, then
>>> programming
>>> > must be the process of putting ..."
>>> >   - Edsger Dijkstra
>>> >
>>> > "If you pay peanuts you get monkeys"
>>> >
>>>
>>
>>
>
>
> --
> -Sheryl
>

Re: about a JavaWordCount example with spark-core_2.10-1.0.0.jar

Posted by Sheryl John <sh...@gmail.com>.

Hi Alonso,

I was able to get Spark working with MongoDB following the above
instructions and used SBT to manage dependencies.
Try adding 'hadoop-client' and 'mongo-java-driver' dependencies in your
pom.xml.



On Tue, Jun 24, 2014 at 8:06 AM, Alonso Isidoro Roman <al...@gmail.com>
wrote:

> Hi Yana, thanks for the answer, of course you are right and it works! no
> errors, now i am trying to execute it and i am getting an error
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> com/mongodb/hadoop/MongoInputFormat
>
> at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:49)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: java.lang.ClassNotFoundException:
> com.mongodb.hadoop.MongoInputFormat
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
> ... 8 more
>
>
> That class is located in mongo-hadoop-core-1.0.0.jar which is located and
> declared in pom.xml and i am using maven-shade-plugin, this is the actual
> pom.xml:
>
>
>
> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
> http://www.w3.org/2001/XMLSchema-instance"
>
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>
> <groupId>com.aironman.spark</groupId>
>
> <artifactId>simple-project</artifactId>
>
> <modelVersion>4.0.0</modelVersion>
>
> <name>Simple Project</name>
>
> <packaging>jar</packaging>
>
> <version>1.0</version>
>
> <properties>
>
> <maven.compiler.source>1.7</maven.compiler.source>
>
> <maven.compiler.target>1.7</maven.compiler.target>
>
> <hadoop.version>1.0.2</hadoop.version>
>
> </properties>
>
>
> <repositories>
>
> <repository>
>
> <id>Akka repository</id>
>
> <url>http://repo.akka.io/releases</url>
>
> </repository>
>
> </repositories>
>
> <dependencies>
>
> <dependency> <!-- Spark dependency -->
>
> <groupId>org.apache.spark</groupId>
>
> <artifactId>spark-core_2.10</artifactId>
>
> <version>1.0.0</version>
>
> </dependency>
>
>
> <dependency>
>
> <groupId>org.mongodb</groupId>
>
> <artifactId>mongo-hadoop-core</artifactId>
>
> <version>1.0.0</version>
>
> </dependency>
>
>
> </dependencies>
>
> <build>
>
> <outputDirectory>target/java-${maven.compiler.source}/classes</
> outputDirectory>
>
> <testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
> testOutputDirectory>
>
> <plugins>
>
> <plugin>
>
> <groupId>org.apache.maven.plugins</groupId>
>
> <artifactId>maven-shade-plugin</artifactId>
>
> <version>2.3</version>
>
> <configuration>
>
>  <shadedArtifactAttached>false</shadedArtifactAttached>
>
>  <outputFile>
> ${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
> </outputFile>
>
> <artifactSet>
>
> <includes>
>
> <include>*:*</include>
>
> </includes>
>
> </artifactSet>
>
> <!--
>
>  <filters>
>
> <filter>
>
>  <artifact>*:*</artifact>
>
> <excludes>
>
> <exclude>META-INF/*.SF</exclude>
>
>  <exclude>META-INF/*.DSA</exclude>
>
> <exclude>META-INF/*.RSA</exclude>
>
> </excludes>
>
>  </filter>
>
> </filters>
>
>  -->
>
> </configuration>
>
> <executions>
>
> <execution>
>
> <phase>package</phase>
>
> <goals>
>
> <goal>shade</goal>
>
> </goals>
>
> <configuration>
>
> <transformers>
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.AppendingTransformer">
>
> <resource>reference.conf</resource>
>
> </transformer>
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
>
> <resource>log4j.properties</resource>
>
> </transformer>
>
> </transformers>
>
> </configuration>
>
> </execution>
>
> </executions>
>
> </plugin>
>
> </plugins>
>
> </build>
>
> </project>
>
> It is confusing me because i can see the output of mvm package:
>
> ...
>
> [INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded jar.
>
> ...
> in theory, the uber jar is created with all dependencies but if i open the
> jar i don't see them.
>
> Thanks in advance
>
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
>
> 2014-06-23 21:27 GMT+02:00 Yana Kadiyska <ya...@gmail.com>:
>
>  One thing I noticed around the place where you get the first error --
>> you are calling words.map instead of words.mapToPair. map produces
>> JavaRDD<R> whereas mapToPair gives you a JavaPairRDD. I don't use the
>> Java APIs myself but it looks to me like you need to check the types
>> more carefully.
>>
>> On Mon, Jun 23, 2014 at 2:15 PM, Alonso Isidoro Roman
>> <al...@gmail.com> wrote:
>> > Hi all,
>> >
>> > I am new to Spark, so this is probably a basic question. i want to
>> explore
>> > the possibilities of this fw, concretely using it in conjunction with 3
>> > party libs, like mongodb, for example.
>> >
>> > I have been keeping instructions from
>> > http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in order to
>> > connect spark with mongodb. This example is made with
>> > spark-core_2.1.0-0.9.0.jar so first thing i did is to update pom.xml
>> with
>> > latest versions.
>> >
>> > This is my pom.xml
>> >
>> > <project xmlns="http://maven.apache.org/POM/4.0.0"
>> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> >
>> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>> > http://maven.apache.org/xsd/maven-4.0.0.xsd">
>> >
>> > <groupId>com.aironman.spark</groupId>
>> >
>> > <artifactId>simple-project</artifactId>
>> >
>> > <modelVersion>4.0.0</modelVersion>
>> >
>> > <name>Simple Project</name>
>> >
>> > <packaging>jar</packaging>
>> >
>> > <version>1.0</version>
>> >
>> > <repositories>
>> >
>> > <repository>
>> >
>> > <id>Akka repository</id>
>> >
>> > <url>http://repo.akka.io/releases</url>
>> >
>> > </repository>
>> >
>> > </repositories>
>> >
>> > <dependencies>
>> >
>> > <dependency> <!-- Spark dependency -->
>> >
>> > <groupId>org.apache.spark</groupId>
>> >
>> > <artifactId>spark-core_2.10</artifactId>
>> >
>> > <version>1.0.0</version>
>> >
>> > </dependency>
>> >
>> >
>> > <dependency>
>> >
>> > <groupId>org.mongodb</groupId>
>> >
>> > <artifactId>mongo-hadoop-core</artifactId>
>> >
>> > <version>1.0.0</version>
>> >
>> > </dependency>
>> >
>> >
>> > </dependencies>
>> >
>> > </project>
>> >
>> > As you can see, super simple pom.xml
>> >
>> > And this is the JavaWordCount.java
>> >
>> > import java.util.Arrays;
>> >
>> > import java.util.Collections;
>> >
>> >
>> > import org.apache.hadoop.conf.Configuration;
>> >
>> > import org.apache.spark.api.java.JavaPairRDD;
>> >
>> > import org.apache.spark.api.java.JavaRDD;
>> >
>> > import org.apache.spark.api.java.JavaSparkContext;
>> >
>> > import org.apache.spark.api.java.function.FlatMapFunction;
>> >
>> > import org.apache.spark.api.java.function.Function2;
>> >
>> > import org.apache.spark.api.java.function.PairFunction;
>> >
>> > import org.bson.BSONObject;
>> >
>> > import org.bson.BasicBSONObject;
>> >
>> >
>> > import scala.Tuple2;
>> >
>> >
>> > import com.mongodb.hadoop.MongoOutputFormat;
>> >
>> >
>> > /***
>> >
>> >  * Esta clase se supone que se conecta a un cluster mongodb para
>> ejecutar
>> > una tarea word count por cada palabra almacenada en la bd.
>> >
>> >  * el problema es que esta api esta rota, creo. Estoy usando la ultima
>> > version del fw, la 1.0.0. Esto no es spark-streaming, lo suyo seria
>> usar un
>> > ejemplo
>> >
>> >  * sobre spark-streaming conectandose a un base mongodb, o usar
>> > spark-streaming junto con spring integration, es decir, conectar spark
>> con
>> > un servicio web que
>> >
>> >  * periodicamente alimentaria spark...
>> >
>> >  * @author aironman
>> >
>> >  *
>> >
>> >  */
>> >
>> > public class JavaWordCount {
>> >
>> >
>> >
>> >     public static void main(String[] args) {
>> >
>> >
>> >
>> >         JavaSparkContext sc = new JavaSparkContext("local", "Java Word
>> > Count");
>> >
>> >
>> >
>> >         Configuration config = new Configuration();
>> >
>> >         config.set("mongo.input.uri",
>> > "mongodb:127.0.0.1:27017/beowulf.input");
>> >
>> >         config.set("mongo.output.uri",
>> > "mongodb:127.0.0.1:27017/beowulf.output");
>> >
>> >
>> >
>> >
>> >
>> >         JavaPairRDD<Object, BSONObject> mongoRDD =
>> > sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class,
>> > Object.class, BSONObject.class);
>> >
>> >
>> >
>> > //         Input contains tuples of (ObjectId, BSONObject)
>> >
>> >         JavaRDD<String> words = mongoRDD.flatMap(new
>> > FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>> >
>> >             @Override
>> >
>> >             public Iterable<String> call(Tuple2<Object, BSONObject>
>> arg) {
>> >
>> >                 Object o = arg._2.get("text");
>> >
>> >                 if (o instanceof String) {
>> >
>> >                     String str = (String) o;
>> >
>> >                     str = str.toLowerCase().replaceAll("[.,!?\n]", " ");
>> >
>> >                     return Arrays.asList(str.split(" "));
>> >
>> >                 } else {
>> >
>> >                     return Collections.emptyList();
>> >
>> >                 }
>> >
>> >             }
>> >
>> >         });
>> >
>> >         //here is an error, The method map(Function<String,R>) in the
>> type
>> > JavaRDD<String> is not applicable for the arguments (new
>> > PairFunction<String,String,Integer>(){})
>> >
>> >         JavaPairRDD<String, Integer> ones = words.map(new
>> > PairFunction<String, String, Integer>() {
>> >
>> >             public Tuple2<String, Integer> call(String s) {
>> >
>> >                 return new Tuple2<>(s, 1);
>> >
>> >             }
>> >
>> >         });
>> >
>> >         JavaPairRDD<String, Integer> counts = ones.reduceByKey(new
>> > Function2<Integer, Integer, Integer>() {
>> >
>> >             public Integer call(Integer i1, Integer i2) {
>> >
>> >                 return i1 + i2;
>> >
>> >             }
>> >
>> >         });
>> >
>> >
>> >
>> >         //another error, The method
>> map(Function<Tuple2<String,Integer>,R>)
>> > in the type JavaPairRDD<String,Integer> is not applicable for the
>> arguments
>> > (new    //PairFunction<Tuple2<String,Integer>,Object,BSONObject>(){})
>> >
>> >
>> > //         Output contains tuples of (null, BSONObject) - ObjectId will
>> be
>> > generated by Mongo driver if null
>> >
>> >         JavaPairRDD<Object, BSONObject> save = counts.map(new
>> > PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
>> >
>> >             @Override
>> >
>> >             public Tuple2<Object, BSONObject> call(Tuple2<String,
>> Integer>
>> > tuple) {
>> >
>> >                 BSONObject bson = new BasicBSONObject();
>> >
>> >                 bson.put("word", tuple._1);
>> >
>> >                 bson.put("count", tuple._2);
>> >
>> >                 return new Tuple2<>(null, bson);
>> >
>> >             }
>> >
>> >         });
>> >
>> >
>> >
>> > //         Only MongoOutputFormat and config are relevant
>> >
>> >         save.saveAsNewAPIHadoopFile("file:/bogus", Object.class,
>> > Object.class, MongoOutputFormat.class, config);
>> >
>> >     }
>> >
>> >
>> >
>> > }
>> >
>> >
>> > It looks like jar hell dependency, isn't it?  can anyone guide or help
>> me?
>> >
>> > Another thing, i don t like closures, is it possible to use this fw
>> without
>> > using it?
>> > Another question, are this objects, JavaSparkContext sc,
>> JavaPairRDD<Object,
>> > BSONObject> mongoRDD ThreadSafe? Can i use them as singleton?
>> >
>> > Thank you very much and apologizes if the questions are not trending
>> topic
>> > :)
>> >
>> > Alonso Isidoro Roman.
>> >
>> > Mis citas preferidas (de hoy) :
>> > "Si depurar es el proceso de quitar los errores de software, entonces
>> > programar debe ser el proceso de introducirlos..."
>> >  -  Edsger Dijkstra
>> >
>> > My favorite quotes (today):
>> > "If debugging is the process of removing software bugs, then programming
>> > must be the process of putting ..."
>> >   - Edsger Dijkstra
>> >
>> > "If you pay peanuts you get monkeys"
>> >
>>
>
>


-- 
-Sheryl

Re: about a JavaWordCount example with spark-core_2.10-1.0.0.jar

Posted by Alonso Isidoro Roman <al...@gmail.com>.

Hi Yana, thanks for the answer, of course you are right and it works! no
errors, now i am trying to execute it and i am getting an error

Exception in thread "main" java.lang.NoClassDefFoundError:
com/mongodb/hadoop/MongoInputFormat

at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:49)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.ClassNotFoundException:
com.mongodb.hadoop.MongoInputFormat

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:425)

at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

... 8 more


That class is located in mongo-hadoop-core-1.0.0.jar which is located and
declared in pom.xml and i am using maven-shade-plugin, this is the actual
pom.xml:



<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">

<groupId>com.aironman.spark</groupId>

<artifactId>simple-project</artifactId>

<modelVersion>4.0.0</modelVersion>

<name>Simple Project</name>

<packaging>jar</packaging>

<version>1.0</version>

<properties>

<maven.compiler.source>1.7</maven.compiler.source>

<maven.compiler.target>1.7</maven.compiler.target>

<hadoop.version>1.0.2</hadoop.version>

</properties>


<repositories>

<repository>

<id>Akka repository</id>

<url>http://repo.akka.io/releases</url>

</repository>

</repositories>

<dependencies>

<dependency> <!-- Spark dependency -->

<groupId>org.apache.spark</groupId>

<artifactId>spark-core_2.10</artifactId>

<version>1.0.0</version>

</dependency>


<dependency>

<groupId>org.mongodb</groupId>

<artifactId>mongo-hadoop-core</artifactId>

<version>1.0.0</version>

</dependency>


</dependencies>

<build>

<outputDirectory>target/java-${maven.compiler.source}/classes</
outputDirectory>

<testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
testOutputDirectory>

<plugins>

<plugin>

<groupId>org.apache.maven.plugins</groupId>

<artifactId>maven-shade-plugin</artifactId>

<version>2.3</version>

<configuration>

 <shadedArtifactAttached>false</shadedArtifactAttached>

 <outputFile>
${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
</outputFile>

<artifactSet>

<includes>

<include>*:*</include>

</includes>

</artifactSet>

<!--

<filters>

<filter>

<artifact>*:*</artifact>

<excludes>

<exclude>META-INF/*.SF</exclude>

<exclude>META-INF/*.DSA</exclude>

<exclude>META-INF/*.RSA</exclude>

</excludes>

</filter>

</filters>

-->

</configuration>

<executions>

<execution>

<phase>package</phase>

<goals>

<goal>shade</goal>

</goals>

<configuration>

<transformers>

<transformer

implementation=
"org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />

<transformer

implementation=
"org.apache.maven.plugins.shade.resource.AppendingTransformer">

<resource>reference.conf</resource>

</transformer>

<transformer

implementation=
"org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">

<resource>log4j.properties</resource>

</transformer>

</transformers>

</configuration>

</execution>

</executions>

</plugin>

</plugins>

</build>

</project>

It is confusing me because i can see the output of mvm package:

...

[INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded jar.

...
in theory, the uber jar is created with all dependencies but if i open the
jar i don't see them.

Thanks in advance


Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"



2014-06-23 21:27 GMT+02:00 Yana Kadiyska <ya...@gmail.com>:

> One thing I noticed around the place where you get the first error --
> you are calling words.map instead of words.mapToPair. map produces
> JavaRDD<R> whereas mapToPair gives you a JavaPairRDD. I don't use the
> Java APIs myself but it looks to me like you need to check the types
> more carefully.
>
> On Mon, Jun 23, 2014 at 2:15 PM, Alonso Isidoro Roman
> <al...@gmail.com> wrote:
> > Hi all,
> >
> > I am new to Spark, so this is probably a basic question. i want to
> explore
> > the possibilities of this fw, concretely using it in conjunction with 3
> > party libs, like mongodb, for example.
> >
> > I have been keeping instructions from
> > http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in order to
> > connect spark with mongodb. This example is made with
> > spark-core_2.1.0-0.9.0.jar so first thing i did is to update pom.xml with
> > latest versions.
> >
> > This is my pom.xml
> >
> > <project xmlns="http://maven.apache.org/POM/4.0.0"
> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> >
> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> > http://maven.apache.org/xsd/maven-4.0.0.xsd">
> >
> > <groupId>com.aironman.spark</groupId>
> >
> > <artifactId>simple-project</artifactId>
> >
> > <modelVersion>4.0.0</modelVersion>
> >
> > <name>Simple Project</name>
> >
> > <packaging>jar</packaging>
> >
> > <version>1.0</version>
> >
> > <repositories>
> >
> > <repository>
> >
> > <id>Akka repository</id>
> >
> > <url>http://repo.akka.io/releases</url>
> >
> > </repository>
> >
> > </repositories>
> >
> > <dependencies>
> >
> > <dependency> <!-- Spark dependency -->
> >
> > <groupId>org.apache.spark</groupId>
> >
> > <artifactId>spark-core_2.10</artifactId>
> >
> > <version>1.0.0</version>
> >
> > </dependency>
> >
> >
> > <dependency>
> >
> > <groupId>org.mongodb</groupId>
> >
> > <artifactId>mongo-hadoop-core</artifactId>
> >
> > <version>1.0.0</version>
> >
> > </dependency>
> >
> >
> > </dependencies>
> >
> > </project>
> >
> > As you can see, super simple pom.xml
> >
> > And this is the JavaWordCount.java
> >
> > import java.util.Arrays;
> >
> > import java.util.Collections;
> >
> >
> > import org.apache.hadoop.conf.Configuration;
> >
> > import org.apache.spark.api.java.JavaPairRDD;
> >
> > import org.apache.spark.api.java.JavaRDD;
> >
> > import org.apache.spark.api.java.JavaSparkContext;
> >
> > import org.apache.spark.api.java.function.FlatMapFunction;
> >
> > import org.apache.spark.api.java.function.Function2;
> >
> > import org.apache.spark.api.java.function.PairFunction;
> >
> > import org.bson.BSONObject;
> >
> > import org.bson.BasicBSONObject;
> >
> >
> > import scala.Tuple2;
> >
> >
> > import com.mongodb.hadoop.MongoOutputFormat;
> >
> >
> > /***
> >
> >  * Esta clase se supone que se conecta a un cluster mongodb para ejecutar
> > una tarea word count por cada palabra almacenada en la bd.
> >
> >  * el problema es que esta api esta rota, creo. Estoy usando la ultima
> > version del fw, la 1.0.0. Esto no es spark-streaming, lo suyo seria usar
> un
> > ejemplo
> >
> >  * sobre spark-streaming conectandose a un base mongodb, o usar
> > spark-streaming junto con spring integration, es decir, conectar spark
> con
> > un servicio web que
> >
> >  * periodicamente alimentaria spark...
> >
> >  * @author aironman
> >
> >  *
> >
> >  */
> >
> > public class JavaWordCount {
> >
> >
> >
> >     public static void main(String[] args) {
> >
> >
> >
> >         JavaSparkContext sc = new JavaSparkContext("local", "Java Word
> > Count");
> >
> >
> >
> >         Configuration config = new Configuration();
> >
> >         config.set("mongo.input.uri",
> > "mongodb:127.0.0.1:27017/beowulf.input");
> >
> >         config.set("mongo.output.uri",
> > "mongodb:127.0.0.1:27017/beowulf.output");
> >
> >
> >
> >
> >
> >         JavaPairRDD<Object, BSONObject> mongoRDD =
> > sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class,
> > Object.class, BSONObject.class);
> >
> >
> >
> > //         Input contains tuples of (ObjectId, BSONObject)
> >
> >         JavaRDD<String> words = mongoRDD.flatMap(new
> > FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
> >
> >             @Override
> >
> >             public Iterable<String> call(Tuple2<Object, BSONObject> arg)
> {
> >
> >                 Object o = arg._2.get("text");
> >
> >                 if (o instanceof String) {
> >
> >                     String str = (String) o;
> >
> >                     str = str.toLowerCase().replaceAll("[.,!?\n]", " ");
> >
> >                     return Arrays.asList(str.split(" "));
> >
> >                 } else {
> >
> >                     return Collections.emptyList();
> >
> >                 }
> >
> >             }
> >
> >         });
> >
> >         //here is an error, The method map(Function<String,R>) in the
> type
> > JavaRDD<String> is not applicable for the arguments (new
> > PairFunction<String,String,Integer>(){})
> >
> >         JavaPairRDD<String, Integer> ones = words.map(new
> > PairFunction<String, String, Integer>() {
> >
> >             public Tuple2<String, Integer> call(String s) {
> >
> >                 return new Tuple2<>(s, 1);
> >
> >             }
> >
> >         });
> >
> >         JavaPairRDD<String, Integer> counts = ones.reduceByKey(new
> > Function2<Integer, Integer, Integer>() {
> >
> >             public Integer call(Integer i1, Integer i2) {
> >
> >                 return i1 + i2;
> >
> >             }
> >
> >         });
> >
> >
> >
> >         //another error, The method
> map(Function<Tuple2<String,Integer>,R>)
> > in the type JavaPairRDD<String,Integer> is not applicable for the
> arguments
> > (new    //PairFunction<Tuple2<String,Integer>,Object,BSONObject>(){})
> >
> >
> > //         Output contains tuples of (null, BSONObject) - ObjectId will
> be
> > generated by Mongo driver if null
> >
> >         JavaPairRDD<Object, BSONObject> save = counts.map(new
> > PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
> >
> >             @Override
> >
> >             public Tuple2<Object, BSONObject> call(Tuple2<String,
> Integer>
> > tuple) {
> >
> >                 BSONObject bson = new BasicBSONObject();
> >
> >                 bson.put("word", tuple._1);
> >
> >                 bson.put("count", tuple._2);
> >
> >                 return new Tuple2<>(null, bson);
> >
> >             }
> >
> >         });
> >
> >
> >
> > //         Only MongoOutputFormat and config are relevant
> >
> >         save.saveAsNewAPIHadoopFile("file:/bogus", Object.class,
> > Object.class, MongoOutputFormat.class, config);
> >
> >     }
> >
> >
> >
> > }
> >
> >
> > It looks like jar hell dependency, isn't it?  can anyone guide or help
> me?
> >
> > Another thing, i don t like closures, is it possible to use this fw
> without
> > using it?
> > Another question, are this objects, JavaSparkContext sc,
> JavaPairRDD<Object,
> > BSONObject> mongoRDD ThreadSafe? Can i use them as singleton?
> >
> > Thank you very much and apologizes if the questions are not trending
> topic
> > :)
> >
> > Alonso Isidoro Roman.
> >
> > Mis citas preferidas (de hoy) :
> > "Si depurar es el proceso de quitar los errores de software, entonces
> > programar debe ser el proceso de introducirlos..."
> >  -  Edsger Dijkstra
> >
> > My favorite quotes (today):
> > "If debugging is the process of removing software bugs, then programming
> > must be the process of putting ..."
> >   - Edsger Dijkstra
> >
> > "If you pay peanuts you get monkeys"
> >
>

Re: about a JavaWordCount example with spark-core_2.10-1.0.0.jar

Posted by Yana Kadiyska <ya...@gmail.com>.

One thing I noticed around the place where you get the first error --
you are calling words.map instead of words.mapToPair. map produces
JavaRDD<R> whereas mapToPair gives you a JavaPairRDD. I don't use the
Java APIs myself but it looks to me like you need to check the types
more carefully.

On Mon, Jun 23, 2014 at 2:15 PM, Alonso Isidoro Roman
<al...@gmail.com> wrote:
> Hi all,
>
> I am new to Spark, so this is probably a basic question. i want to explore
> the possibilities of this fw, concretely using it in conjunction with 3
> party libs, like mongodb, for example.
>
> I have been keeping instructions from
> http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in order to
> connect spark with mongodb. This example is made with
> spark-core_2.1.0-0.9.0.jar so first thing i did is to update pom.xml with
> latest versions.
>
> This is my pom.xml
>
> <project xmlns="http://maven.apache.org/POM/4.0.0"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>
> <groupId>com.aironman.spark</groupId>
>
> <artifactId>simple-project</artifactId>
>
> <modelVersion>4.0.0</modelVersion>
>
> <name>Simple Project</name>
>
> <packaging>jar</packaging>
>
> <version>1.0</version>
>
> <repositories>
>
> <repository>
>
> <id>Akka repository</id>
>
> <url>http://repo.akka.io/releases</url>
>
> </repository>
>
> </repositories>
>
> <dependencies>
>
> <dependency> <!-- Spark dependency -->
>
> <groupId>org.apache.spark</groupId>
>
> <artifactId>spark-core_2.10</artifactId>
>
> <version>1.0.0</version>
>
> </dependency>
>
>
> <dependency>
>
> <groupId>org.mongodb</groupId>
>
> <artifactId>mongo-hadoop-core</artifactId>
>
> <version>1.0.0</version>
>
> </dependency>
>
>
> </dependencies>
>
> </project>
>
> As you can see, super simple pom.xml
>
> And this is the JavaWordCount.java
>
> import java.util.Arrays;
>
> import java.util.Collections;
>
>
> import org.apache.hadoop.conf.Configuration;
>
> import org.apache.spark.api.java.JavaPairRDD;
>
> import org.apache.spark.api.java.JavaRDD;
>
> import org.apache.spark.api.java.JavaSparkContext;
>
> import org.apache.spark.api.java.function.FlatMapFunction;
>
> import org.apache.spark.api.java.function.Function2;
>
> import org.apache.spark.api.java.function.PairFunction;
>
> import org.bson.BSONObject;
>
> import org.bson.BasicBSONObject;
>
>
> import scala.Tuple2;
>
>
> import com.mongodb.hadoop.MongoOutputFormat;
>
>
> /***
>
>  * Esta clase se supone que se conecta a un cluster mongodb para ejecutar
> una tarea word count por cada palabra almacenada en la bd.
>
>  * el problema es que esta api esta rota, creo. Estoy usando la ultima
> version del fw, la 1.0.0. Esto no es spark-streaming, lo suyo seria usar un
> ejemplo
>
>  * sobre spark-streaming conectandose a un base mongodb, o usar
> spark-streaming junto con spring integration, es decir, conectar spark con
> un servicio web que
>
>  * periodicamente alimentaria spark...
>
>  * @author aironman
>
>  *
>
>  */
>
> public class JavaWordCount {
>
>
>
>     public static void main(String[] args) {
>
>
>
>         JavaSparkContext sc = new JavaSparkContext("local", "Java Word
> Count");
>
>
>
>         Configuration config = new Configuration();
>
>         config.set("mongo.input.uri",
> "mongodb:127.0.0.1:27017/beowulf.input");
>
>         config.set("mongo.output.uri",
> "mongodb:127.0.0.1:27017/beowulf.output");
>
>
>
>
>
>         JavaPairRDD<Object, BSONObject> mongoRDD =
> sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class,
> Object.class, BSONObject.class);
>
>
>
> //         Input contains tuples of (ObjectId, BSONObject)
>
>         JavaRDD<String> words = mongoRDD.flatMap(new
> FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>
>             @Override
>
>             public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
>
>                 Object o = arg._2.get("text");
>
>                 if (o instanceof String) {
>
>                     String str = (String) o;
>
>                     str = str.toLowerCase().replaceAll("[.,!?\n]", " ");
>
>                     return Arrays.asList(str.split(" "));
>
>                 } else {
>
>                     return Collections.emptyList();
>
>                 }
>
>             }
>
>         });
>
>         //here is an error, The method map(Function<String,R>) in the type
> JavaRDD<String> is not applicable for the arguments (new
> PairFunction<String,String,Integer>(){})
>
>         JavaPairRDD<String, Integer> ones = words.map(new
> PairFunction<String, String, Integer>() {
>
>             public Tuple2<String, Integer> call(String s) {
>
>                 return new Tuple2<>(s, 1);
>
>             }
>
>         });
>
>         JavaPairRDD<String, Integer> counts = ones.reduceByKey(new
> Function2<Integer, Integer, Integer>() {
>
>             public Integer call(Integer i1, Integer i2) {
>
>                 return i1 + i2;
>
>             }
>
>         });
>
>
>
>         //another error, The method map(Function<Tuple2<String,Integer>,R>)
> in the type JavaPairRDD<String,Integer> is not applicable for the arguments
> (new    //PairFunction<Tuple2<String,Integer>,Object,BSONObject>(){})
>
>
> //         Output contains tuples of (null, BSONObject) - ObjectId will be
> generated by Mongo driver if null
>
>         JavaPairRDD<Object, BSONObject> save = counts.map(new
> PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
>
>             @Override
>
>             public Tuple2<Object, BSONObject> call(Tuple2<String, Integer>
> tuple) {
>
>                 BSONObject bson = new BasicBSONObject();
>
>                 bson.put("word", tuple._1);
>
>                 bson.put("count", tuple._2);
>
>                 return new Tuple2<>(null, bson);
>
>             }
>
>         });
>
>
>
> //         Only MongoOutputFormat and config are relevant
>
>         save.saveAsNewAPIHadoopFile("file:/bogus", Object.class,
> Object.class, MongoOutputFormat.class, config);
>
>     }
>
>
>
> }
>
>
> It looks like jar hell dependency, isn't it?  can anyone guide or help me?
>
> Another thing, i don t like closures, is it possible to use this fw without
> using it?
> Another question, are this objects, JavaSparkContext sc, JavaPairRDD<Object,
> BSONObject> mongoRDD ThreadSafe? Can i use them as singleton?
>
> Thank you very much and apologizes if the questions are not trending topic
> :)
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>