Posted to user@spark.apache.org by Sampo Niskanen <sa...@wellmo.com> on 2014/02/04 14:58:32 UTC

Re: Spark + MongoDB

Hi,

Thanks for the pointer.  However, I'm still unable to generate the RDD
using MongoInputFormat.  I'm trying to add the mongo-hadoop connector to
the Java SimpleApp in the quickstart at
http://spark.incubator.apache.org/docs/latest/quick-start.html


The mongo-hadoop connector contains two versions of MongoInputFormat, one
extending org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>, the
other extending org.apache.hadoop.mapred.InputFormat<Object, BSONObject>.
 Neither of them is accepted by the compiler, and I'm unsure why:

        JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
        sc.hadoopRDD(job, com.mongodb.hadoop.mapred.MongoInputFormat.class,
                Object.class, BSONObject.class);
        sc.hadoopRDD(job, com.mongodb.hadoop.MongoInputFormat.class,
                Object.class, BSONObject.class);

Eclipse gives the following error for both of the latter two lines:

 Bound mismatch: The generic method hadoopRDD(JobConf, Class<F>, Class<K>,
Class<V>) of type JavaSparkContext is not applicable for the arguments
(JobConf, Class<MongoInputFormat>, Class<Object>, Class<BSONObject>). The
inferred type MongoInputFormat is not a valid substitute for the bounded
parameter <F extends InputFormat<K,V>>



I'm using Spark 0.9.0.  Might this be caused by a conflict of Hadoop
versions?  I downloaded the mongo-hadoop connector for Hadoop 2.2, but I
haven't figured out how to select which Hadoop version Spark uses when it
is pulled in from an sbt file.  (The sbt file is the one described in the
quickstart.)
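
A sketch of how the quickstart-style sbt build might pin the Hadoop version
by declaring hadoop-client explicitly; the artifact coordinates and version
numbers below are assumptions to adjust for the actual environment, and the
mongo-hadoop connector jar still has to be put on the classpath separately:

    // build.sbt -- a sketch only; versions and coordinates are assumptions
    name := "Simple App"

    version := "1.0"

    scalaVersion := "2.10.3"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
      // Pinning hadoop-client selects the Hadoop version Spark links against.
      "org.apache.hadoop" % "hadoop-client" % "2.2.0"
    )

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"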


Thanks for any help.


Best regards,
   Sampo N.



On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das
<ta...@gmail.com> wrote:

> I walked through the example in the second link you gave. The Treasury
> Yield example referred to there is here:
> https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java
> Note the InputFormat and OutputFormat used in the job configuration. These
> InputFormat and OutputFormat classes specify how data is read from and
> written to MongoDB. You should be able to use the same InputFormat and
> OutputFormat classes in Spark as well. For saving to MongoDB, use
> yourRDD.saveAsHadoopFile(.... specify the output format class ...), and to
> read from MongoDB, use sparkContext.hadoopFile(.... specify the input
> format class ....).
>
> TD
>
>
> On Thu, Jan 30, 2014 at 12:36 PM, Sampo Niskanen <sampo.niskanen@wellmo.com> wrote:
>
>> Hi,
>>
>> We're starting to build an analytics framework for our wellness service.
>>  While our data is not yet Big, we'd like to use a framework that will
>> scale as needed, and Spark seems to be the best around.
>>
>> I'm new to Hadoop and Spark, and I'm having difficulty figuring out how
>> to use Spark in connection with MongoDB.  Apparently, I should be able to
>> use the mongo-hadoop connector (https://github.com/mongodb/mongo-hadoop)
>> also with Spark, but haven't figured out how.
>>
>> I've run through the Spark tutorials and been able to set up a
>> single-machine Hadoop system with the MongoDB connector as instructed at
>>
>> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
>> and
>> http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
>>
>> Could someone give some instructions or pointers on how to configure and
>> use the mongo-hadoop connector with Spark?  I haven't been able to find any
>> documentation about this.
>>
>>
>> Thanks.
>>
>>
>> Best regards,
>>    Sampo N.
>>
>>
>>
>
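
A minimal sketch of the write direction described in TD's quoted reply
above, using the new-API variants that the rest of this thread converges
on. MongoOutputFormat, the mongo.output.uri property, and the dummy output
path are assumptions about the mongo-hadoop connector rather than something
stated in this thread, and the URI is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.bson.BSONObject;

    import com.mongodb.hadoop.MongoOutputFormat;

    public class MongoWriteSketch {
        // Writes an existing JavaPairRDD<Object, BSONObject> back to MongoDB.
        static void saveToMongo(JavaPairRDD<Object, BSONObject> documents) {
            // The target database and collection are given as a Hadoop
            // property; the URI below is a placeholder.
            Configuration config = new Configuration();
            config.set("mongo.output.uri",
                    "mongodb://localhost:27017/mydb.results");

            // saveAsNewAPIHadoopFile pairs with the new-API
            // (org.apache.hadoop.mapreduce) MongoOutputFormat; the path
            // argument is presumably ignored by the connector, so a dummy
            // value is passed.
            documents.saveAsNewAPIHadoopFile("file:///tmp/unused",
                    Object.class, BSONObject.class, MongoOutputFormat.class,
                    config);
        }
    }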

Re: Spark + MongoDB

Posted by Matei Zaharia <ma...@gmail.com>.
Very cool, thanks for writing this. I’ll link it from our website.

Matei



Re: Spark + MongoDB

Posted by Sampo Niskanen <sa...@wellmo.com>.
Hi,

Since getting Spark + MongoDB to work together was not very obvious (at
least to me), I wrote a tutorial about it on my blog, with an example
application:
http://codeforhire.com/2014/02/18/using-spark-with-mongodb/

Hope it's of use to someone else as well.


Cheers,

    Sampo Niskanen
    Lead developer / Wellmo
    sampo.niskanen@wellmo.com
    +358 40 820 5291




Re: Spark + MongoDB

Posted by Tathagata Das <ta...@gmail.com>.
Can you try using sc.newAPIHadoop**?
There are two kinds of classes because the Hadoop input and output format
APIs underwent a significant change a few years ago: the old
org.apache.hadoop.mapred classes go with hadoopRDD, while the newer
org.apache.hadoop.mapreduce classes go with the newAPIHadoop* methods.

TD
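
A minimal sketch of the read path following this suggestion, pairing the
new-API (org.apache.hadoop.mapreduce) MongoInputFormat with
newAPIHadoopRDD. The mongo.input.uri property is an assumption about the
mongo-hadoop connector's configuration, and the URI is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;

    import com.mongodb.hadoop.MongoInputFormat;

    public class MongoReadSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

            // mongo-hadoop is configured through Hadoop Configuration
            // properties; the URI below is a placeholder for the source
            // database and collection.
            Configuration config = new Configuration();
            config.set("mongo.input.uri",
                    "mongodb://localhost:27017/mydb.mycollection");

            // newAPIHadoopRDD expects an org.apache.hadoop.mapreduce
            // InputFormat, which is what com.mongodb.hadoop.MongoInputFormat
            // extends, so the bound mismatch seen with hadoopRDD does not
            // arise here.
            JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
                    config, MongoInputFormat.class, Object.class,
                    BSONObject.class);

            System.out.println("Number of documents: " + documents.count());
        }
    }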

