Posted to dev@spark.apache.org by Sandeep Giri <sa...@knowbigdata.com> on 2015/09/11 11:48:56 UTC

Re: MongoDB and Spark

use map-reduce.
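
As a concrete sketch of that suggestion (an illustration, not code from this
thread): MongoDB's map-reduce runs JavaScript map and reduce functions
server-side, so the aggregation work stays on the database nodes. A minimal
example with the MongoDB Java driver 3.x, where the database "mydb",
collection "events", and field "category" are all hypothetical:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class MongoMapReduceSketch {
        public static void main(String[] args) {
            try (MongoClient client = new MongoClient("localhost", 27017)) {
                MongoDatabase db = client.getDatabase("mydb");
                MongoCollection<Document> events = db.getCollection("events");

                // Map and reduce are JavaScript strings executed server-side;
                // this pair counts documents per "category" value.
                String map = "function() { emit(this.category, 1); }";
                String reduce = "function(key, values) { return Array.sum(values); }";

                for (Document doc : events.mapReduce(map, reduce)) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }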

On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek <Ab...@xerox.com>
wrote:

> Hello,
>
>
>
> Is there any way to query multiple collections from MongoDB using Spark
> and Java? I want to create only one Configuration object. Please help
> if anyone has something regarding this.
>
>
>
>
>
> Thank You
>
> Abhishek
>

RE: MongoDB and Spark

Posted by "Mishra, Abhishek" <Ab...@xerox.com>.
Hello,

Don't get me wrong here; just to check my understanding after reading your reply: are you telling me about MongoDB instances on multiple nodes?


I am talking about a single MongoDB instance/server with multiple collections in it (say, multiple tables).

Please help me in understanding.
Abhishek


From: Corey Nolet [mailto:cjnolet@gmail.com]
Sent: Friday, September 11, 2015 7:58 PM
To: Sandeep Giri
Cc: Mishra, Abhishek; user@spark.apache.org; dev@spark.apache.org
Subject: Re: MongoDB and Spark

Unfortunately, MongoDB does not directly expose its locality via its client API, so the problem with trying to schedule Spark tasks against it is that the tasks themselves cannot be scheduled locally on the nodes containing the query results, which means you can only assume most results will be sent over the network to the task that needs to process them. This is bad. The other reason (also related to locality) is that I'm not sure there's an easy way to spread the results of a query over multiple different clients; thus you'd probably have to start your Spark RDD with a single partition and then repartition. At that point you've taken data from multiple MongoDB nodes and collected it on a single node just to re-partition it, again across the network, onto multiple nodes. This is also bad.
I think this is the reason it was recommended to use MongoDB's map-reduce: it can use its locality information internally. I had this same issue with Couchbase a couple of years back; it's unfortunate, but it's the reality.



On Fri, Sep 11, 2015 at 9:34 AM, Sandeep Giri <sa...@knowbigdata.com> wrote:
I think it should be possible by loading each collection as an RDD and then doing a union on them.

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com
Phone: +1-253-397-1945 (Office)



On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek <Ab...@xerox.com> wrote:
Anything using Spark RDDs?

Abhishek

From: Sandeep Giri [mailto:sandeep@knowbigdata.com]
Sent: Friday, September 11, 2015 3:19 PM
To: Mishra, Abhishek; user@spark.apache.org; dev@spark.apache.org
Subject: Re: MongoDB and Spark


use map-reduce.

On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek <Ab...@xerox.com> wrote:
Hello,

Is there any way to query multiple collections from MongoDB using Spark and Java? I want to create only one Configuration object. Please help if anyone has something regarding this.


Thank You
Abhishek



Re: MongoDB and Spark

Posted by Corey Nolet <cj...@gmail.com>.
Unfortunately, MongoDB does not directly expose its locality via its client
API, so the problem with trying to schedule Spark tasks against it is that
the tasks themselves cannot be scheduled locally on the nodes containing the
query results, which means you can only assume most results will be sent
over the network to the task that needs to process them. This is bad. The
other reason (also related to locality) is that I'm not sure there's an easy
way to spread the results of a query over multiple different clients; thus
you'd probably have to start your Spark RDD with a single partition and then
repartition. At that point you've taken data from multiple MongoDB nodes and
collected it on a single node just to re-partition it, again across the
network, onto multiple nodes. This is also bad.

I think this is the reason it was recommended to use MongoDB's map-reduce:
it can use its locality information internally. I had this same issue with
Couchbase a couple of years back; it's unfortunate, but it's the reality.
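
To make the cost concrete, here is a rough sketch (an illustration, not code
from this thread) of the single-partition-then-repartition pattern described
above, using the mongo-hadoop connector's MongoInputFormat. The URI and the
partition count are hypothetical, and how many partitions the load actually
starts with depends on whether the input format can split the query:

    import com.mongodb.hadoop.MongoInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;

    public class LocalitySketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("mongo-locality-sketch"));

            Configuration conf = new Configuration();
            conf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.events");

            // Results come back through the client API with no locality hints,
            // so Spark cannot schedule these tasks on the nodes holding the data.
            JavaPairRDD<Object, BSONObject> results = sc.newAPIHadoopRDD(
                    conf, MongoInputFormat.class, Object.class, BSONObject.class);
            System.out.println("partitions before: " + results.partitions().size());

            // If the results arrive in a single partition, spreading them out
            // costs a second trip over the network: the shuffle described above.
            JavaPairRDD<Object, BSONObject> spread = results.repartition(64);
            System.out.println("partitions after: " + spread.partitions().size());

            sc.stop();
        }
    }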




On Fri, Sep 11, 2015 at 9:34 AM, Sandeep Giri <sa...@knowbigdata.com>
wrote:

> I think it should be possible by loading each collection as an RDD and then
> doing a union on them.
>
> Regards,
> Sandeep Giri,
> +1 347 781 4573 (US)
> +91-953-899-8962 (IN)
>
> www.KnowBigData.com
> Phone: +1-253-397-1945 (Office)
>
>
>
> On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek <
> Abhishek.Mishra@xerox.com> wrote:
>
>> Anything using Spark RDDs?
>>
>>
>>
>> Abhishek
>>
>>
>>
>> *From:* Sandeep Giri [mailto:sandeep@knowbigdata.com]
>> *Sent:* Friday, September 11, 2015 3:19 PM
>> *To:* Mishra, Abhishek; user@spark.apache.org; dev@spark.apache.org
>> *Subject:* Re: MongoDB and Spark
>>
>>
>>
>> use map-reduce.
>>
>>
>>
>> On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek <Ab...@xerox.com>
>> wrote:
>>
>> Hello,
>>
>>
>>
>> Is there any way to query multiple collections from MongoDB using Spark
>> and Java? I want to create only one Configuration object. Please help
>> if anyone has something regarding this.
>>
>>
>>
>>
>>
>> Thank You
>>
>> Abhishek
>>
>>
>

Re: MongoDB and Spark

Posted by Sandeep Giri <sa...@knowbigdata.com>.
I think it should be possible by loading each collection as an RDD and then
doing a union on them.
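
A rough sketch of that idea (my reading of it, assuming the mongo-hadoop
connector and hypothetical database/collection names): clone one base
Configuration per collection, load each collection as an RDD, then union
them. This also answers the "only one Configuration object" part of the
original question, since every RDD derives from the same base Configuration:

    import com.mongodb.hadoop.MongoInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;

    public class MultiCollectionUnion {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("mongo-union"));

            // One base Configuration, cloned per collection so that each RDD
            // reads from a different mongo.input.uri.
            Configuration base = new Configuration();

            JavaPairRDD<Object, BSONObject> combined = null;
            for (String coll : new String[]{"orders", "invoices"}) {
                Configuration conf = new Configuration(base);
                conf.set("mongo.input.uri",
                         "mongodb://localhost:27017/mydb." + coll);
                JavaPairRDD<Object, BSONObject> rdd = sc.newAPIHadoopRDD(
                        conf, MongoInputFormat.class, Object.class, BSONObject.class);
                combined = (combined == null) ? rdd : combined.union(rdd);
            }

            System.out.println("total documents: " + combined.count());
            sc.stop();
        }
    }

Note that, as Corey's reply above points out, the union itself does nothing
for data locality; it only stitches the collections into a single RDD.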

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com
Phone: +1-253-397-1945 (Office)



On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek <Abhishek.Mishra@xerox.com> wrote:

> Anything using Spark RDDs?
>
>
>
> Abhishek
>
>
>
> *From:* Sandeep Giri [mailto:sandeep@knowbigdata.com]
> *Sent:* Friday, September 11, 2015 3:19 PM
> *To:* Mishra, Abhishek; user@spark.apache.org; dev@spark.apache.org
> *Subject:* Re: MongoDB and Spark
>
>
>
> use map-reduce.
>
>
>
> On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek <Ab...@xerox.com>
> wrote:
>
> Hello,
>
>
>
> Is there any way to query multiple collections from MongoDB using Spark
> and Java? I want to create only one Configuration object. Please help
> if anyone has something regarding this.
>
>
>
>
>
> Thank You
>
> Abhishek
>
>

RE: MongoDB and Spark

Posted by "Mishra, Abhishek" <Ab...@xerox.com>.
Anything using Spark RDDs?

Abhishek

From: Sandeep Giri [mailto:sandeep@knowbigdata.com]
Sent: Friday, September 11, 2015 3:19 PM
To: Mishra, Abhishek; user@spark.apache.org; dev@spark.apache.org
Subject: Re: MongoDB and Spark


use map-reduce.

On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek <Ab...@xerox.com> wrote:
Hello,

Is there any way to query multiple collections from MongoDB using Spark and Java? I want to create only one Configuration object. Please help if anyone has something regarding this.


Thank You
Abhishek