Posted to user@spark.apache.org by Deepesh Maheshwari <de...@gmail.com> on 2015/08/31 08:56:17 UTC

Slow Mongo Read from Spark

Hi, I am trying to read MongoDB in Spark using newAPIHadoopRDD.

/**** Code *****/

config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
config.set("mongo.input.uri",SparkProperties.MONGO_OUTPUT_URI);
config.set("mongo.input.query","{host: 'abc.com'}");

JavaSparkContext sc=new JavaSparkContext("local", "MongoOps");

        JavaPairRDD<Object, BSONObject> mongoRDD =
sc.newAPIHadoopRDD(config,
                com.mongodb.hadoop.MongoInputFormat.class, Object.class,
                BSONObject.class);

        long count=mongoRDD.count();

There are about 1.5 million records.
I do get the data, but the read operation took around 15 minutes to read all of it.

Is this API really that slow, or am I missing something?
Please suggest if there is a faster approach to read data from Mongo.

Thanks,
Deepesh

Re: Slow Mongo Read from Spark

Posted by Deepesh Maheshwari <de...@gmail.com>.
Because of the existing architecture, I am bound to use MongoDB.

Please suggest what I can do within that constraint.


Re: Slow Mongo Read from Spark

Posted by Jörn Franke <jo...@gmail.com>.
You might think about another storage layer instead of MongoDB (HDFS + ORC +
compression, or HDFS + Parquet + compression) to improve performance.
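
A rough, untested sketch of that idea, reusing the mongoRDD from the first mail
(Spark 1.4 DataFrame API; the column names and the output path below are made up
for illustration):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.bson.BSONObject;
import scala.Tuple2;

SQLContext sqlContext = new SQLContext(sc);

// Pull only the fields you actually need out of each BSON document.
JavaRDD<Row> rows = mongoRDD.map(new Function<Tuple2<Object, BSONObject>, Row>() {
    public Row call(Tuple2<Object, BSONObject> t) {
        BSONObject doc = t._2();
        return RowFactory.create(String.valueOf(doc.get("_id")),
                String.valueOf(doc.get("host")));
    }
});

StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("id", DataTypes.StringType, true),
        DataTypes.createStructField("host", DataTypes.StringType, true)));

// Write a compressed columnar copy to HDFS; later jobs read this instead of Mongo.
DataFrame df = sqlContext.createDataFrame(rows, schema);
df.write().parquet("hdfs:///warehouse/events_parquet");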


Re: Slow Mongo Read from Spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
On an SSD you will get around 30-40 MB/s on a single machine (on 4 cores).
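
Also worth checking: the snippet in the first mail creates the context with "local",
which runs on a single core. An untested variation with more parallelism and smaller
input splits (mongo.input.split_size and mongo.input.fields are mongo-hadoop options;
treat the exact values as placeholders):

// Use every local core instead of just one.
JavaSparkContext sc = new JavaSparkContext("local[*]", "MongoOps");

Configuration config = new Configuration();
config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
config.set("mongo.input.query", "{host: 'abc.com'}");
// Smaller splits (in MB) mean more read tasks that can run in parallel.
config.set("mongo.input.split_size", "8");
// Project only the fields you need so less data crosses the wire.
config.set("mongo.input.fields", "{host: 1}");

JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
        com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);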

Thanks
Best Regards


Re: Slow Mongo Read from Spark

Posted by Deepesh Maheshwari <de...@gmail.com>.
Tried it; it gives the same exception as before:

Exception in thread "main" java.io.IOException: No FileSystem for scheme:
mongodb

In your case, did you use the above code?
What read throughput do you get?


Re: Slow Mongo Read from Spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
FYI, newAPIHadoopFile and newAPIHadoopRDD both use the NewHadoopRDD class
underneath, so neither is limited to reading from HDFS. Give it a shot if you
haven't tried it already (only the input format and the record reader differ
from your approach).
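
For example, with the mongo-hadoop formats the two entry points look like this
(a sketch; the Configuration objects and the dump path are placeholders):

// Live collection: newAPIHadoopRDD + MongoInputFormat reads through mongod,
// driven by "mongo.input.uri" in the Configuration.
JavaPairRDD<Object, BSONObject> liveRdd = sc.newAPIHadoopRDD(mongoConfig,
        com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);

// Static dump: newAPIHadoopFile + BSONFileInputFormat reads mongodump .bson files
// from any path Hadoop can open (hdfs://, file://, ...).
JavaPairRDD<Object, BSONObject> dumpRdd = sc.newAPIHadoopFile(
        "hdfs:///dumps/collection.bson",
        com.mongodb.hadoop.BSONFileInputFormat.class, Object.class,
        BSONObject.class, bsonConfig);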

Thanks
Best Regards


Re: Slow Mongo Read from Spark

Posted by Deepesh Maheshwari <de...@gmail.com>.
Hi Akhil,

This code snippet is from the link below:
https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java

There it reads data from the HDFS file system, but in our case I need to read
from MongoDB.

I tried it earlier and have now tried it again, but it gives the error below,
which is self-explanatory:

Exception in thread "main" java.io.IOException: No FileSystem for scheme:
mongodb


Re: Slow Mongo Read from Spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Here's a piece of code which works well for us (Spark 1.4.1):

        // Input side: read BSON data via the mongo-hadoop BSONFileInputFormat
        Configuration bsonDataConfig = new Configuration();
        bsonDataConfig.set("mongo.job.input.format",
                "com.mongodb.hadoop.BSONFileInputFormat");

        // Output side: write results back to MongoDB
        Configuration predictionsConfig = new Configuration();
        predictionsConfig.set("mongo.output.uri", mongodbUri);

        JavaPairRDD<Object, BSONObject> bsonRatingsData = sc.newAPIHadoopFile(
                ratingsUri, BSONFileInputFormat.class, Object.class,
                BSONObject.class, bsonDataConfig);
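
One assumption worth spelling out: BSONFileInputFormat reads mongodump-style
.bson files, so ratingsUri should point at a path on a filesystem Hadoop
understands, not at a mongodb:// URI, e.g.

        // Hypothetical example value: hdfs:// or file:// paths work here,
        // while "mongodb://..." gives "No FileSystem for scheme: mongodb".
        String ratingsUri = "hdfs:///data/dump/ratings.bson";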


Thanks
Best Regards


Re: Slow Mongo Read from Spark

Posted by Deepesh Maheshwari <de...@gmail.com>.
Hi, I am using <spark.version>1.3.0</spark.version>

I am not getting a constructor for the above values.

[image: Inline image 1]

So, I tried to shuffle the values in the constructor.
[image: Inline image 2]

But it is giving this error. Please suggest.
[image: Inline image 3]

Best Regards


Re: Slow Mongo Read from Spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Can you try with these key/value classes and see how the performance compares?

inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"


keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"


Taken from the Databricks blog:
<https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html>
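
One caveat if this is wired up through the Java API rather than PySpark: there
the key/value Class arguments must match the InputFormat's generic types
(MongoInputFormat is, as far as I know, declared against Object/BSONObject), so
as a sketch:

// Lines up with MongoInputFormat's generics and compiles:
JavaPairRDD<Object, BSONObject> rdd = sc.newAPIHadoopRDD(config,
        com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);

// Pairing MongoInputFormat with Text/MapWritable does not line up, so the compiler
// reports that no suitable method (or constructor) can be found:
// JavaPairRDD<Text, MapWritable> rdd2 = sc.newAPIHadoopRDD(config,
//         com.mongodb.hadoop.MongoInputFormat.class, Text.class, MapWritable.class);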

Thanks
Best Regards
