Posted to dev@gora.apache.org by Julien Nioche <li...@gmail.com> on 2013/09/18 11:47:48 UTC

Fwd: 2.x vs. 1.x speed

Including dev@gora.apache.org as not all of you are on the Nutch lists ;-)

Julien

---------- Forwarded message ----------
From: Julien Nioche <li...@gmail.com>
Date: 16 September 2013 17:43
Subject: Re: 2.x vs. 1.x speed
To: "user@nutch.apache.org" <us...@nutch.apache.org>, "dev@nutch.apache.org"
<de...@nutch.apache.org>
Cc: Otis Gospodnetic <ot...@yahoo.com>


Guys,

Following the discussion we had some time ago about comparing 1.x with 2.x,
we did some tests and put the results on

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html

Feel free to comment.

Best,

Julien


On 24 August 2013 05:51, Lewis John Mcgibbney <le...@gmail.com> wrote:

> I am sure that Renato (if he is watching) can chime in as well.
> In Gora, for stores that are Hadoop-native in every sense of the word, such
> as Avro, HBase and Accumulo, executing a query with GoraInputFormat via
> getPartitions returns GoraInputSplits natively, which means splits are
> obtained for MapReduce jobs... such as many of the jobs we run in Nutch
> as well. On the other hand, stores such as Cassandra and web service
> stores such as DynamoDB do not (currently) support Hadoop out of the box
> (we are working on the former and hope to have it implemented in Gora
> soon), so it is not as simple to get partitions in the same way we would
> in a Hadoop-native store. We therefore obtain one partition to be used as
> an InputSplit for the MR job. This is certainly an area of concern and
> right now a bottleneck for some operations. We continue to work on this.
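
To make the single-partition bottleneck concrete, here is a minimal toy model in Python. It is not Gora code: get_partitions below is a hypothetical stand-in that only borrows the name, and the row and partition counts are made up. It assumes one map task per partition and shows how much one task has to scan in each case.

```python
import math
from typing import List, Tuple

def get_partitions(num_rows: int, num_partitions: int) -> List[Tuple[int, int]]:
    """Toy stand-in (not the Gora API): split rows [0, num_rows) into contiguous ranges."""
    if num_rows == 0:
        return []
    size = math.ceil(num_rows / num_partitions)
    return [(start, min(start + size, num_rows)) for start in range(0, num_rows, size)]

def largest_task(partitions: List[Tuple[int, int]]) -> int:
    """Rows scanned by the slowest map task, assuming one task per partition."""
    return max(end - start for start, end in partitions)

# Hypothetical numbers: a store that exposes 8 partitions vs. one that exposes 1.
native = get_partitions(1_000_000, 8)
single = get_partitions(1_000_000, 1)

print(len(native), largest_task(native))
print(len(single), largest_task(single))
```

With a single partition the whole table goes through one map task, so adding nodes to the cluster does not shorten the scan.
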
>
>
> On Wednesday, August 7, 2013, Julien Nioche <lists.digitalpebble@gmail.com> wrote:
> > Hi Otis
> >
> > Definitely *not* the fetching speed. Actually everything but the
> > fetching speed. The fetcher is pretty much the same as in 1.x and anyway
> > the performance with fetching is pretty much always limited by the
> > politeness settings, not the implementation.
> >
> > Re backends: some backend implementations are more mature than others.
> > The one for HBase is probably the most widely used, the Cassandra one
> > has been greatly improved, in particular performance-wise, the SQL one
> > is broken, etc. We need to measure this, as it is just a gut feeling at
> > this stage.
> >
> > Now for what is slower and why: again, this has to be measured, but I
> > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > entries is not done by the backends (some might provide a way of doing
> > it) but on the client side, when we create the input for mapred. In
> > other words, we pull things from the backend just to discard them. Since
> > 2.x does not have segments like 1.x (which the fetch + parse mapreduce
> > jobs take as single input), we scan the whole table even if we want to
> > fetch or parse only a handful of entries.
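
The cost of filtering on the client side can be illustrated with a small Python sketch. This is a toy model rather than Nutch or Gora code: the table, the status values and both filter functions are hypothetical, and backend_filter simply assumes a store that can evaluate the predicate itself. It counts how many rows leave the store in each case.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]

def make_table(total: int, wanted_every: int) -> List[Row]:
    """Build `total` rows; every `wanted_every`-th row is marked unfetched."""
    return [
        {"url": f"http://example.com/{i}",
         "status": "unfetched" if i % wanted_every == 0 else "fetched"}
        for i in range(total)
    ]

def client_side_filter(table: List[Row], wanted: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Pull every row out of the store, then discard the unwanted ones locally."""
    transferred = 0
    kept = []
    for row in table:              # full table scan: every row leaves the store
        transferred += 1
        if wanted(row):
            kept.append(row)
    return kept, transferred

def backend_filter(table: List[Row], wanted: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Hypothetical push-down: the store applies the predicate, only matches leave it."""
    kept = [row for row in table if wanted(row)]
    return kept, len(kept)

def is_unfetched(row: Row) -> bool:
    return row["status"] == "unfetched"

table = make_table(total=10_000, wanted_every=100)
kept_client, moved_client = client_side_filter(table, is_unfetched)
kept_backend, moved_backend = backend_filter(table, is_unfetched)

print(moved_client, moved_backend)
```

Both approaches end up with the same rows; the difference is only in how much data is moved to get them.
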
> >
> > On the other hand, 2.x specifies what columns to retrieve for a given
> > job, whereas 1.x will for instance deserialize the crawldatum entirely.
> > The metadata objects are costly to read/write so 2.x might have the
> > upper hand from that point of view since it pulls and deserializes only
> > what it needs.
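
A rough Python illustration of why retrieving only some columns can be cheaper. It is a sketch under stated assumptions, not the actual storage format: JSON stands in for the real serialization, and the column names and metadata sizes are invented. Each column is serialized on its own, and the reader decodes only what a job asks for.

```python
import json
from typing import Dict, Iterable, Tuple

def make_stored_row() -> Dict[str, str]:
    """One row whose columns are serialized separately (JSON is a stand-in format)."""
    row = {
        "url": "http://example.com/",
        "status": "unfetched",
        "metadata": {f"key{k}": "x" * 50 for k in range(20)},  # deliberately bulky
    }
    return {column: json.dumps(value) for column, value in row.items()}

def read(stored: Dict[str, str], columns: Iterable[str]) -> Tuple[dict, int]:
    """Decode only the requested columns; return the row and the characters decoded."""
    wanted = list(columns)
    decoded = {c: json.loads(stored[c]) for c in wanted}
    size = sum(len(stored[c]) for c in wanted)
    return decoded, size

stored = make_stored_row()
whole_row, whole_size = read(stored, stored.keys())               # decode everything
projected_row, projected_size = read(stored, ["url", "status"])   # decode two columns

print(whole_size, projected_size)
```

The projected read never touches the metadata column, which is where most of the bytes are in this made-up row.
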
> >
> > Finally the most costly steps in a large crawl in 1.x are the generation
> > and update as we have to read/write the crawldb entirely. The way the
> > updates are done in 2.x is different and should be a lot faster.
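
As a back-of-the-envelope model of the difference in update cost, here is a Python sketch. It rests on an assumption the message does not spell out, namely that a table-backed update only writes the rows that changed; both functions and all numbers are hypothetical, and read costs are ignored.

```python
from typing import Dict, Tuple

def update_by_rewrite(crawldb: Dict[str, str], changes: Dict[str, str]) -> Tuple[Dict[str, str], int]:
    """Toy model of rewriting the crawldb: merge the changes into a complete new copy."""
    new_db = {**crawldb, **changes}
    return new_db, len(new_db)          # every record is written again

def update_in_place(table: Dict[str, str], changes: Dict[str, str]) -> Tuple[Dict[str, str], int]:
    """Toy model of a table-backed update (assumed): only the changed rows are written."""
    table.update(changes)
    return table, len(changes)

crawldb = {f"http://example.com/{i}": "unfetched" for i in range(100_000)}
changes = {f"http://example.com/{i}": "fetched" for i in range(500)}

rewritten, written_rewrite = update_by_rewrite(crawldb, changes)
updated, written_in_place = update_in_place(dict(crawldb), changes)

print(written_rewrite, written_in_place)
```

Both end states are identical; in this model the rewrite writes every record while the in-place update writes only the 500 that changed. Real numbers would depend on the backend and would need to be measured.
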
> >
> > Please could anyone correct me if I am wrong. Some of this is based on my
> > understanding of 2.x, which dates back quite a while, and some of the
> > stuff might have changed in the meantime. The performance would probably
> > vary a lot based on the fine tuning of each backend implementation but
> > having some basic comparison would confirm some of the assertions above.
> >
> > Julien
> >
> >
> > [1] https://issues.apache.org/jira/browse/GORA-119
> >
> >
> >> Julien, could you please elaborate a bit on your comment about speed
> >> depending on the backend used?
> >>
> >> Yes, you were the person I was referring to :)
> >>
> >> Oh, and I *believe* you said it was the fetching speed that was different
> >> between 1.x and 2.x.  Is that right?  Or is some other phase slower in
> >> 2.x?
> >>
> >> Thanks,
> >> Otis
> >> ----
> >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> >> http://sematext.com/spm
> >>
> >>
> >>
> >>
> >> >________________________________
> >> > From: Julien Nioche <li...@gmail.com>
> >> >To: "user@nutch.apache.org" <us...@nutch.apache.org>
> >> >Sent: Tuesday, August 6, 2013 10:54 AM
> >> >Subject: Re: 2.x vs. 1.x speed
> >> >
> >> >
> >> >Hi Otis,
> >> >
> >> >That certainly depends on the backend used but on the whole it wouldn't
> >> >be surprising. Would be good to have some data to substantiate it. I am
> >> >planning to put my intern on the case and have some basic comparison as
> >> >soon as she gets a good grip of Hadoop / Nutch etc... but if someone
> >> >else wants to do it please go ahead.
> >> >
> >> >In case I happen to be the person who told you that Otis, well at least
> >> >I am consistent ;-)
> >> >
> >> >Julien
> >> >
> >> >On 6 August 2013 09:08, Otis Gospodnetic <ot...@gmail.com> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> At some point earlier this year I spoke to a person who told me 2.x
> >> >> is (a little?) slower than 1.x.  Is that still the case?
> >> >>
> >> >> Thanks,
> >> >> Otis
> >> >> --
> >> >> Solr & ElasticSearch Support -- http://sematext.com/
> >> >> Performance Monitoring -- http://sematext.com/spm
> >> >>
> >> >
> >> >
> >> >
> >> >--
> >> >*
> >> >*Open Source Solutions for Text Engineering
> >> >
> >> >http://digitalpebble.blogspot.com/
> >> >http://www.digitalpebble.com
> >> >http://twitter.com/digitalpebble
> >> >
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
> --
> *Lewis*
>



-- 
*
*
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble




Re: 2.x vs. 1.x speed

Posted by Henry Saputra <he...@gmail.com>.
Thanks for sharing Julien, really informative.

We need to start profiling the performance of each Gora data store
and hopefully take advantage of advanced features of each backend
store while still providing a good abstraction for applications that
use Gora as in-memory access to the underlying data.


- Henry
