
Posted to dev@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2013/09/16 18:43:58 UTC

Re: 2.x vs. 1.x speed

Guys,

Following the discussion we had some time ago about comparing 1.x with 2.x,
we did some tests and put the results on

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html

Feel free to comment.

Best,

Julien


On 24 August 2013 05:51, Lewis John Mcgibbney <le...@gmail.com> wrote:

> I am sure that Renato (if he is watching) can chime in as well.
> We find in Gora that for Hadoop-native stores such as Avro, HBase and
> Accumulo, when we execute a query with GoraInputFormat, via getPartitions
> we retrieve GoraInputSplits natively, which means splits are obtained for
> MapReduce jobs... such as many of the jobs we run in Nutch. On the other
> hand, stores such as Cassandra and web service stores such as DynamoDB do
> not (currently) support Hadoop out of the box (we are working on the former
> and hope to have it implemented in Gora soon), so it is not as simple to
> get partitions in the same way we would in a Hadoop-native store. We
> therefore obtain one partition to be used as an InputSplit for the MR job.
> This is certainly an area of concern and right now a bottleneck for some
> operations. We continue to work on this.
>
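To illustrate why a single partition can become a bottleneck, here is a minimal sketch in Python. It is not Gora code: `Split`, `plan_splits` and `partition_size` are hypothetical names, and the model simply assumes one map task per split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """A key range [start, end) handed to one map task (hypothetical type)."""
    start: int
    end: int


def plan_splits(total_rows, partition_size=None):
    """Return the input splits for a scan over total_rows rows.

    partition_size=None models a store that cannot report its partitions:
    the whole table becomes one split, so a single map task reads it all.
    """
    if partition_size is None:
        return [Split(0, total_rows)]
    return [Split(start, min(start + partition_size, total_rows))
            for start in range(0, total_rows, partition_size)]


native = plan_splits(1_000_000, partition_size=250_000)
fallback = plan_splits(1_000_000)
print(len(native), len(fallback))  # 4 splits versus 1 split
```

With one split the job cannot be parallelised across map tasks, whatever the size of the cluster.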
>
> On Wednesday, August 7, 2013, Julien Nioche <lists.digitalpebble@gmail.com> wrote:
> > Hi Otis
> >
> > Definitely *not* the fetching speed. Actually everything *but* the
> > fetching speed. The fetcher is pretty much the same as in 1.x and anyway
> > fetching performance is pretty much always limited by the politeness
> > settings, not the implementation.
> >
> > Re backends: some backend implementations are more mature than others.
> > The one for HBase is probably the most widely used, the Cassandra one has
> > been greatly improved, in particular performance-wise, the SQL one is
> > broken, etc. We need to measure this as it is just a gut feeling at this
> > stage.
> >
> > Now for what is slower and why: again this has to be measured, but I
> > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > entries is not done by the backends (some might provide a way of doing it)
> > but on the client side, when we create the input for mapred. In other
> > words we pull things from the backend just to discard them. Since 2.x does
> > not have segments like 1.x (which the fetch + parse mapreduce jobs take as
> > a single input), we scan the whole table even if we want to fetch or parse
> > only a handful of entries.
> >
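The cost of client-side filtering can be sketched as follows. This is an illustrative Python model rather than Nutch or Gora code; the `batch` field, the table layout and both function names are assumptions made for the example.

```python
def client_side_filter(table, wanted):
    """Pull every row from the backend, then discard the unwanted ones."""
    transferred = list(table)                      # every row crosses the wire
    kept = [row for row in transferred if wanted(row)]
    return kept, len(transferred)


def backend_side_filter(table, wanted):
    """Model a backend that applies the filter before sending rows."""
    transferred = [row for row in table if wanted(row)]
    return transferred, len(transferred)


# Hypothetical table: 1000 URLs, of which only 5 belong to the batch to fetch.
table = [{"url": f"http://example.com/{i}", "batch": "b1" if i < 5 else None}
         for i in range(1000)]


def in_batch(row):
    return row["batch"] == "b1"


kept_client, moved_client = client_side_filter(table, in_batch)
kept_backend, moved_backend = backend_side_filter(table, in_batch)
print(len(kept_client), moved_client)    # 5 rows kept, 1000 rows transferred
print(len(kept_backend), moved_backend)  # 5 rows kept, 5 rows transferred
```

Both approaches return the same rows; the difference is how much data is read and moved to obtain them.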
> > On the other hand, 2.x specifies which columns to retrieve for a given
> > job, whereas 1.x will, for instance, deserialize the CrawlDatum entirely.
> > The metadata objects are costly to read/write, so 2.x might have the upper
> > hand from that point of view since it pulls and deserializes only what it
> > needs.
> >
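A rough sketch of why reading only the requested columns can be cheaper. The row layout, the JSON encoding and the names `stored` and `read` are assumptions for illustration only and do not reflect the actual serialization used by Nutch or Gora.

```python
import json

# Hypothetical crawl record: two small fields plus a bulky metadata map.
ROW = {
    "status": 2,
    "fetch_time": 1378000000,
    "metadata": {f"k{i}": "v" * 50 for i in range(100)},
}


def stored(row):
    """Serialize each column separately, as a column-oriented store might."""
    return {name: json.dumps(value) for name, value in row.items()}


def read(columns, fields=None):
    """Decode the requested fields (all when fields is None).

    Returns the decoded values and the number of characters deserialized.
    """
    names = list(columns) if fields is None else fields
    decoded, cost = {}, 0
    for name in names:
        cost += len(columns[name])
        decoded[name] = json.loads(columns[name])
    return decoded, cost


on_disk = stored(ROW)
full_row, full_cost = read(on_disk)
projected, projected_cost = read(on_disk, fields=["status"])
print(projected, projected_cost < full_cost)
```

A job that only needs `status` never touches the metadata column, which is where most of the deserialization cost sits in this model.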
> > Finally, the most costly steps in a large crawl in 1.x are the generation
> > and update, as we have to read/write the crawldb entirely. The way the
> > updates are done in 2.x is different and should be a lot faster.
> >
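The cost of rewriting the crawldb entirely can be contrasted with keyed writes in a small Python model. This is only a sketch: the thread does not describe the actual 2.x update mechanism, so `update_in_place` is an assumption about what a table-backed store permits, and both function names are hypothetical.

```python
def update_by_rewrite(crawldb, updates):
    """Merge updates by writing out a complete new copy of the db."""
    new_db, written = {}, 0
    for url, status in crawldb.items():
        new_db[url] = updates.get(url, status)   # unchanged rows are rewritten too
        written += 1
    for url, status in updates.items():
        if url not in new_db:                    # newly discovered URLs
            new_db[url] = status
            written += 1
    return new_db, written


def update_in_place(table, updates):
    """Write only the rows that changed (assumes keyed writes are available)."""
    written = 0
    for url, status in updates.items():
        table[url] = status
        written += 1
    return written


crawldb = {f"http://example.com/{i}": "unfetched" for i in range(1000)}
updates = {
    "http://example.com/1": "fetched",
    "http://example.com/2": "fetched",
    "http://example.com/new": "unfetched",
}

rewritten, rewrite_cost = update_by_rewrite(crawldb, updates)
table = dict(crawldb)
in_place_cost = update_in_place(table, updates)
print(rewrite_cost, in_place_cost)  # 1001 records written versus 3
```

In the rewrite model the cost grows with the size of the db, in the keyed model with the number of changes.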
> > Could anyone please correct me if I am wrong? Some of this is based on my
> > understanding of 2.x, which dates back quite a while, and some of it
> > might have changed in the meantime. The performance would probably
> > vary a lot with the fine tuning of each backend implementation, but
> > having some basic comparison would confirm some of the assertions above.
> >
> > Julien
> >
> >
> > [1] https://issues.apache.org/jira/browse/GORA-119
> >
> >
> >> Julien, could you please elaborate a bit on your comment about speed
> >> depending on the backend used?
> >>
> >> Yes, you were the person I was referring to :)
> >>
> >> Oh, and I *believe* you said it was the fetching speed that was different
> >> between 1.x and 2.x.  Is that right?  Or is some other phase slower in
> >> 2.x?
> >>
> >> Thanks,
> >> Otis
> >> ----
> >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> >> http://sematext.com/spm
> >>
> >>
> >>
> >>
> >> >________________________________
> >> > From: Julien Nioche <li...@gmail.com>
> >> >To: "user@nutch.apache.org" <us...@nutch.apache.org>
> >> >Sent: Tuesday, August 6, 2013 10:54 AM
> >> >Subject: Re: 2.x vs. 1.x speed
> >> >
> >> >
> >> >Hi Otis,
> >> >
> >> >That certainly depends on the backend used but on the whole it wouldn't
> >> >be surprising. Would be good to have some data to substantiate it. I am
> >> >planning to put my intern on the case and have some basic comparison as
> >> >soon as she gets a good grip of Hadoop / Nutch etc... but if someone else
> >> >wants to do it please go ahead.
> >> >
> >> >In case I happen to be the person who told you that, Otis, well at least I
> >> >am consistent ;-)
> >> >
> >> >Julien
> >> >
> >> >On 6 August 2013 09:08, Otis Gospodnetic <ot...@gmail.com> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> At some point earlier this year I spoke to a person who told me 2.x is
> >> >> (a little?) slower than 1.x.  Is that still the case?
> >> >>
> >> >> Thanks,
> >> >> Otis
> >> >> --
> >> >> Solr & ElasticSearch Support -- http://sematext.com/
> >> >> Performance Monitoring -- http://sematext.com/spm
> >> >>
> >> >
> >> >
> >> >
> >> >--
> >> >Open Source Solutions for Text Engineering
> >> >
> >> >http://digitalpebble.blogspot.com/
> >> >http://www.digitalpebble.com
> >> >http://twitter.com/digitalpebble
> >> >
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
> --
> *Lewis*
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: 2.x vs. 1.x speed

Posted by Tejas Patil <te...@gmail.com>.

This is awesome Julien :) Thanks for sharing !!



Re: 2.x vs. 1.x speed

Posted by Henry Saputra <he...@gmail.com>.

Thanks for sharing Julien, really informative.

We need to start profiling the performance of each Gora data store,
and hopefully we can take advantage of the advanced features of each backend
store while still providing a good abstraction for applications that use
Gora as in-memory access to the underlying data.


- Henry

On Wed, Sep 18, 2013 at 2:47 AM, Julien Nioche
<li...@gmail.com> wrote:
> Including dev@gora.apache.org as not all of you are on the Nutch lists ;-)
>
> Julien

Fwd: 2.x vs. 1.x speed

Posted by Julien Nioche <li...@gmail.com>.

Including dev@gora.apache.org as not all of you are on the Nutch lists ;-)

Julien

---------- Forwarded message ----------
From: Julien Nioche <li...@gmail.com>
Date: 16 September 2013 17:43
Subject: Re: 2.x vs. 1.x speed
To: "user@nutch.apache.org" <us...@nutch.apache.org>, "dev@nutch.apache.org"
<de...@nutch.apache.org>
Cc: Otis Gospodnetic <ot...@yahoo.com>



Re: 2.x vs. 1.x speed

Posted by Julien Nioche <li...@gmail.com>.

Hi Renato

Great to hear from you

On 16 September 2013 18:42, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Thanks for sharing Julien! These are indeed interesting results.
> Just a quick question, did you use a single server to run this? or did you
> set up a minimum number of servers for it?


As explained in the blog, this is in pseudo-distributed mode, i.e. a single
server.

> this is because HBase or
> Cassandra will improve their latency if we scale them out.
>

See the conclusion of my post. I pointed at a number of possible
explanations, mostly to do with GORA. Scaling out would also make 1.x
faster :-) The question is whether there is a size of the crawldb / number
of machines where the balance would change.

Can you explain why processing a smaller db on a single node with
Nutch 2 would take proportionally longer than a larger db on a larger
cluster?

Thanks

Julien



>
>
> Renato M.
>
>
> 2013/9/16 Markus Jelsma <ma...@openindex.io>
>
> > Thanks! That was interesting.
> >
> > -----Original message-----
> > From: Julien Nioche<li...@gmail.com>
> > Sent: Monday 16th September 2013 18:45
> > To: user@nutch.apache.org; dev@nutch.apache.org
> > Cc: Otis Gospodnetic <ot...@yahoo.com>
> > Subject: Re: 2.x vs. 1.x speed
> >
> > Guys,
> >
> > Following the discussion we had some time ago about comparing 1.x with
> > 2.x, we did dome tests and put the results on
> >
> > http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html <
> > http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html>
> >
> > Feel free to comment.
> >
> > Best,
> >
> > Julien
> >
> > On 24 August 2013 05:51, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com
> <mailto:
> > lewis.mcgibbney@gmail.com>> wrote:
> >
> > I am sure that Renato (if he is watching) can plugin maybe as well.
> >
> > We find in Gora that in every sense of the word, native Hadoop stores
> such
> >
> > as Avro, HBase and  Accumulo when we execute a query with GiraInputFormat
> >
> > via getParitions we retrieve GoraInputSplits natively which means splits
> >
> > are obtained for MapReduce jobs... such as many of the jobs we run in
> Nutch
> >
> > as well. On  the other hand (currently) stores such as Cassandra and Web
> >
> > service stores such as DynamoDB do not support Hadoop out of the box (the
> >
> > former we are working on and hope to  have implemented in Gora soon)
> >
> > therefore it is not as simple to get partitions in the same way we would
> in
> >
> > a Hadoop native store. We therefore obtain one partition to be used as an
> >
> > InputSplit for the MR job. This is certainly an area for concern and
> right
> >
> > now a bottleneck for some operations. We continue to work on this.
> >
> > On Wednesday, August 7, 2013, Julien Nioche <
> lists.digitalpebble@gmail.com<mailto:
> > lists.digitalpebble@gmail.com>>
> >
> > wrote:
> >
> > > Hi Otis
> >
> > >
> >
> > > Definitely *not *the fetching speed. Actually everything but *not* the
> >
> > > fetching speed. The fetcher is pretty much the same as 1.x and anyway
> the
> >
> > > performance with fetching is pretty much always limited by the
> politeness
> >
> > > settings, not the implementation.
> >
> > >
> >
> > > Re-backend : some backend implementations are more mature than others.
> > The
> >
> > > one for HBase is probably the one most widely used, the Cassandra one
> has
> >
> > > been greatly improved in particular performance-wise , the SQL one is
> >
> > > broken etc... we need to measure this as this is just a gut feeling at
> >
> > this
> >
> > > stage
> >
> > >
> >
> > > Now for  what is slower and why, again this has to be measured but I
> >
> > expect
> >
> > > 2.x to be slower partly because of [1], i.e. the filtering of entries
> is
> >
> > > not done by the backends (some might provide a way of doing it) but
> this
> >
> > is
> >
> > > done on the client side, when we create the input for mapred. In other
> >
> > > words we pull things from the backend just to discard it. Since 2.x
> does
> >
> > > not have segments like 1.x (which the fetch + parse mapreduce jobs take
> > as
> >
> > > single input) we scan the whole table even if we want to fetch or
> parse a
> >
> > > handful of entries.
> >
> > >
> >
> > > On the other hand, 2.x specifies what columns to retrieve for a given
> > job,
> >
> > > whereas 1.x will for instance deserialize the crawldatum entirely. The
> >
> > > metadata objects are costly to read/write so 2.x might have the upper
> > hand
> >
> > > from that point of view since it pulls and deserializes only what it
> >
> > needs.
> >
> > >
> >
> > > Finally the most costly steps in a large crawl in 1.x are the
> generation
> >
> > > and update as we have to read/write the crawldb entirely. The way the
> >
> > > updates are done in 2.x is different and should be a lot faster.
> >
> > >
> >
> > > Please could anyone correct me if I am wrong. Some of this is based on
> my
> >
> > > understanding of 2.x which dates back from quite a while and some of
> the
> >
> > > stuff might have changed in the meantime. The performance would
> probably
> >
> > > vary a lot based on the fine tuning of each backend implementation but
> >
> > > having some basic comparison would confirm some of the assertions
> above.
> >
> > >
> >
> > > Julien
> >
> > >
> >
> > >
> >
> > > [1] https://issues.apache.org/jira/browse/GORA-119 <
> > https://issues.apache.org/jira/browse/GORA-119>
> > >
> > >> Julien, could you please elaborate a bit about your comment about speed
> > >> depending on the backend used?
> > >>
> > >> Yes, you were the person I was referring to :)
> > >>
> > >> Oh, and I *believe* you said it was the fetching speed that was different
> > >> between 1.x and 2.x.  Is that right?  Or is some other phase slower in 2.x?
> > >>
> > >> Thanks,
> > >> Otis
> > >> ----
> > >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> > >> http://sematext.com/spm
> > >>
> > >> >________________________________
> > >> > From: Julien Nioche <lists.digitalpebble@gmail.com>
> > >> >To: "user@nutch.apache.org" <user@nutch.apache.org>
> > >> >Sent: Tuesday, August 6, 2013 10:54 AM
> > >> >Subject: Re: 2.x vs. 1.x speed
> > >> >
> > >> >Hi Otis,
> > >> >
> > >> >That certainly depends on the backend used but on the whole it wouldn't be
> > >> >surprising. Would be good to have some data to substantiate it. I am
> > >> >planning to put my intern on the case and have some basic comparison as
> > >> >soon as she gets a good grip of Hadoop / Nutch etc... but if someone else
> > >> >wants to do it please go ahead.
> > >> >
> > >> >In case I happen to be the person who told you that Otis, well at least I
> > >> >am consistent ;-)
> > >> >
> > >> >Julien
> > >> >
> > >> >On 6 August 2013 09:08, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
> > >> >
> > >> >> Hello,
> > >> >>
> > >> >> At some point earlier this year I spoke to a person who told me 2.x is
> > >> >> (a little?) slower than 1.x.  Is that still the case?
> > >> >>
> > >> >> Thanks,
> > >> >> Otis
> > >> >> --
> > >> >> Solr & ElasticSearch Support -- http://sematext.com/
> > >> >> Performance Monitoring -- http://sematext.com/spm
> > >> >
> > >> >--
> > >> >*
> > >> >*Open Source Solutions for Text Engineering
> > >> >
> > >> >http://digitalpebble.blogspot.com/
> > >> >http://www.digitalpebble.com
> > >> >http://twitter.com/digitalpebble
> > >> >
> > >>
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> > --
> > *Lewis*
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: 2.x vs. 1.x speed

Posted by Julien Nioche <li...@gmail.com>.

Hi Kaveh,

This was a light-hearted comment, no worries ;-) My blog post is meant to
be a starting point for comparisons, improvements and discussions so it is
great to hear your thoughts. Would be great if you could share your
findings indeed.

Best,

Julien




On 16 September 2013 23:50, kaveh minooie <ka...@plutoz.com> wrote:

> :) Believe me, whatever attitude you might have seen in that sentence was
> just my own guilty conscience manifesting itself. Nevertheless, you are
> right and I absolutely apologize for that.
>
> Now I have to say that the reason I haven't really posted anything is not
> just because I am lazy, but because I am not sure how to go about it in a
> way that would be meaningful to whoever is going to read it. Performance in
> a distributed environment is affected by many things, of which few are
> directly related to Nutch. A lot of it has to do with how Hadoop is set up
> (how many map or reduce jobs are being run per core, what the replication
> factor is, whether and what kind of compression is being used, etc.), the
> hardware that is being used, and, if we are using Gora, the performance of
> the storage backend and how it has been set up is also going to be a big
> factor. Not to mention, at least for the current version of Gora, that the
> storage backends that run on top of Hadoop have fundamentally different
> characteristics from the ones that do not, so I am not sure whether a
> head-to-head comparison between just the numbers would be informative or
> just misleading.
>
> What I am trying to say, I guess, is that if people who have more
> experience in creating these kinds of reports could suggest some sort of
> guideline or something, it would be very helpful to me and, I am sure, to
> other people as well, in posting these kinds of numbers. I think that the
> best possible outcome would be to have some sort of 'zoo' section on the
> site which would have all these reports for different scenarios. From my
> own experience, I can say that one of the biggest problems that I had when
> I started using Nutch, and still have to some degree, was that I was never
> sure what I was doing was right because there was never a reference point
> with which I could compare my own results, and if it weren't for this
> fantastic mailing list, I would have been dead.
>
> Also, "realistic" was definitely the wrong word to use. I do agree with
> you, based on what I have seen on the list, that too many people start
> using the 2.x version without having enough data to justify it. This would
> definitely be a very good point to mention, especially on the web site: if
> you don't have more than x number of links to work with, do not use the 2.x
> version, at least not yet.
>
> That being said, I'll start keeping track of my results and I'll share them
> with everyone, hopefully in the near future.
>
> Again, thanks though for posting those numbers.
>
>
>
> On 09/16/2013 12:06 PM, Julien Nioche wrote:
>
>> Hi Kaveh
>>
>>> Finally, someone posted some metrics, thanks Julien.
>>
>> No probs. You could have done the same experiment since you felt it was
>> needed ;-)
>>
>>> I just need to point out, in addition to Renato's question, the size of
>>> the data that you chose to use for the test is not really fair. IMHO, for
>>> 2.x to be somewhat realistic, you're gonna want to have a crawldb with at
>>> least a few hundred million links and a fetch list of, again, at least
>>> 1 or 2 million. What do you guys think?
>>
>> If realistic means close to real usage then you'll find that most people
>> use Nutch on dbs smaller than 3M urls. From that point of view, this
>> experiment is realistic. It is also realistic in the sense that it can
>> be reproduced easily: fetching millions of urls would take a lot of time
>> and having hundreds of millions of pages requires a larger cluster ($$$$).
>>
>> Again, I mentioned in my post that it would be interesting to do it with a
>> larger cluster but at least we can discuss the limitations in design and
>> implementation that Nutch 2 currently has.
>>
>> The main point is that this test was a relative comparison between 2
>> versions, not an absolute benchmark of how long it takes to run a crawl.
>> Knowing how Nutch 2 fares in relation to Nutch 1 is quite useful,
>> especially with new users expecting a more recent version to perform
>> better than the old one.
>>
>> Feel free to try on a larger cluster and dataset and share your results;
>> it will be interesting to see if there is a difference from what I
>> measured on a single machine.
>>
>> Thanks
>>
>> Julien
>>
>>>
>>> On 09/16/2013 10:42 AM, Renato Marroquín Mogrovejo wrote:
>>>
>>>> Thanks for sharing Julien! These are indeed interesting results.
>>>> Just a quick question, did you use a single server to run this? or did
>>>> you set up a minimum number of servers for it? this is because HBase or
>>>> Cassandra will improve their latency if we scale them out.
>>>>
>>>> Renato M.
>>>>
>>>> 2013/9/16 Markus Jelsma <ma...@openindex.io>
>>>>
>>>>> Thanks! That was interesting.
>>>>>
>>>>> -----Original message-----
>>>>> From: Julien Nioche <lists.digitalpebble@gmail.com>
>>>>> Sent: Monday 16th September 2013 18:45
>>>>> To: user@nutch.apache.org; dev@nutch.apache.org
>>>>> Cc: Otis Gospodnetic <ot...@yahoo.com>
>>>>> Subject: Re: 2.x vs. 1.x speed
>>>>>
>>>>> Guys,
>>>>>
>>>>> Following the discussion we had some time ago about comparing 1.x with
>>>>> 2.x, we did some tests and put the results on
>>>>>
>>>>> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>>>>>
>>>>> Feel free to comment.
>>>>>
>>>>> Best,
>>>>>
>>>>> Julien
>>>>>
>>> --
>>> Kaveh Minooie
>>
> --
> Kaveh Minooie
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: 2.x vs. 1.x speed

Posted by kaveh minooie <ka...@plutoz.com>.

:) Believe me, whatever attitude you might have seen in that sentence
was just my own guilty conscience manifesting itself. Nevertheless, you
are right and I absolutely apologize for that.

Now I have to say that the reason I haven't really posted anything is
not just because I am lazy, but because I am not sure how to go about it
in a way that would be meaningful to whoever is going to read it.
Performance in a distributed environment is affected by many things, of
which few are directly related to Nutch. A lot of it has to do with how
Hadoop is set up (how many map or reduce jobs are being run per core,
what the replication factor is, whether and what kind of compression is
being used, etc.), the hardware that is being used, and, if we are using
Gora, the performance of the storage backend and how it has been set up
is also going to be a big factor. Not to mention, at least for the
current version of Gora, that the storage backends that run on top of
Hadoop have fundamentally different characteristics from the ones that
do not, so I am not sure whether a head-to-head comparison between just
the numbers would be informative or just misleading.

What I am trying to say, I guess, is that if people who have more
experience in creating these kinds of reports could suggest some sort of
guideline or something, it would be very helpful to me and, I am sure,
to other people as well, in posting these kinds of numbers. I think that
the best possible outcome would be to have some sort of 'zoo' section on
the site which would have all these reports for different scenarios.
From my own experience, I can say that one of the biggest problems that
I had when I started using Nutch, and still have to some degree, was
that I was never sure what I was doing was right because there was never
a reference point with which I could compare my own results, and if it
weren't for this fantastic mailing list, I would have been dead.

Also, "realistic" was definitely the wrong word to use. I do agree with
you, based on what I have seen on the list, that too many people start
using the 2.x version without having enough data to justify it. This
would definitely be a very good point to mention, especially on the web
site: if you don't have more than x number of links to work with, do not
use the 2.x version, at least not yet.

That being said, I'll start keeping track of my results and I'll share
them with everyone, hopefully in the near future.

Again, thanks though for posting those numbers.


On 09/16/2013 12:06 PM, Julien Nioche wrote:
> Hi Kaveh
>
>> Finally, someone posted some metrics, thanks Julien.
>
> No probs. You could have done the same experiment since you felt it was
> needed ;-)
>
>> I just need to point out, in addition to Renato's question, the size of
>> the data that you chose to use for the test is not really fair. IMHO, for
>> 2.x to be somewhat realistic, you're gonna want to have a crawldb with at
>> least a few hundred million links and a fetch list of, again, at least
>> 1 or 2 million. What do you guys think?
>
> If realistic means close to real usage then you'll find that most people
> use Nutch on dbs smaller than 3M urls. From that point of view, this
> experiment is realistic. It is also realistic in the sense that it can
> be reproduced easily: fetching millions of urls would take a lot of time
> and having hundreds of millions of pages requires a larger cluster ($$$$).
>
> Again, I mentioned in my post that it would be interesting to do it with a
> larger cluster but at least we can discuss the limitations in design and
> implementation that Nutch 2 currently has.
>
> The main point is that this test was a relative comparison between 2
> versions, not an absolute benchmark of how long it takes to run a crawl.
> Knowing how Nutch 2 fares in relation to Nutch 1 is quite useful,
> especially with new users expecting a more recent version to perform better
> than the old one.
>
> Feel free to try on a larger cluster and dataset and share your results; it
> will be interesting to see if there is a difference from what I measured on
> a single machine.
>
> Thanks
>
> Julien
>
>> On 09/16/2013 10:42 AM, Renato Marroquín Mogrovejo wrote:
>>
>>> Thanks for sharing Julien! These are indeed interesting results.
>>> Just a quick question, did you use a single server to run this? or did you
>>> set up a minimum number of servers for it? this is because HBase or
>>> Cassandra will improve their latency if we scale them out.
>>>
>>>
>>> Renato M.
>>>
>>>
>>> 2013/9/16 Markus Jelsma <ma...@openindex.io>
>>>
>>>> Thanks! That was interesting.
>>>>
>>>> -----Original message-----
>>>> From: Julien Nioche <li...@gmail.com>
>>>> Sent: Monday 16th September 2013 18:45
>>>> To: user@nutch.apache.org; dev@nutch.apache.org
>>>> Cc: Otis Gospodnetic <ot...@yahoo.com>
>>>> Subject: Re: 2.x vs. 1.x speed
>>>>
>>>> Guys,
>>>>
>>>> Following the discussion we had some time ago about comparing 1.x with
>>>> 2.x, we did some tests and put the results on
>>>>
>>>> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>>>>
>>>> Feel free to comment.
>>>>
>>>> Best,
>>>>
>>>> Julien
>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>   http://www.digitalpebble.com <http://www.digitalpebble.com>
>>>>>>>
>>>>>>
>>>>   http://twitter.com/**digitalpebble <http://twitter.com/digitalpebble> <
>>>>>>> http://twitter.com/**digitalpebble <http://twitter.com/digitalpebble>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>>>>
>>>>
>>>>>>>
>>>>
>>>>>>>
>>>>
>>>>>>
>>>>
>>>>>
>>>>
>>>>>
>>>>
>>>>>
>>>>   --
>>>>>
>>>>
>>>>   *
>>>>>
>>>>
>>>>   *Open Source Solutions for Text Engineering
>>>>>
>>>>
>>>>
>>>>>
>>>>   http://digitalpebble.blogspot.**com/<http://digitalpebble.blogspot.com/><
>>>>> http://digitalpebble.**blogspot.com/<http://digitalpebble.blogspot.com/>
>>>>>>
>>>>>
>>>>
>>>>   http://www.digitalpebble.com <http://www.digitalpebble.com>
>>>>>
>>>>
>>>>   http://twitter.com/**digitalpebble <http://twitter.com/digitalpebble> <
>>>>> http://twitter.com/**digitalpebble <http://twitter.com/digitalpebble>>
>>>>>
>>>>
>>>>
>>>>>
>>>> --
>>>>
>>>> *Lewis*
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.**com/<http://digitalpebble.blogspot.com/><
>>>> http://digitalpebble.**blogspot.com/<http://digitalpebble.blogspot.com/>
>>>>>
>>>> http://www.digitalpebble.com <http://www.digitalpebble.com>
>>>> http://twitter.com/**digitalpebble <http://twitter.com/digitalpebble> <
>>>> http://twitter.com/**digitalpebble <http://twitter.com/digitalpebble>>
>>>>
>>>>
>>>>
>>>>
>>>
>> --
>> Kaveh Minooie
>>
>
>
>

-- 
Kaveh Minooie

Re: 2.x vs. 1.x speed

Posted by Julien Nioche <li...@gmail.com>.

Hi Kaveh

> Finally, someone posted some metrics, thanks Julien.


No probs. You could have done the same experiment since you felt it was
needed ;-)


> I just need to point out, in addition to Renato's question, the size of
> the data that you chose to use for the test is not really fair. IMHO, for
> 2.x to be somewhat realistic, you're going to want to have a crawldb with at
> least a few hundred million links and a fetch list of again at least 1 or
> 2 million. What do you guys think?


If realistic means close to real usage then you'll find that most people
use Nutch on dbs smaller than 3M URLs. From that point of view, this
experiment is realistic. It is also realistic in the sense that it can
be reproduced easily: fetching millions of URLs would take a lot of time
and having hundreds of millions of pages requires a larger cluster ($$$$).

Again, I mentioned in my post that it would be interesting to do it with a
larger cluster but at least we can discuss the limitations in design and
implementation that Nutch 2 currently has.

The main point is that this test was a relative comparison between 2
versions, not an absolute benchmark of how long it takes to run a crawl.
Knowing how Nutch 2 fares in relation to Nutch 1 is quite useful,
especially with new users expecting a more recent version to perform better
than the old one.

Feel free to try on a larger cluster and dataset and share your results. It
will be interesting to see if there is a difference from what I measured on
a single machine.

Thanks

Julien






>
>
> On 09/16/2013 10:42 AM, Renato Marroquín Mogrovejo wrote:
>
>> Thanks for sharing Julien! These are indeed interesting results.
>> Just a quick question, did you use a single server to run this? or did you
>> set up a minimum number of servers for it? this is because HBase or
>> Cassandra will improve their latency if we scale them out.
>>
>>
>> Renato M.
>>
>>
>> 2013/9/16 Markus Jelsma <ma...@openindex.io>
>>
>>  Thanks! That was interesting.
>>>
> --
> Kaveh Minooie
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: 2.x vs. 1.x speed

Posted by kaveh minooie <ka...@plutoz.com>.

Finally, someone posted some metrics, thanks Julien. I just need to
point out, in addition to Renato's question, the size of the data that
you chose to use for the test is not really fair. IMHO, for 2.x to be
somewhat realistic, you're going to want to have a crawldb with at least
a few hundred million links and a fetch list of again at least 1 or
2 million. What do you guys think?

On 09/16/2013 10:42 AM, Renato Marroquín Mogrovejo wrote:
> Thanks for sharing Julien! These are indeed interesting results.
> Just a quick question, did you use a single server to run this? or did you
> set up a minimum number of servers for it? this is because HBase or
> Cassandra will improve their latency if we scale them out.
>
>
> Renato M.
>
>
> 2013/9/16 Markus Jelsma <ma...@openindex.io>
>
>> Thanks! That was interesting.
>>
>

-- 
Kaveh Minooie

Re: 2.x vs. 1.x speed

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Thanks for sharing Julien! These are indeed interesting results.
Just a quick question: did you use a single server to run this, or did you
set up a minimum number of servers for it? I ask because HBase or
Cassandra will improve their latency if we scale them out.


Renato M.


2013/9/16 Markus Jelsma <ma...@openindex.io>

> Thanks! That was interesting.
>

RE: 2.x vs. 1.x speed

Posted by Markus Jelsma <ma...@openindex.io>.

Thanks! That was interesting.

-----Original message-----
From: Julien Nioche<li...@gmail.com>
Sent: Monday 16th September 2013 18:45
To: user@nutch.apache.org; dev@nutch.apache.org
Cc: Otis Gospodnetic <ot...@yahoo.com>
Subject: Re: 2.x vs. 1.x speed

Guys,

Following the discussion we had some time ago about comparing 1.x with 2.x, we did some tests and put the results on

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html <http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html>

Feel free to comment.

Best,

Julien

On 24 August 2013 05:51, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com <ma...@gmail.com>> wrote:

I am sure that Renato (if he is watching) can chime in as well.

In Gora, for Hadoop-native stores such as Avro, HBase and Accumulo, when we
execute a query with GoraInputFormat, getPartitions returns GoraInputSplits
natively, which means splits are obtained for MapReduce jobs such as many of
the jobs we run in Nutch. On the other hand, stores such as Cassandra and web
service stores such as DynamoDB do not (currently) support Hadoop out of the
box (we are working on the former and hope to have it implemented in Gora
soon), so it is not as simple to get partitions in the same way we would in
a Hadoop-native store. We therefore obtain a single partition to be used as
the InputSplit for the MR job. This is certainly an area for concern and right
now a bottleneck for some operations. We continue to work on this.
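
To make the single-partition bottleneck concrete, here is a rough Python sketch. It is not Gora code: `Split`, `partitions` and `largest_task` are hypothetical names made up for illustration, and the row and partition counts are arbitrary. It only models the idea that each split feeds one map task, so a store that reports one partition leaves one task to read everything.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Split:
    """A contiguous key range handed to one map task (hypothetical model)."""
    start: int
    end: int  # exclusive

    def size(self) -> int:
        return self.end - self.start


def partitions(num_rows: int, native_partitions: int) -> List[Split]:
    """Divide num_rows into native_partitions roughly equal key ranges.

    native_partitions=1 models a store that cannot report its own
    partitioning, so the whole query becomes a single split.
    """
    bounds = [num_rows * i // native_partitions
              for i in range(native_partitions + 1)]
    return [Split(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


def largest_task(splits: List[Split]) -> int:
    """Rows read by the busiest map task, assuming one task per split."""
    return max(s.size() for s in splits)


# Assumed numbers: one million rows, 20 partitions vs. a single partition.
hadoop_native = partitions(1_000_000, native_partitions=20)
single = partitions(1_000_000, native_partitions=1)

print(len(hadoop_native), largest_task(hadoop_native))
print(len(single), largest_task(single))
```

With these assumed numbers the partitioned store yields 20 splits of 50,000 rows each, while the single-partition store yields one split of 1,000,000 rows, so the map phase cannot be parallelised at all.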

On Wednesday, August 7, 2013, Julien Nioche <lists.digitalpebble@gmail.com> wrote:
> Hi Otis
>
> Definitely *not* the fetching speed. Actually, everything but the
> fetching speed. The fetcher is pretty much the same as 1.x and anyway the
> performance with fetching is pretty much always limited by the politeness
> settings, not the implementation.
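
As a back-of-the-envelope illustration of why politeness dominates, the following Python sketch computes an upper bound on fetch throughput. It assumes the simplest possible model (one request at a time per host, with a fixed delay between requests to the same host); `max_fetch_rate` is a hypothetical helper, and the host count and delay are made-up values, not measurements.

```python
def max_fetch_rate(num_hosts: int, delay_seconds: float) -> float:
    """Upper bound on pages per second under a per-host politeness delay.

    Assumes at most one request per host every delay_seconds, regardless
    of how fast the fetcher implementation itself is.
    """
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return num_hosts / delay_seconds


# Assumed values: 100 distinct hosts, 5 seconds between requests per host.
print(max_fetch_rate(100, 5.0))
```

Under these assumptions the crawl cannot exceed 20 pages per second in either 1.x or 2.x, which is why the fetch step is not where the two are expected to differ.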

>
> Re backends: some backend implementations are more mature than others. The
> one for HBase is probably the most widely used, the Cassandra one has
> been greatly improved, in particular performance-wise, the SQL one is
> broken, etc. We need to measure this, as this is just a gut feeling at this
> stage.
>
> Now for what is slower and why: again, this has to be measured, but I expect
> 2.x to be slower partly because of [1], i.e. the filtering of entries is
> not done by the backends (some might provide a way of doing it) but
> on the client side, when we create the input for mapred. In other
> words we pull things from the backend just to discard them. Since 2.x does
> not have segments like 1.x (which the fetch + parse mapreduce jobs take as
> single input) we scan the whole table even if we want to fetch or parse a
> handful of entries.
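
The difference described above can be sketched in a few lines of Python. This is only a toy model, not Nutch or Gora code: the table is a plain dict, the `batch_id` field name and both function names are assumptions chosen for illustration, and the sizes are arbitrary.

```python
from typing import Dict, List, Optional, Tuple


def scan_with_client_side_filter(
    table: Dict[str, Dict[str, Optional[str]]], batch_id: str
) -> Tuple[List[str], int]:
    """2.x-style (as described above): pull every row, filter afterwards.

    Returns the selected keys and the number of rows pulled from the store.
    """
    pulled = 0
    selected = []
    for key, row in table.items():  # every row is read from the backend
        pulled += 1
        if row.get("batch_id") == batch_id:
            selected.append(key)  # most rows are read only to be discarded
    return selected, pulled


def read_segment(segment: List[str]) -> Tuple[List[str], int]:
    """1.x-style: the job input is a segment holding only the chosen entries."""
    return list(segment), len(segment)


# Assumed data: 10,000 URLs in the table, only 5 marked for this batch.
table = {
    f"http://example.com/{i}": {"batch_id": "b1" if i < 5 else None}
    for i in range(10_000)
}

keys, rows_pulled = scan_with_client_side_filter(table, "b1")
segment_keys, rows_read = read_segment(keys)
print(len(keys), rows_pulled, rows_read)
```

Both paths end up with the same 5 entries, but the client-side filter reads all 10,000 rows to find them, whereas the segment-based job reads only 5.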

>
> On the other hand, 2.x specifies which columns to retrieve for a given job,
> whereas 1.x will for instance deserialize the CrawlDatum entirely. The
> metadata objects are costly to read/write, so 2.x might have the upper hand
> from that point of view since it pulls and deserializes only what it needs.
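
A minimal Python sketch of that column-projection idea follows. It is a toy model only: each field is stored as its own JSON-encoded column, and `encode_row`, `read_columns`, `read_whole_record` and the field names are hypothetical, not the actual Nutch or Gora schema.

```python
import json
from typing import Any, Dict, Iterable


def encode_row(row: Dict[str, Any]) -> Dict[str, bytes]:
    """Store each field as its own serialized column."""
    return {name: json.dumps(value).encode("utf-8")
            for name, value in row.items()}


def read_columns(stored: Dict[str, bytes],
                 wanted: Iterable[str]) -> Dict[str, Any]:
    """2.x-style: decode only the columns the job asked for."""
    return {name: json.loads(stored[name].decode("utf-8"))
            for name in wanted if name in stored}


def read_whole_record(stored: Dict[str, bytes]) -> Dict[str, Any]:
    """1.x-style: decode every field, including bulky metadata."""
    return read_columns(stored, list(stored))


# Assumed record: two small fields plus a large metadata map.
stored = encode_row({
    "status": "fetched",
    "fetch_time": 1379349838,
    "metadata": {f"key{i}": "x" * 100 for i in range(500)},
})

projected = read_columns(stored, ["status", "fetch_time"])
full = read_whole_record(stored)
print(sorted(projected), len(full["metadata"]))
```

A job that only needs `status` and `fetch_time` never touches the serialized metadata column here, which is the saving being described; how large that saving is in practice would still have to be measured.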

>
> Finally, the most costly steps in a large crawl in 1.x are the generation
> and update, as we have to read/write the crawldb entirely. The way the
> updates are done in 2.x is different and should be a lot faster.
>
> Please correct me if I am wrong. Some of this is based on my
> understanding of 2.x, which dates back quite a while, and some of the
> stuff might have changed in the meantime. The performance would probably
> vary a lot based on the fine tuning of each backend implementation, but
> having some basic comparison would confirm some of the assertions above.
>
> Julien
>
> [1] https://issues.apache.org/jira/browse/GORA-119
>
>> Julien, could you please elaborate a bit about your comment about speed
>> depending on the backend used?
>>
>> Yes, you were the person I was referring to :)
>>
>> Oh, and I *believe* you said it was the fetching speed that was different
>> between 1.x and 2.x.  Is that right?  Or is some other phase slower in 2.x?
>>
>> Thanks,
>> Otis
>> ----
>> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
>> http://sematext.com/spm
>>
>> >________________________________
>> > From: Julien Nioche <lists.digitalpebble@gmail.com>
>> >To: "user@nutch.apache.org" <user@nutch.apache.org>
>> >Sent: Tuesday, August 6, 2013 10:54 AM
>> >Subject: Re: 2.x vs. 1.x speed
>> >
>> >Hi Otis,
>> >
>> >That certainly depends on the backend used but on the whole it wouldn't be
>> >surprising. Would be good to have some data to substantiate it. I am
>> >planning to put my intern on the case and have some basic comparison as
>> >soon as she gets a good grip of Hadoop / Nutch etc... but if someone else
>> >wants to do it please go ahead.
>> >
>> >In case I happen to be the person who told you that Otis, well at least I
>> >am consistent ;-)
>> >
>> >Julien
>> >
>> >On 6 August 2013 09:08, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
>> >
>> >> Hello,
>> >>
>> >> At some point earlier this year I spoke to a person who told me 2.x is
>> >> (a little?) slower than 1.x.  Is that still the case?
>> >>
>> >> Thanks,
>> >> Otis
>> >> --
>> >> Solr & ElasticSearch Support -- http://sematext.com/
>> >> Performance Monitoring -- http://sematext.com/spm
>> >>
>> >
>> >--
>> >Open Source Solutions for Text Engineering
>> >
>> >http://digitalpebble.blogspot.com/
>> >http://www.digitalpebble.com
>> >http://twitter.com/digitalpebble
>> >
>>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
*Lewis*

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble