Posted to user@nutch.apache.org by Danilo Fernandes <da...@kelsorfernandes.com.br> on 2013/02/25 22:56:02 UTC

Differences between 2.1 and 1.6

Hi everyone,

Can somebody tell me about the differences between 2.1 and 1.6?

Is the SVN trunk 1.* or 2.*?

Thanks,
Danilo Fernandes


Re: Differences between 2.1 and 1.6

Posted by Tejas Patil <te...@gmail.com>.
Hi Danilo,

On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes <
danilo@kelsorfernandes.com.br> wrote:

> Hi everyone,
>
> Can somebody tell me about the differences between 2.1 and 1.6?
>

[1] and [2] would be informative reads.

>
> Is the SVN trunk 1.* or 2.*?
>

Trunk [3] is 1.x. 2.x can be found here [4].

>
> Thanks,
> Danilo Fernandes
>
>
[1] : http://digitalpebble.blogspot.com/2012/07/nutch-20-is-out-at-last.html
[2] : http://lucene.472066.n3.nabble.com/differences-between-nutch-1-and-nutch-2-td4031548.html
[3] : http://svn.apache.org/repos/asf/nutch/trunk/
[4] : http://svn.apache.org/repos/asf/nutch/branches/2.x/

Thanks,
Tejas Patil

Re: Differences between 2.1 and 1.6

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Markus,
This is very useful, thank you.
Lewis

-- 
*Lewis*

Re: Differences between 2.1 and 1.6

Posted by Julien Nioche <li...@gmail.com>.
Just to add to what Markus said: 2.x should be slower than 1.x, regardless of
the size of the crawl, at least until
https://issues.apache.org/jira/browse/GORA-119 is implemented. What currently
happens in 2.x is that all the entries are read from the backend and then
filtered in GORA as part of the mapreduce jobs, whereas 1.x performs some
operations (fetch / parse) on the content of the segments only, which is
smaller than the whole crawldb. This becomes even more of an issue as the
crawl gets larger, so 1.x is currently the better option at any scale.

It should be a different story when GORA-119 is done, and having numbers to
compare will be very useful, with the added twist that performance will
probably vary a lot depending on the backend and its configuration.
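
A rough illustration of the point above, as a toy Python model rather than
actual Nutch or Gora code: plain lists stand in for the storage backend and for
a 1.x segment, and every name here (make_crawl_table, select_2x_style,
select_1x_style, the "batch" field) is hypothetical. It only counts how many
records each approach has to read to process the same batch.

```python
# Toy model only: plain Python lists stand in for the storage backend and
# the segment. None of these names are Nutch or Gora APIs.

def make_crawl_table(total, batch_id, batch_size):
    """Build `total` URL records; the first `batch_size` belong to the batch."""
    return [
        {"url": f"http://example.com/{i}",
         "batch": batch_id if i < batch_size else None}
        for i in range(total)
    ]

def select_2x_style(table, batch_id):
    """Read every record, then filter inside the job (2.x as described above)."""
    records_read = 0
    selected = []
    for record in table:
        records_read += 1
        if record["batch"] == batch_id:
            selected.append(record)
    return selected, records_read

def select_1x_style(segment):
    """Read only the segment, which already holds just the batch."""
    return list(segment), len(segment)

table = make_crawl_table(total=100_000, batch_id="b1", batch_size=1_000)
# Stand-in for a 1.x segment: the batch is materialised separately up front.
segment = [r for r in table if r["batch"] == "b1"]

batch_2x, read_2x = select_2x_style(table, "b1")
batch_1x, read_1x = select_1x_style(segment)

print(read_2x, read_1x)  # prints: 100000 1000
```

Both approaches end up with the same batch, but the full-scan version touches
every record in the table on each step, so its cost grows with the size of the
whole crawl rather than with the size of the batch.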

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

RE: Differences between 2.1 and 1.6

Posted by Markus Jelsma <ma...@openindex.io>.
Something seems to be missing here. It's clear that 1.x has more features and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better if you are going to crawl on a very large scale, but I still haven't seen any numbers to support this assumption. Nutch 1.x can easily deal with many millions of records, and with billions if you throw some hardware at it.

Most users are not going to crawl millions of records. In that case I would personally choose 1.x. I prefer stability and predictability over performance you are unlikely to need anyway.

Besides our large 1.x research cluster, we still use 1.x in production for all our customers, running locally on a 2-core, 512 MB RAM VPS with a crawldb of over 5 million records. It runs fine and fast, and keeps up with newly discovered URLs. The only significant improvements were a better scoring filter and integrating indexing in the fetcher.
 

Re: Differences between 2.1 and 1.6

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Danilo,

You can check out the architecture changes here
http://wiki.apache.org/nutch/#Nutch_2.x

Nutch trunk (1.7-SNAPSHOT) is here
http://svn.apache.org/repos/asf/nutch/trunk/

2.x is here
http://svn.apache.org/repos/asf/nutch/branches/2.x/

-- 
*Lewis*