Posted to user@nutch.apache.org by Divjot Singh <di...@gmail.com> on 2016/11/01 05:19:46 UTC
Re: Nutch 1.x or 2.x

Hi

I have been using Nutch 2.3 with HBase for the past 1.5 years to crawl
over 30 websites. Yes, it was quite a pain to set up the 2.x branch to work
with HBase, as there were many unresolved bugs in the system. I was able to
fix them gradually, but it took a lot of time and effort to make it
stable.

Because the application needed the data to be stored directly in HBase, we
decided to go with 2.x, as it supports databases out of the box. I do love
the way the source code is written for 2.x. If you go through it, it is
quite overwhelming at first. With so much modularity, things get quite
confusing. But once you start debugging the code in an IDE, you get a good
understanding. Also, 2.x is written on the Map-Reduce framework, so that
brings some added complexity.

I haven't tried 1.x, but what I have heard from my colleagues is that it is
quite stable and easy to configure and use. 2.x has a lot of configuration
settings that need to be tweaked and tested. 2.x is also scalable, as we do
run it directly on Hadoop clusters, but again there is effort involved
because it will fail at first and you will have to debug it again and again.

So it would be better to start with 1.x if you don't have to write data
directly into a database, and also if you are not very proficient in
Java and Map-Reduce frameworks.

Thanks
Divjot

On Tue, Nov 1, 2016 at 12:54 AM, Markus Jelsma <ma...@openindex.io>
wrote:

> It is stable in the sense that it relies on old proven technology. The
> underlying principle of 1.x has not changed much over the years. 2.x, with
> Gora, had trouble years ago, although much less these days.
>
> The point with Gora is that Gora itself, and the chosen storage backend,
> could introduce problems. There are simply more points of failure. One
> example is choosing Mongo as backend, with a 512 byte limit in the key
> field. This will cause problems for long URLs, especially 4 byte CJK
> URLs, limiting such a URL to 128 characters in length. The list is almost
> endless: Cassandra is not very stable out-of-the-box, and HBase has
> peculiar errors sometimes coming from nowhere, which recently led to data
> loss. Does Gora have support for Solr? Solr cloud has finally been very
> stable for a few years.
>
> This just illustrates the point that 2.x introduces new pieces the
> developer or your system administrator can worry about. It will hurt you if
> you haven't got the experience and knowledge of these systems. 1.x doesn't
> in the same sense, and it provides more features you probably end up
> porting to 2.x if you want them.
>
> I also would like to take the opportunity again to advise against using
> many low-powered machines rather than fewer high-octane machines; it is a
> very bad idea and extremely cost-ineffective. This setup will also
> certainly break default Hadoop settings. Settings must change in
> large-scale clusters, settings you might not yet know about. The number of
> file descriptors needed alone requires reconfiguring certain settings.
>
>
> -----Original message-----
> > From:Michael Coffey <mc...@yahoo.com.INVALID>
> > Sent: Monday 31st October 2016 19:24
> > To: user@nutch.apache.org
> > Subject: Re: Nutch 1.x or 2.x
> >
> > When you say that 1.x is more stable, what does that mean?
> >
> >
> >       From: Markus Jelsma <ma...@openindex.io>
> >  To: "user@nutch.apache.org" <us...@nutch.apache.org>
> >  Sent: Monday, October 31, 2016 9:39 AM
> >  Subject: RE: Nutch 1.x or 2.x
> >
> > Hello - if you want to crawl big, performance is not really a problem,
> especially using Hadoop output file compression. We chose 1.x, simply
> because it is more stable and feature rich.
> >
> > Using 1.x, it is quite easy to crawl a billion records.
> >
> > Also, do not run on many small machines; your overhead will kill your
> cluster-wide performance. It is a complete waste of resources.
> >
> > -----Original message-----
> > > From:Michael Coffey <mc...@yahoo.com.INVALID>
> > > Sent: Sunday 30th October 2016 18:22
> > > To: user@nutch.apache.org
> > > Subject: Re: Nutch 1.x or 2.x
> > >
> > > Newbie question: I am trying to decide between Nutch 1.x or 2.x. The
> application is to crawl a large portion of the www using a massive number
> (thousands) of small machines (<= 2GB RAM each). I like the idea of the
> simpler architecture and pluggable storage backend of 2.x. However, I am
> concerned about things I've read about 2.x being less stable and possibly
> less efficient than 1.x. Are these concerns valid at this time?
> > >
> > >
> > >
> > >
> > >
> >
> >
>
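
The key-size arithmetic in Markus's message (a 512 byte key limit and 4 byte
characters giving a 128 character URL) can be checked with a short sketch.
This is only an illustration: the 512 byte figure is taken from the thread
and not verified against any particular Mongo version, the function names
are made up for this example, and it assumes the URL is stored as raw UTF-8
rather than percent-encoded.

```python
# Sketch only: KEY_LIMIT_BYTES is the figure quoted in the thread, and both
# helper names are hypothetical. Assumes the key is the URL's UTF-8 bytes.
KEY_LIMIT_BYTES = 512


def max_chars(bytes_per_char: int, limit: int = KEY_LIMIT_BYTES) -> int:
    """Longest URL, in characters, if every character needs bytes_per_char bytes."""
    return limit // bytes_per_char


def fits_key_limit(url: str, limit: int = KEY_LIMIT_BYTES) -> bool:
    """True if the UTF-8 encoding of url is no longer than limit bytes."""
    return len(url.encode("utf-8")) <= limit


if __name__ == "__main__":
    print(max_chars(1))  # ASCII-only URL
    print(max_chars(3))  # common CJK characters take 3 bytes in UTF-8
    print(max_chars(4))  # supplementary-plane characters take 4 bytes
```

A check like fits_key_limit could be run over a seed list before choosing a
backend, to see how many URLs would exceed the limit.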