Posted to user@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2014/09/01 11:11:08 UTC

Re: [RELEASE] Apache Nutch 1.9

Hi Guy,

> I'm confused as to what are the significant differences between 1.x and
> 2.x.
> Is there a bit of history that I could read about why the development of
> the two parallel to each other happened?
>

See for instance https://www.youtube.com/watch?v=KyHPBtRlo80 (in particular
around 28:00). There are other resources in
http://wiki.apache.org/nutch/Presentations which explain the differences.

> As I'm just starting out with Nutch/Solr/Hadoop, I'd like to know which
> path would be best for me to follow. So far, 1.x has appeared to be the
> best choice for me, but is that going to change in the next iteration?
> Confused. And a little scared.
>

Don't worry, Nutch 1.x (i.e. HDFS-based) will definitely stay. As explained
in the discussion with Lewis, naming Nutch-GORA '2.x' was probably a bit
of a mistake. Both flavours of Nutch will keep living parallel existences.

Julien

PS: all this and a lot more will be explained at the Nutch workshop at
ApacheCon EU http://sched.co/1pbE15n as well as in Sebastian's talk
http://sched.co/1nyYa7b


>
> Guy McDowell
> guymcdowell@gmail.com
> http://www.GuyMcDowell.com
>
>
>
>
>
> On Fri, Aug 29, 2014 at 11:29 AM, Mattmann, Chris A (3980) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
>
> > +1, great.
> >
> > I'd like to have a conversation about versioning.
> >
> > Since we're at 1.9, my suggestion would be to have the
> > next in the trunk series (1.x) move to version 3.x post
> > 1.9 for the release.
> >
> > Nutch2 remains Nutch and can be worked on there. That
> > would give us a nice split in the diversionary branch
> > paths for Nutch.
> >
> > Cheers,
> > Chris
> >
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattmann@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Julien Nioche <li...@gmail.com>
> > Reply-To: "user@nutch.apache.org" <us...@nutch.apache.org>
> > Date: Friday, August 29, 2014 1:35 AM
> > To: "user@nutch.apache.org" <us...@nutch.apache.org>
> > Subject: Re: [RELEASE] Apache Nutch 1.9
> >
> > >Hi Lewis,
> > >
> > >A few comments below.
> > >
> > >> I use Nutch 2.x as it enables me to do analytics over the data I am
> > >> crawling. This is my justification for trying to maintain and further the
> > >> development on that branch over the last while.
> > >>
> > >
> > >Just out of interest, what sort of analytics do you do and why is it better
> > >to do it in 2.x than 1.x?
> > >
> > >
> > >> I am also extremely interested in the technologies supported within the
> > >> Nutch 2.X stack and I like keeping up with their development and using them
> > >> to fix my problems if and when the problems arise.
> > >> I like having fine grained control over my storage architecture. This is
> > >> also a pro for me.
> > >>
> > >
> > >Another way to look at it is that having to maintain 2 versions in Nutch is
> > >an absolute pain, especially given that there aren't very many active
> > >committers.
> > >IMHO the mistake we made a few years ago was to name the GORA-based branch
> > >'2.x' as it leads people to think that it is an improvement over 1.x. We
> > >should have called it something like Nutch-GORA or something along these
> > >lines (the original version was called NutchBase) to underline that it is a
> > >different beast, not necessarily a better one.
> > >
> > >Most users are probably not bothered about the underlying technologies so
> > >much and just want the stuff to work, not fix problems. In my view 2.x is
> > >not production ready, but an experimental branch.
> > >
> > >
> > >
> > >> The performance Julien talks about (and please correct me if I am wrong
> > >> Julien) is not so much Nutch related as it is Gora. Different Gora backends
> > >> perform differently, this is itself driven by who wishes to maintain them.
> > >>
> > >
> > >Not really. The overall performance has improved a bit with the latest
> > >version of GORA but is not that different from what we reported in
> > >http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
> > >Some backends are probably better than others indeed but all of them are
> > >atrocious compared to 1.x. I think the reason for that is that these NoSQL
> > >tools are optimized to provide random reads/writes to the data and in Nutch
> > >we use them mostly in a sequential manner. Whether the functionalities we
> > >gain are worth the effort depends on everyone's use case.
> > >
> > >
> > >> On another note, we've identified that for users, Nutch 2.X is a bloody
> > >> pain to provision and get running. This is a problem for this branch and
> > >> for the people that invest and possibly waste time trying to determine
> > >> revisions, etc.
> > >>
> > >
> > >Could not agree more. That and the fact that it puts additional constraints
> > >on the hardware and means servers with bigger specs (££££)
> > >
> > >
> > >>
> > >> It is my intention to build different Vagrant flavours for each Nutch 2.X
> > >> stack.
> > >> https://issues.apache.org/jira/browse/NUTCH-1812
> > >>
> > >> If ANYONE on this list is interested in helping with this effort then I
> > >> would dedicate some time to document the process on the wiki so that it can
> > >> be reproduced for everyone's benefit. I feel that this would be a huge move
> > >> forward for the 2.X branch.
> > >>
> > >
> > >Thanks for your enthusiasm and efforts Lewis!
> > >
> > >For anyone interested in 2.x - there are quite a few issues you can help
> > >with if you feel so inclined, see
> > >https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
> > >
> > >Julien
> > >
> > >--
> > >
> > >Open Source Solutions for Text Engineering
> > >
> > >http://digitalpebble.blogspot.com/
> > >http://www.digitalpebble.com
> > >http://twitter.com/digitalpebble
> >
> >
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: [RELEASE] Apache Nutch 1.9

Posted by Mo Omer <be...@gmail.com>.
All, terribly sorry for MY late replies too! 

I have a Docker container set up for 2.2.1; if anyone is interested, I can make an open one available. It's configured with Maestro NG so that you can start it up with the knowledge of where your Solr instance is (i.e. just by passing an env variable).

I'll see if I can carve out some time to help on the issues, but I'm pretty swamped at the moment with work, meetup groups etc. 

Let me know if I can help out in any way that's not super time critical,

Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos
