Posted to user@nutch.apache.org by Iqbal Shaikh <iq...@transformuk.com> on 2014/08/29 13:20:29 UTC

Nutch Confusion

Hi All,

I am doing a POC to help decide whether we should use Nutch 1.9 or 2.2.1.

We would be indexing our crawled data in Elasticsearch 1.x.

I know that 2.2.1 provides out-of-the-box (OTB) support for Elasticsearch 0.x, but to use 2.x I would need to change the code (ElasticWriter.java). That makes it a customised Nutch installation, which I would rather avoid.

However, even though 1.9 doesn't provide Elasticsearch as the default, it does support 1.x OTB, which means no code change at all. This is a big advantage.

I don't really need the flexibility provided by Gora, as we're OK with using HBase. Also, 2.x doesn't seem to have periodic commits, unlike 1.9.

Therefore I was wondering what others think, as I am not sure about the roadmap going forward: are we going to cease 1.x at some point and migrate the missing functionality to 2.x, or are we going to continue to have two parallel versions?

Any suggestions to help me make my decision, please?

Thanks,

Iqbal Shaikh
Transform is a trading division of Engine Partners UK LLP, a limited liability partnership registered in England & Wales with registered number OC365812. 
Our registered office is at 60 Great Portland Street, London  W1W 7RT, United Kingdom. 
A list of our members is open for inspection at our registered office.

RE: Nutch Confusion

Posted by Iqbal Shaikh <iq...@transformuk.com>.
Thanks for all the suggestions. I think I am getting there :)

Personally, I think maintaining two version lines causes a lot of confusion for newbies like me.

Perhaps, as someone suggested earlier, just have a big 2.3 or even 3.x with all the functionality of 1.x and 2.x in one bundle and deprecate the 1.x version altogether. That's how the rest of the open source libraries work, isn't it?

Iqbal Shaikh
________________________________________
From: Ali Nazemian [alinazemian@gmail.com]
Sent: 29 August 2014 15:15
To: user@nutch.apache.org
Subject: Re: Nutch Confusion

Dear Iqbal,
Hi,
As far as I know, if you don't need the Gora mapper for running Nutch over
HBase, MySQL, etc., it is better to use version 1.x, since some Nutch
functionality is not implemented in version 2.x and Nutch 1.x provides
better performance for crawling web pages. The ES index writer is not
difficult to set up in Nutch 1.x: disable the Solr index writer and enable
the ES one by adding it to the plugins part of nutch-site.xml.
Regards.


--
A.Nazemian


Re: Nutch Confusion

Posted by Ali Nazemian <al...@gmail.com>.
Dear Iqbal,
Hi,
As far as I know, if you don't need the Gora mapper for running Nutch over
HBase, MySQL, etc., it is better to use version 1.x, since some Nutch
functionality is not implemented in version 2.x and Nutch 1.x provides
better performance for crawling web pages. The ES index writer is not
difficult to set up in Nutch 1.x: disable the Solr index writer and enable
the ES one by adding it to the plugins part of nutch-site.xml.
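
As a rough sketch of the change described above (not part of the original
message): in Nutch 1.x the active plugins are selected by the
plugin.includes property, so an override in conf/nutch-site.xml might look
like the fragment below. The plugin list is assumed to start from the stock
default with indexer-solr swapped for indexer-elastic, and the elastic.*
property names and values are assumptions that should be checked against
the nutch-default.xml shipped with your release.

```xml
<!-- conf/nutch-site.xml (Nutch 1.x): illustrative fragment only -->
<configuration>
  <!-- Assumption: the stock plugin list, with indexer-solr replaced by
       indexer-elastic so that the Elasticsearch index writer is loaded
       instead of the Solr one. Adjust to match your own plugin list. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Assumed connection settings; all values are placeholders. -->
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
</configuration>
```

Because nutch-site.xml overrides nutch-default.xml, nothing in the Nutch
source tree needs to change for this, which is the "no code change" point
raised earlier in the thread.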
Regards.


On Fri, Aug 29, 2014 at 5:38 PM, Iqbal Shaikh <iq...@transformuk.com>
wrote:

> Thanks, Julien, for the prompt response.
>
> Actually, since the model for version 1.9 is all plugin-based, I shouldn't
> be expecting an ivy.xml like the one in 2.x to have an Elastic config. So
> ignore that comment.
>
> Yes, I mean HDFS (I'm new to big data and Hadoop). Isn't HBase the default
> one for 1.9 too?
>
> Based on your clarification, perhaps this article is a bit misleading:
> http://www.infoq.com/articles/nioche-apache-nutch2
> Maybe there should be a follow-up to that article.
>
> Thanks,
> Iqbal Shaikh

-- 
A.Nazemian

RE: Nutch Confusion

Posted by Iqbal Shaikh <iq...@transformuk.com>.
Thanks, Julien, for the prompt response.

Actually, since the model for version 1.9 is all plugin-based, I shouldn't be expecting an ivy.xml like the one in 2.x to have an Elastic config. So ignore that comment.

Yes, I mean HDFS (I'm new to big data and Hadoop). Isn't HBase the default one for 1.9 too?

Based on your clarification, perhaps this article is a bit misleading: http://www.infoq.com/articles/nioche-apache-nutch2
Maybe there should be a follow-up to that article.

Thanks,
Iqbal Shaikh
________________________________________
From: Julien Nioche [lists.digitalpebble@gmail.com]
Sent: 29 August 2014 12:41
To: user@nutch.apache.org
Subject: Re: Nutch Confusion

Hi Iqbal,

> Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1
> version.
>
> We would be indexing our crawled data in ElasticSearch 1.x version.
>
> I know the 2.2.1 version provides OTB support for Elastic 0.x version but
> to use 2.x I need to change the code (ElasticWriter.java) This means its a
> customised Nutch installation, which I don't prefer.
>
> However even though 1.9 doesn't provide Elastic as default it does support
> 1.x OTB which means no code change at all. And this is a big advantage.
>

What do you mean by '1.9 doesn't provide ES by default'?


>
> I don't really need the flexibility provided by GORA as we're ok to use
> HBase.


Do you mean HDFS?


> Also 2.x doesn't seem to have periodic commits compared to 1.9
>
> Therefore I was wondering what others think as am not sure about the
> Roadmap going forward, are we going to cease 1.x at some point and migrate
> the missing functionality to 2.x or we going to continue to have two
> parallel versions.
>

More likely two parallel versions. 2.x is not making much progress. IMHO, of
the two versions, 1.x is not the one that is going to die first ;-)


>
> Any suggestion to help me make my decision please?
>

See the discussion on this list
(http://www.mail-archive.com/user@nutch.apache.org/msg12550.html). 1.x is
more robust, faster and more actively maintained. Since it sounds like you
don't need any specific features from 2.x, I'd recommend using 1.x.

HTH

Julien




>
> Thanks,
>
> Iqbal Shaikh




--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nutch Confusion

Posted by Julien Nioche <li...@gmail.com>.
Hi Iqbal,

> Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1
> version.
>
> We would be indexing our crawled data in ElasticSearch 1.x version.
>
> I know the 2.2.1 version provides OTB support for Elastic 0.x version but
> to use 2.x I need to change the code (ElasticWriter.java) This means its a
> customised Nutch installation, which I don't prefer.
>
> However even though 1.9 doesn't provide Elastic as default it does support
> 1.x OTB which means no code change at all. And this is a big advantage.
>

What do you mean by '1.9 doesn't provide ES by default'?


>
> I don't really need the flexibility provided by GORA as we're ok to use
> HBase.


Do you mean HDFS?


> Also 2.x doesn't seem to have periodic commits compared to 1.9
>
> Therefore I was wondering what others think as am not sure about the
> Roadmap going forward, are we going to cease 1.x at some point and migrate
> the missing functionality to 2.x or we going to continue to have two
> parallel versions.
>

More likely two parallel versions. 2.x is not making much progress. IMHO, of
the two versions, 1.x is not the one that is going to die first ;-)


>
> Any suggestion to help me make my decision please?
>

See the discussion on this list
(http://www.mail-archive.com/user@nutch.apache.org/msg12550.html). 1.x is
more robust, faster and more actively maintained. Since it sounds like you
don't need any specific features from 2.x, I'd recommend using 1.x.

HTH

Julien




>
> Thanks,
>
> Iqbal Shaikh




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble