Posted to user@nutch.apache.org by KK <di...@gmail.com> on 2009/06/06 09:39:50 UTC

Use nutch for crawling purpose?

Hi All,
I've been using Solr and Lucene for some time. I started with Solr and then
moved to Lucene because of the extra flexibility and openness it gives me,
but I like both. My requirement is to crawl web pages and add them to a
Lucene index. So far I've been doing the crawling manually and adding the
pages to the Lucene index through the Lucene APIs. The pages contain a mix
of roughly 5% English and the rest non-English (Indian-language) content.
To handle stemming and stop-word removal for the English part, I wrote a
small custom analyzer for Lucene, and that's working fairly well (a rough
sketch of such an analyzer appears after the list below).

Now I'm thinking of doing the crawling part with Nutch. Does this sound OK?
I went through the Nutch wiki and found that it supports a bunch of file
types (HTML/XML, PDF, ODF, PPT, MS Word, etc.), but for me HTML is good
enough. The wiki also says that Nutch builds distributed indexes using
Hadoop (I've used Hadoop a bit), which uses the map-reduce architecture.
For my requirement I don't need all of that: distributed indexing is not
required, so essentially I don't need the Hadoop/map-reduce parts. To
summarize, what I want is:
#. Crawl the web pages and have Nutch hand the content over to me rather
than posting it to Lucene by itself. Essentially I want to step in between
crawling and indexing, because I have to run my custom analyzer before the
content is indexed by Lucene.
#. HTML parsing is good enough for me (no need for PDF/ODF/MS Word, etc.).
#. No need for Hadoop/map-reduce.
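(Roughly, such an analyzer looks like the following. This is only a
simplified sketch against the Lucene 2.x API that was current at the time,
not my exact code; the class name is made up and the filter constructors
changed in later Lucene releases.)

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch of a mixed-language analyzer: English stop words are removed and
// Latin-script tokens get Porter stemming; Indic tokens carry no English
// suffixes, so the stemmer leaves them essentially untouched.
public class MixedLanguageAnalyzer extends Analyzer {

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(reader);
    stream = new LowerCaseFilter(stream);
    // StopAnalyzer.ENGLISH_STOP_WORDS is the stock English stop-word list
    // shipped with Lucene 2.x (it became a Set in later versions).
    stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
    stream = new PorterStemFilter(stream);
    return stream;
  }
}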

I'd like Nutch users to let me know their views. The other option is to
look for a Java open-source crawler that can do the job, but I haven't
found a suitable one, and I'm more interested in using something really
solid and well tested like Nutch. Let me know your opinions.

Thanks,
KK.

Re: Use nutch for crawling purpose?

Posted by Raymond Balmès <ra...@gmail.com>.
If you've done it for Lucene, it shouldn't be that difficult in Nutch; I'm
no Java guru either.
Look at the wiki, it's rather well explained.
If you add fields, which I did, you need both an indexing and a query
plug-in, but it is quite straightforward. My only real problem was
discovering that field names have to be lowercase, otherwise the search
bean won't find them.
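To give a rough idea, an indexing plug-in is just an implementation of
Nutch's IndexingFilter extension point, something like the sketch below.
It is written against the Nutch 1.0-era interface and is only an
illustration: the exact method signatures differ between Nutch versions,
and the class and field names here are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Illustrative indexing filter that adds one extra field per document.
public class MyIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Keep the field name lowercase, otherwise the search bean won't see it.
    doc.add("mytitle", parse.getData().getTitle());
    return doc;  // returning null would drop the page from the index
  }

  // Declared by the 1.0-era interface for the Lucene backend; a no-op here.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

The plug-in also needs a plugin.xml descriptor that registers the class
against the org.apache.nutch.indexer.IndexingFilter extension point, and
the plug-in id has to be listed in the plugin.includes property in
nutch-site.xml.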

-Ray-

Re: Use nutch for crawling purpose?

Posted by KK <di...@gmail.com>.
Thanks Raymond.
So, as per your mail, I should use Nutch for both crawling and indexing,
right? And for pre-processing the input before indexing, I have to write a
plugin that does the job of the custom analyzer I was using earlier with
Lucene, right? How easy or difficult is it to write this analyzer as a
plugin for Nutch, given that I'm an average Java programmer? Let me know
your views.

Thanks,
KK

Re: Use nutch for crawling purpose?

Posted by Raymond Balmès <ra...@gmail.com>.
I had started with your approach initially, i.e. building my indexing on
Lucene only... but eventually I dropped it completely.
I now work entirely out of Nutch with a custom indexing plug-in and I'm
quite happy with it. The only downside I found is that the Nutch search
bean does not offer as much functionality as I needed, so I had to build my
own query plug-in there too. Later I decided to build a scoring plug-in as
well to focus the crawl: it works great.
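In case it helps, the query side can be as small as a subclass of one of
the helper classes in the old org.apache.nutch.searcher package, registered
against the QueryFilter extension point. The sketch below is only an
illustration against the 0.9/1.x searcher API (that whole package was
dropped in later Nutch versions), and the class and field names are
placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.RawFieldQueryFilter;

// Illustrative query filter so that queries like "mytitle:foo" are
// translated into a search on the same-named index field.
public class MyTitleQueryFilter extends RawFieldQueryFilter {

  private Configuration conf;

  public MyTitleQueryFilter() {
    super("mytitle");  // must match the lowercase field added at index time
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

The query plug-in's plugin.xml also has to declare which query fields it
handles, alongside the extension declaration for
org.apache.nutch.searcher.QueryFilter.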

I don't really see the need for Hadoop right now either, but I like the
idea that it will be there if I need it, because my crawls might become
quite big/long.

Not sure if I will move to Solr indexing in the future; I'm avoiding it at
the moment to minimize complexity.

-Raymond-


