Posted to user@nutch.apache.org by karl wettin <ka...@snigel.net> on 2006/07/05 17:58:15 UTC

Alternatives

I have never looked at how Nutch works, nor have I used it. My questions
might just be RTFM-related.

Lately people have asked me to help them out with simple domain-specific
web-indexing services. The requirements are, as usual when I'm involved,
to run on very limited resources. What I did was combine my very
simple and minimalistic servlet engine <http://sf.net/project/servlet>
with Lucene and NekoHTML, extracting only the content "frame" from
the static design of the site.

This made me think of two things:

It would be nice to use the features of Nutch instead of my own hacky
stuff. How tightly bound is Nutch to the J2EE container? Would it be a big
job to make it run on an alternative GUI? Or is the container used for
more than the GUI? I.e., do all services (crawler, etc.) run within the
container? Do they have to?

It would be nice to automatically detect the content "frame" by
analyzing the DOM tree of the pages on a site. Is there such a feature
in Nutch, contributed to it, or publicly available in some other project?
I'd be more than happy to discuss, write and contribute it back if I end
up making one.

-- 
karl

Re: Alternatives

Posted by karl wettin <ka...@snigel.net>.

On Wed, 2006-07-05 at 20:32 -0700, Stefan Groschupf wrote:
> Crawler & Co. are command line tools.
> The servlet container is only used to deliver search results, but you
> can use the servlet that just provides XML.

Ah, excellent. Thanks for letting me avoid reading the manual ;)

> > It would be nice to automatically detect the content "frame" by
> > analyzing the DOM tree of the pages on a site. Is there such a
> > feature in Nutch, contributed to, or publicly available in some other
> > project?
> 
> I'm not sure I clearly understand your question here.
> Nutch has an HTML parser plugin that extracts only the content from an
> HTML page.

Do you mean all the text in an HTML document, or do you mean the content
area of an HTML document? This is what I mean: newspaper X has a static
design with navigation, some ads in text format, etc. In the middle of
the document is the article. I want to detect the article area and index
only this information, as all the other information is irrelevant and
recurs more or less unchanged in all documents. I presume it would not be
too tough to do based on an HTML DOM.
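
The idea above (drop whatever recurs across a site's pages, keep what is
unique to each page) can be sketched roughly as follows. This is only a
minimal illustration, not a Nutch or NekoHTML feature. It assumes each page
has already been split into text blocks, for example one block per DOM
element, and the names ContentFrameSketch, stripRecurringBlocks and maxShare
are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: not part of Nutch, Lucene or NekoHTML.
class ContentFrameSketch {

    // Collapse whitespace and case so trivially different copies of a block match.
    static String normalize(String block) {
        return block.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Keeps, for each page, only the blocks that occur on at most
    // maxShare (0.0 to 1.0) of all pages. Each page is a list of text blocks.
    static List<List<String>> stripRecurringBlocks(List<List<String>> pages, double maxShare) {
        // Count on how many pages each normalized block appears (once per page).
        Map<String, Integer> pagesContaining = new HashMap<>();
        for (List<String> page : pages) {
            Set<String> seenOnThisPage = new HashSet<>();
            for (String block : page) {
                String key = normalize(block);
                if (seenOnThisPage.add(key)) {
                    pagesContaining.merge(key, 1, Integer::sum);
                }
            }
        }

        // Drop blocks whose share of pages exceeds the threshold.
        List<List<String>> result = new ArrayList<>();
        for (List<String> page : pages) {
            List<String> kept = new ArrayList<>();
            for (String block : page) {
                double share = pagesContaining.get(normalize(block)) / (double) pages.size();
                if (share <= maxShare) {
                    kept.add(block);
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("Home | News | Sport", "First article text", "Ad: buy now"),
            Arrays.asList("Home | News | Sport", "Second article text", "Ad: buy now"),
            Arrays.asList("Home | News | Sport", "Third article text", "Ad: buy now"));
        // Navigation and ad blocks appear on every page, so they are dropped.
        System.out.println(stripRecurringBlocks(pages, 0.5));
    }
}
```

Comparing block text is the simplest variant. Comparing the DOM path of each
block across pages would be closer to detecting the "frame" itself, at the
cost of needing a real HTML parser.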

Re: Alternatives

Posted by Stefan Groschupf <sg...@media-style.com>.

Hi,
> It would be nice to use the features of Nutch instead of my own hacky
> stuff. How bound is Nutch to the J2EE-container? Would it be a big job
> to make it run on an alternative GUI? Or is the container used for
> more than GUI? I.e. do all services (crawler, etc.) run within the
> container? Do they have to?
Crawler & Co. are command line tools.
The servlet container is only used to deliver search results, but you
can use the servlet that just provides XML.
You can also use the NutchBean API to integrate it into a custom
application without any servlet container.
>
> It would be nice to automatically detect the content "frame" by
> analyzing the DOM tree of the pages on a site. Is there such a feature
> in Nutch, contributed to, or publicly available in some other project?

I'm not sure I clearly understand your question here.
Nutch has an HTML parser plugin that extracts only the content from an
HTML page.

Stefan

Re: [Nutch-general] Alternatives

Posted by Jason Calabrese <ma...@jasoncalabrese.com>.

You can start up a crawler by just creating a job. You can basically just
copy and tweak some code from the main method in org.apache.nutch.crawl.Crawl.

In my application we are working at a lower level and first create the 
crawl_generate dir, then start a fetch, then we parse the fetched results, 
and then index the parsed results on our own.

I think with a little hacking you can make use of a lot of the Nutch/Hadoop
code in any framework.

The only place I had problems was getting the Nutch progress integrated into
our application so that an admin can see where the fetcher is from within our
application. In order to do that I added a few hacks to the fetcher and
parser, but I think there may be a better way.

On Wednesday 05 July 2006 8:58 am, karl wettin wrote:
> I have never looked at how Nutch works, nor have I used it. My questions
> might just be RTFM-related.
>
> Lately people have asked me to help them out with simple domain-specific
> web-indexing services. The requirements are, as usual when I'm involved,
> to run on very limited resources. What I did is to combine my very
> simple and minimalistic servlet engine <http://sf.net/project/servlet>
> with Lucene and NekoHTML, extracting only the content "frame" from
> the static design of the site.
>
> This made me think of two things:
>
> It would be nice to use the features of Nutch instead of my own hacky
> stuff. How bound is Nutch to the J2EE-container? Would it be a big job
> to make it run on an alternative GUI? Or is the container used for
> more than GUI? I.e. do all services (crawler, etc.) run within the
> container? Do they have to?
>
> It would be nice to automatically detect the content "frame" by
> analyzing the DOM tree of the pages on a site. Is there such a feature
> in Nutch, contributed to, or publicly available in some other project?
> I'd be more than happy to discuss, write and contribute it back if I end
> up making one.