Posted to user@nutch.apache.org by Jason Calabrese <ma...@jasoncalabrese.com> on 2006/07/06 17:26:53 UTC
Re: [Nutch-general] Alternatives
You can start up a crawler by just creating a job. You can basically just
copy/tweak some code from the main method in org.apache.nutch.crawl.Crawl.
In my application we are working at a lower level and first create the
crawl_generate dir, then start a fetch, then we parse the fetched results,
and then index the parsed results on our own.
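To make the ordering of those steps concrete, here is a minimal sketch of a staged runner. It is not Nutch code: StagedCrawl and Stage are hypothetical names, and the lambdas are placeholders for wherever the real generate, fetch, parse, and index calls would go. It only assumes that every step operates on the same segment directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: StagedCrawl and Stage are illustrative names, not Nutch classes.
final class StagedCrawl {

    /** One step of the crawl, run against a segment directory. */
    interface Stage {
        void run(Path segment);
    }

    // LinkedHashMap keeps the stages in the order they were registered.
    private final Map<String, Stage> stages = new LinkedHashMap<>();
    private final List<String> completed = new ArrayList<>();

    /** Registers a named stage; stages run in registration order. */
    StagedCrawl stage(String name, Stage step) {
        stages.put(name, step);
        return this;
    }

    /** Runs every registered stage once, in order, on the given segment. */
    void run(Path segment) {
        for (Map.Entry<String, Stage> entry : stages.entrySet()) {
            entry.getValue().run(segment);
            completed.add(entry.getKey());
        }
    }

    /** Names of the stages that have finished, in completion order. */
    List<String> completedStages() {
        return Collections.unmodifiableList(new ArrayList<>(completed));
    }

    public static void main(String[] args) {
        // Placeholder bodies: a real application would call into Nutch here.
        StagedCrawl crawl = new StagedCrawl()
            .stage("generate", seg -> System.out.println("generate fetch list in " + seg))
            .stage("fetch", seg -> System.out.println("fetch pages for " + seg))
            .stage("parse", seg -> System.out.println("parse fetched content in " + seg))
            .stage("index", seg -> System.out.println("index parsed results from " + seg));
        crawl.run(Paths.get("segments", "demo"));
        System.out.println(crawl.completedStages());
    }
}
```

Keeping each step behind a small interface like this is one way to swap in your own indexing step while leaving the earlier steps untouched.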
I think with a little hacking you can make use of a lot of the Nutch/Hadoop
code in any framework.
The only place I had problems was integrating Nutch's progress reporting into
our application so that an admin can see how far along the fetcher is. To do
that I added a few hacks to the fetcher and parser, but I think there may be
a better way.
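One possible shape for such progress reporting, sketched under assumptions: a small thread-safe tracker that the crawl steps update and the admin page polls. CrawlProgress is a hypothetical class, not part of Nutch, and this is not the hack described above; how the fetcher or parser would actually call it is left open.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: CrawlProgress is an illustrative name, not a Nutch class.
final class CrawlProgress {
    private final AtomicReference<String> stage = new AtomicReference<>("idle");
    private final AtomicLong done = new AtomicLong();
    private final AtomicLong total = new AtomicLong();

    /** Called when a step (for example "fetch") starts with a known item count. */
    void begin(String stageName, long expectedItems) {
        stage.set(stageName);
        done.set(0);
        total.set(expectedItems);
    }

    /** Called by worker threads each time one item has been processed. */
    void itemDone() {
        done.incrementAndGet();
    }

    /**
     * Human-readable status for an admin page. The three fields are read
     * separately, so a snapshot taken during begin() may mix old and new
     * values; that is acceptable for a display-only counter.
     */
    String snapshot() {
        return stage.get() + ": " + done.get() + "/" + total.get();
    }

    public static void main(String[] args) {
        CrawlProgress progress = new CrawlProgress();
        progress.begin("fetch", 3);
        progress.itemDone();
        progress.itemDone();
        System.out.println(progress.snapshot()); // prints "fetch: 2/3"
    }
}
```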
On Wednesday 05 July 2006 8:58 am, karl wettin wrote:
> I have never looked at how Nutch works, nor have I used it. My questions
> might just be RTFM-related.
>
> Lately people have asked me to help them out with simple domain-specific
> web-indexing services. The requirements are, as usual when I'm involved,
> to run on very limited resources. What I did is to combine my very
> simple and minimalistic servlet engine <http://sf.net/project/servlet>
> with Lucene and NekoHTML, extracting only the content "frame" from
> the static design of the site.
>
> This made me think of two things:
>
> It would be nice to use the features of Nutch instead of my own hacky
> stuff. How bound is Nutch to the J2EE-container? Would it be a big job
> to make it run on an alternative GUI? Or is the container used for
> more than GUI? I.e. do all services (crawler, etc.) run within the
> container? Do they have to?
>
> It would be nice to automatically detect the content "frame" by
> analyzing the DOM tree of the pages on a site. Is there such a feature
> in Nutch, contributed to, or publicly available in some other project?
> I'd be more than happy to discuss, write and contribute it back if I end
> up making one.
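On the content "frame" question, a rough sketch of one simple heuristic, stated plainly as a stand-in and not as a Nutch feature: instead of real DOM-tree analysis, treat any text block that repeats on most pages of a site as static design and keep the rest. TemplateFilter and the maxShare parameter are hypothetical, and the input is assumed to be text blocks that an HTML parser such as NekoHTML has already extracted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: TemplateFilter is an illustrative name, not a Nutch class.
final class TemplateFilter {

    /**
     * Returns the blocks of pages.get(pageIndex) that occur on at most
     * maxShare (0.0 to 1.0) of all pages. Blocks above that share are
     * assumed to be navigation, footers, or other static design.
     */
    static List<String> contentBlocks(List<List<String>> pages, int pageIndex, double maxShare) {
        // Count on how many distinct pages each block appears.
        Map<String, Integer> pageCount = new HashMap<>();
        for (List<String> page : pages) {
            for (String block : new HashSet<>(page)) {
                pageCount.merge(block, 1, Integer::sum);
            }
        }
        List<String> content = new ArrayList<>();
        for (String block : pages.get(pageIndex)) {
            double share = pageCount.get(block) / (double) pages.size();
            if (share <= maxShare) {
                content.add(block);
            }
        }
        return content;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("Home | About", "Article one", "Copyright"),
            Arrays.asList("Home | About", "Article two", "Copyright"),
            Arrays.asList("Home | About", "Article three", "Copyright"));
        System.out.println(contentBlocks(pages, 1, 0.5)); // prints "[Article two]"
    }
}
```

Exact string matching is the weakest part of this sketch; comparing DOM paths or normalized text would be closer to what the question describes.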