Posted to user@nutch.apache.org by Arthur Pemberton <pe...@gmail.com> on 2010/08/10 13:32:43 UTC

Plug-in for complete user control

Good day,

I'm trying to use Nutch to build a niche search engine, and I would
like to have full control over URLs. I would like to precisely control
which URLs get crawled, followed, stored and indexed. Is it possible to
do this as a plug-in? What and where should I be reading to do this?

Coding my side of the logic is trivial, but I have no idea yet how to
interface with Nutch. So far I have just done a basic 'Intranet Crawl'
(with which I had a slight problem which I'll post about later) and
followed that with a command line search using NutchBean.

But I want more control than simply feeding urls/nutch and
conf/crawl-urlfilter.txt.

-- 
Fedora 13
(www.pembo13.com)

Re: Plug-in for complete user control

Posted by Scott Gonyea <me...@sgonyea.com>.
Don't forget to back up... A lot. Eclipse is a godawful, project-eating
monster.

sg

On Tue, Aug 10, 2010 at 5:53 AM, Arthur Pemberton <pe...@gmail.com> wrote:

> On Tue, Aug 10, 2010 at 8:50 AM, Alex McLintock
> <al...@gmail.com> wrote:
> > On 10 August 2010 12:55, Arthur Pemberton <pe...@gmail.com> wrote:
> >> I assume the plugin API is properly documented, I haven't yet looked.
> >> I was waiting for some direction before I went in any one way first.
> >
> > Well this is a bit of a leading question. There is some javadoc, and
> > you can find a fair amount on the wiki
> > http://wiki.apache.org/nutch/AboutPlugins
> >
> > But is it "proper" documentation? If I had bought this software then I
> > would want more than the existing examples. You do have to delve into
> > the code to understand how to use it. But this is OSS.
>
> I understand, thanks.
>
> >> Are there any recommended instructions at least for setting up a dev
> >> environment with one of the popular free Java IDEs?
> >
> > http://wiki.apache.org/nutch/RunNutchInEclipse
>
> Sweet.
>
>
>
> --
> Fedora 13
> (www.pembo13.com)
>

Re: Plug-in for complete user control

Posted by Arthur Pemberton <pe...@gmail.com>.
On Tue, Aug 10, 2010 at 8:50 AM, Alex McLintock
<al...@gmail.com> wrote:
> On 10 August 2010 12:55, Arthur Pemberton <pe...@gmail.com> wrote:
>> I assume the plugin API is properly documented, I haven't yet looked.
>> I was waiting for some direction before I went in any one way first.
>
> Well this is a bit of a leading question. There is some javadoc, and
> you can find a fair amount on the wiki
> http://wiki.apache.org/nutch/AboutPlugins
>
> But is it "proper" documentation? If I had bought this software then I
> would want more than the existing examples. You do have to delve into
> the code to understand how to use it. But this is OSS.

I understand, thanks.

>> Are there any recommended instructions at least for setting up a dev
>> environment with one of the popular free Java IDEs?
>
> http://wiki.apache.org/nutch/RunNutchInEclipse

Sweet.



-- 
Fedora 13
(www.pembo13.com)

Re: Plug-in for complete user control

Posted by Alex McLintock <al...@gmail.com>.
On 10 August 2010 12:55, Arthur Pemberton <pe...@gmail.com> wrote:
> I assume the plugin API is properly documented, I haven't yet looked.
> I was waiting for some direction before I went in any one way first.

Well this is a bit of a leading question. There is some javadoc, and
you can find a fair amount on the wiki
http://wiki.apache.org/nutch/AboutPlugins

But is it "proper" documentation? If I had bought this software then I
would want more than the existing examples. You do have to delve into
the code to understand how to use it. But this is OSS.

> Are there any recommended instructions at least for setting up a dev
> environment with one of the popular free Java IDEs?

http://wiki.apache.org/nutch/RunNutchInEclipse

But there is nothing for NetBeans. I'm not sure what other free IDEs
exist. Usually I am an Emacs man :-)


Alex

Re: Plug-in for complete user control

Posted by Arthur Pemberton <pe...@gmail.com>.
On Tue, Aug 10, 2010 at 7:44 AM, Alex McLintock
<al...@gmail.com> wrote:
> On 10 August 2010 12:32, Arthur Pemberton <pe...@gmail.com> wrote:
>> I'm trying to use Nutch to build a niche search engine, and I would
>> like to have full control over URLs. I would like to precisely control
>> which URLs get crawled, followed, stored and indexed. Is it possible to
>> do this as a plug-in? What and where should I be reading to do this?
>
> Hello Arthur,
>
> Yes you can do this, but it would require you to learn about the
> plugin system - remove the filter plugins you don't want, and add in
> one that you write which implements the algorithm you want.
>
> Plugins are simple Java classes which implement one of several
> extension-point interfaces - i.e. comply with the Nutch Plugin API. The
> best way of understanding them is to look at the existing plugin code.
> There is a little in the wiki - but there could be more.

I assume the plugin API is properly documented; I haven't yet looked.
I was waiting for some direction before heading in any one direction.

Are there any recommended instructions at least for setting up a dev
environment with one of the popular free Java IDEs?

> You need to specify which plugins are used in config files, and if
> using Hadoop, you may need to do some fancy stuff to make sure they
> are deployed properly. (Sometimes you need to rebuild Nutch in order
> to get it to use plugins. Or so I am told).
>
> I've been slowly learning about plugins and can maybe help you
> off-list if you like. I too have been interested in niche search
> engines. I'm also investigating OpenBixo which is a web mining toolkit
> inspired by Nutch. Your desire for total control may steer you that
> way.

I'll take a look at OpenBixo. I may take you up on that offer of
assistance if my own efforts fail.

Thank you.

-- 
Fedora 13
(www.pembo13.com)

Re: Plug-in for complete user control

Posted by Alex McLintock <al...@gmail.com>.
On 10 August 2010 12:32, Arthur Pemberton <pe...@gmail.com> wrote:
> I'm trying to use Nutch to build a niche search engine, and I would
> like to have full control over URLs. I would like to precisely control
> which URLs get crawled, followed, stored and indexed. Is it possible to
> do this as a plug-in? What and where should I be reading to do this?

Hello Arthur,

Yes you can do this, but it would require you to learn about the
plugin system - remove the filter plugins you don't want, and add in
one that you write which implements the algorithm you want.

Plugins are simple Java classes which implement one of several
extension-point interfaces - i.e. comply with the Nutch Plugin API. The
best way of understanding them is to look at the existing plugin code.
There is a little in the wiki - but there could be more.
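
To give a feel for what "one that you write" might look like, here is a
minimal, self-contained sketch of a URL filter. It is not Nutch code: the
nested Filter interface is a local stand-in, on the assumption that the
URL-filter extension point (org.apache.nutch.net.URLFilter) boils down to
"return the URL to keep it, return null to drop it". The class name
HostAllowList and the host allow-list rule are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

class UrlFilterSketch {

    // Stand-in for the filter contract (assumption: mirrors the shape of
    // Nutch's URL-filter extension point; a real plugin would implement
    // the Nutch interface rather than this one).
    interface Filter {
        // Returns the URL unchanged to accept it, or null to reject it.
        String filter(String urlString);
    }

    // Hypothetical rule: accept only http/https URLs whose host is on an
    // explicit allow-list. Replace the body with your own logic.
    static final class HostAllowList implements Filter {
        private final Set<String> allowedHosts;

        HostAllowList(Set<String> allowedHosts) {
            this.allowedHosts = allowedHosts;
        }

        @Override
        public String filter(String urlString) {
            try {
                URI uri = new URI(urlString);
                String scheme = uri.getScheme();
                String host = uri.getHost();
                if (scheme == null || host == null) {
                    return null; // relative or opaque URL: reject
                }
                boolean web = scheme.equalsIgnoreCase("http")
                        || scheme.equalsIgnoreCase("https");
                boolean allowed =
                        allowedHosts.contains(host.toLowerCase(Locale.ROOT));
                return (web && allowed) ? urlString : null;
            } catch (URISyntaxException e) {
                return null; // unparseable URL: reject
            }
        }
    }

    public static void main(String[] args) {
        Filter f = new HostAllowList(Set.of("example.org"));
        System.out.println(f.filter("http://example.org/page"));
        System.out.println(f.filter("http://other.example.com/"));
    }
}
```

In a real plugin the same decision logic would sit behind the Nutch
interface and be packaged with a plugin descriptor; the existing
urlfilter-* plugins in the source tree are the place to check the details.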

You need to specify which plugins are used in config files, and if
using Hadoop, you may need to do some fancy stuff to make sure they
are deployed properly. (Sometimes you need to rebuild Nutch in order
to get it to use plugins. Or so I am told).
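
As a rough illustration of the config side: the sketch below assumes the
plugin.includes property (set in conf/nutch-site.xml) is a regular
expression that must match a plugin's id in full for that plugin to be
loaded. That is my understanding of how it works, so verify it against
conf/nutch-default.xml for your version. The id "urlfilter-niche" is a
made-up name for a custom plugin, and the class below is plain Java, not
Nutch code.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PluginIncludesSketch {

    // Returns the plugin ids that the include expression matches in full.
    static List<String> select(String includes, List<String> pluginIds) {
        Pattern pattern = Pattern.compile(includes);
        return pluginIds.stream()
                .filter(id -> pattern.matcher(id).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical plugin.includes value: a few stock-style ids plus
        // the custom "urlfilter-niche" plugin.
        String includes =
                "protocol-http|urlfilter-(regex|niche)|parse-(text|html)|index-basic";

        // Hypothetical set of plugins found in the plugins directory.
        List<String> installed = List.of(
                "protocol-http", "urlfilter-regex", "urlfilter-niche",
                "urlfilter-prefix", "parse-html", "index-basic");

        // Only ids matched by the expression are kept; urlfilter-prefix
        // is left out because the expression does not name it.
        System.out.println(select(includes, installed));
    }
}
```

Dropping a filter you don't want is then a matter of leaving its id out of
the expression, and adding your own means naming its id there.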

I've been slowly learning about plugins and can maybe help you
off-list if you like. I too have been interested in niche search
engines. I'm also investigating OpenBixo which is a web mining toolkit
inspired by Nutch. Your desire for total control may steer you that
way.


Alex