Posted to user@nutch.apache.org by Tor Harald Thorland <li...@strigen.com> on 2007/01/09 10:17:22 UTC

Using Nutch for special content pages

Hello,

I have a question about Nutch.
I'm a total newbie and am wondering:
Is it possible to set up Nutch to crawl any address it finds, and only
store pages where it finds something about a given subject?
I'd like to build a search site for ship/engine-related material, and
was thinking of starting with .no domains. (I have lots of time for
this, and the pages I'm looking for don't really get "outdated",
but I don't want to waste a lot of disk space etc. on pages which
don't include what I'm looking for.)

Best Regards
Tor Harald Thorland

Re: Using Nutch for special content pages

Posted by Damian Florczyk <th...@gentoo.org>.

My company has done something like that, but we needed to write our own
plugin for it.

-- 
Damian Florczyk
Gentoo/NetBSD Development Lead

Re: Using Nutch for special content pages

Posted by Zaheed Haque <za...@gmail.com>.

Hi:

In general terms, the CC plugin looks for the "CC:license" on the web
pages it crawls. You can see an example at http://creativecommons.org/,
at the end of the page: there is a CC logo and some copyright text. If
you view the source you will see the HTML for that part of the page.
Whenever the Nutch crawler finds such a page it indexes the page;
otherwise it deletes the page and moves on to the next one. In essence
this HTML snippet could be anything, i.e. specific text, a group of
text, and so on.
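
To make the "index it or drop it" decision concrete, here is a minimal
sketch in plain Python. It is not the CC plugin's code and does not use
the Nutch API (a real plugin would be written in Java). The names
MarkerFinder and should_index, and the idea of matching on a link
prefix, are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class MarkerFinder(HTMLParser):
    """Collects href values of <a> tags that start with a given prefix."""

    def __init__(self, href_prefix):
        super().__init__()
        self.href_prefix = href_prefix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(self.href_prefix):
                self.matches.append(value)


def should_index(html, href_prefix="http://creativecommons.org/licenses/"):
    """Return True if the page contains the marker, False if it should be dropped.

    The default prefix is an assumed example of a marker; any other
    snippet-matching rule could be substituted here.
    """
    finder = MarkerFinder(href_prefix)
    finder.feed(html)
    return bool(finder.matches)
```

For a different subject the marker test would be swapped for whatever
text or markup identifies the pages you care about.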

Whenever the CC plugin finds a CC page it also adds some CC-specific
fields to the Lucene index for querying etc. I think all of the above,
i.e. the CC parser, CC indexer and CC query filters, are under the CC
plugin directory.
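
As a rough illustration of how the three pieces fit together (parse,
index, query), here is a small stand-alone Python sketch. It only mimics
the flow with dictionaries; it is not Lucene or Nutch code, and the
function names, the regular expression and the field name cc_license are
all hypothetical.

```python
import re

# Assumed marker pattern: captures the licence code from a licence URL.
LICENSE_RE = re.compile(r"creativecommons\.org/licenses/([a-z-]+)/")


def parse_stage(html):
    """Parse step: pull (hypothetical) metadata out of the raw HTML."""
    match = LICENSE_RE.search(html)
    return {"license": match.group(1)} if match else {}


def index_stage(url, html, metadata):
    """Index step: build a document with an extra field, or None to skip the page."""
    if "license" not in metadata:
        return None
    return {"url": url, "content": html, "cc_license": metadata["license"]}


def query_stage(docs, field, value):
    """Query step: return URLs of documents whose field equals value exactly."""
    return [doc["url"] for doc in docs if doc.get(field) == value]
```

A page without the marker yields empty metadata, so index_stage returns
None and nothing is stored for it.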

Cheers

On 1/9/07, Justin Hartman <jj...@gmail.com> wrote:
> On 1/9/07, Zaheed Haque <za...@gmail.com> wrote:
> > There is a Creative Commons plugin in Nutch, under
> > src/plugin/creativecommons, which does something similar and could
> > be a good starting point.
>
> Sorry to change the subject on this one but what exactly does the
> creativecommons plugin do and how would you use it? I've been very
> interested in this plugin but it's not altogether documented that well
> (I don't think).
> --
> Regards
> Justin Hartman
> PGP Key ID: 102CC123
>

Re: Using Nutch for special content pages

Posted by Justin Hartman <jj...@gmail.com>.

On 1/9/07, Zaheed Haque <za...@gmail.com> wrote:
> There is a Creative Commons plugin in Nutch, under
> src/plugin/creativecommons, which does something similar and could be
> a good starting point.

Sorry to change the subject on this one but what exactly does the
creativecommons plugin do and how would you use it? I've been very
interested in this plugin but it's not altogether documented that well
(I don't think).
-- 
Regards
Justin Hartman
PGP Key ID: 102CC123

Re: Using Nutch for special content pages

Posted by Zaheed Haque <za...@gmail.com>.

Hi:

In order to find a specific text, subject, or group of text you need
to process the document, i.e. you need to download the page to your
disk, process it, and then delete or keep it based on your rules. Either
way you still need to download the page. This means you will need a lot
of disk space "temporarily" if you are planning to crawl the world :-)
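
A minimal sketch of that download, process, keep-or-delete flow, in
plain Python rather than as a Nutch plugin. The term list, the threshold
and the function names are assumptions for illustration only, and the
pages are passed in as already-fetched text so the example does not
touch the network.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical topic terms and threshold; tune these for a real crawl.
TOPIC_TERMS = ("ship", "engine", "propeller")
MIN_HITS = 2


def count_hits(text, terms=TOPIC_TERMS):
    """Count case-insensitive whole-word occurrences of the topic terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in terms)


def process_batch(pages, keep_dir, min_hits=MIN_HITS):
    """Store only on-topic pages; everything else is discarded.

    pages maps URL -> fetched text. Each page is first written to a
    temporary directory (the "temporary" disk usage), scored, and copied
    to keep_dir only if it has at least min_hits topic terms.
    """
    keep_dir = Path(keep_dir)
    keep_dir.mkdir(parents=True, exist_ok=True)
    kept = []
    with tempfile.TemporaryDirectory() as tmp:
        for i, (url, text) in enumerate(pages.items()):
            tmp_file = Path(tmp) / f"page_{i}.txt"
            tmp_file.write_text(text, encoding="utf-8")
            if count_hits(tmp_file.read_text(encoding="utf-8")) >= min_hits:
                (keep_dir / tmp_file.name).write_text(text, encoding="utf-8")
                kept.append(url)
        # Leaving the with-block deletes the temporary copies of all pages.
    return kept
```

Processing in batches like this keeps the temporary space bounded by the
batch size rather than by the size of the whole crawl.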

There is a Creative Commons plugin in Nutch, under
src/plugin/creativecommons, which does something similar and could be a
good starting point. As you have a lot of time, it would be best to make
the new plugin a bit generic :-) so we can all enjoy it!

Cheers
