Posted to user@nutch.apache.org by Fadzi Ushewokunze <de...@butterflycluster.com> on 2006/12/03 02:37:10 UTC

Re: Limiting crawl to specific list of URLS

hi k7,

Put the urls you want to crawl into a text file inside a folder called
urls/. Then in conf/regex-urlfilter.txt add the regular expressions
for the url patterns you want included/excluded.
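
For example, something like this (a minimal sketch; example.com and the
file names are placeholders, and the crawl command assumes the 0.8-style
crawl tool):

urls/seed.txt:

    http://www.example.com/page1.html
    http://www.example.com/page2.html

conf/regex-urlfilter.txt (the first matching pattern wins, so the
catch-all exclude goes last):

    # accept only the exact seed urls
    +^http://www\.example\.com/page1\.html$
    +^http://www\.example\.com/page2\.html$
    # reject everything else, so no outlinks get followed
    -.

Then a one-shot crawl with depth 1 should fetch just the seeds without
following any links:

    bin/nutch crawl urls -dir crawl -depth 1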

Hope this answers your question.

Fadzi



On Wed, 2006-11-29 at 15:34 -0800, Kevvin Sevvvin wrote:
> Hi Everybody,
> 
> I'm really new to Nutch. I've read through the documentation and many
> months of mailing list archives and I don't think this question has
> been answered.
> 
> I have two tasks I would like Nutch to handle. I would like it to
> crawl and index ONLY a specific set of urls. This is a stronger
> limitation than confining to specific sites (so
> db.ignore.external.links is insufficient): it should not follow ANY
> links on pages in the list of urls.
> 
> Secondly, after creating the crawl and index of specific sites, I
> would like to occasionally add SINGLE urls to the index.
> 
> Is this possible? If so, is it trivially possible with something like
> '--topN 0' (or should that be '--topN 1'??)? Or could I create a
> single local web page with all the links on it and run the crawler
> with '-depth 1'?
> 
> Apologies if this is an over-asked or misguided question; if so I'd
> appreciate pointers to appropriate documentation or code so I can
> figure it out on my own.
> 
> Thanks!
> -k7


Building Nutch 0.7.x

Posted by Daniel López <D....@uib.es>.
Hi there,

I'm trying to build Nutch 0.7.x from the sources and I'm running into
some problems. MP3Parser.java gives me compile-time errors:
ParseException is not compatible with throws clause in 
Parser.getParse(Content) 
Nutch-0.7/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java 
line 44

If I check the Parser class, it is indeed true that the Parser
interface does not declare any exception to be thrown, while the
MP3Parser class does declare a ParseException in its throws clause.
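
A minimal, self-contained illustration of the mismatch (the types here
are simplified stand-ins, not the real Nutch classes; the
catch-and-report fix at the end is only one possible way to reconcile
the signatures):

    interface Parser {
        // declares no checked exception, like the branch-0.7 interface
        String getParse(String content);
    }

    class ParseException extends Exception {}

    class MP3Parser implements Parser {
        // Does NOT compile: an implementation may not declare checked
        // exceptions that the interface method does not declare.
        //   public String getParse(String c) throws ParseException {...}

        // Compiles: handle the exception inside the implementation.
        public String getParse(String content) {
            try {
                return doParse(content);
            } catch (ParseException e) {
                return null; // or report failure via a status object
            }
        }

        private String doParse(String content) throws ParseException {
            if (content == null) throw new ParseException();
            return content;
        }
    }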

org.apache.nutch.parse.rtf.RTFParseFactory has exactly the same error.

Both classes also try to create a ParseData object using a constructor
that does not exist (they are missing the status argument).

TestMP3Parser and TestRTFParser try to use a method 
Protocol.getContent(String) that does not exist (one would guess it is 
now getOutput(String)).

I tried both the packed version from Sourceforge and then the branch-0.7 
from Subversion and both have those compile errors... am I missing 
something? Are those parsers no longer in use?

Cheers!
D.

PS: I found some WinNT scripts (.bat files) on the net to run Nutch on
Windows and I updated them to run with branch 0.7, in case someone is
interested. I'm still getting some errors when trying to run some
commands, not from the batch files but from Nutch internals, and that's
why I wanted to build Nutch from the sources.



-- 
-------------------------------------------
Daniel Lopez Janariz (D.Lopez@uib.es)
Web Services
Centre for Information and Technology
Balearic Islands University
(SPAIN)
-------------------------------------------

Re: Limiting crawl to specific list of URLS

Posted by Lukas Vlcek <lu...@gmail.com>.
Kevvin,

As for occasionally adding new urls, you can use the inject tool
(search the nutch archives for "inject tool"). The newly added
(injected) urls will then be crawled during the next crawl cycle.
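
A sketch of the inject step, assuming the 0.8-style command layout
(crawl/crawldb and urls/new are placeholder paths):

    # add the new urls to the existing crawl database;
    # the next generate/fetch/updatedb round will pick them up
    bin/nutch inject crawl/crawldb urls/new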

Regards,
Lukas
