Posted to user@nutch.apache.org by Fadzi Ushewokunze <de...@butterflycluster.com> on 2006/12/03 02:37:10 UTC
Re: Limiting crawl to specific list of URLS
hi k7,
Add the URLs you want to crawl into a seed file in a folder called urls/.
Then, in conf/regex-urlfilter.txt, add regular expressions for the URL
patterns you want included or excluded.
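For example, a minimal conf/regex-urlfilter.txt that accepts only pages
under a single domain and rejects everything else might look like the
following (the domain is illustrative; adapt the patterns to your own
URL list):

    # accept anything under example.com (hypothetical seed domain)
    +^http://([a-z0-9]*\.)*example.com/
    # reject everything else
    -.

And since you don't want ANY links followed, running the crawl with
-depth 1 should fetch only the seed URLs themselves, along these lines
(a 0.8-style invocation; the exact arguments vary between versions):

    bin/nutch crawl urls -dir crawl -depth 1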
Hope this answers your question.
Fadzi
On Wed, 2006-11-29 at 15:34 -0800, Kevvin Sevvvin wrote:
> Hi Everybody,
>
> I'm really new to Nutch. I've read through the documentation and many
> months of mailing list archives, and I don't think this question has
> been answered.
>
> I have two tasks I would like Nutch to handle. First, I would like it
> to crawl and index ONLY a specific set of URLs. This is a stronger
> limitation than confining the crawl to specific sites (so
> db.ignore.external.links is insufficient): it should not follow ANY
> links on the pages in the list of URLs.
>
> Secondly, after creating the crawl and index of those specific sites,
> I would like to occasionally add SINGLE URLs to the index.
>
> Is this possible? If so, is it trivially possible with something like
> '--topN 0' (or should that be '--topN 1'?)? Or could I create a single
> local web page with all the links on it and run the crawler with
> '-depth 1'?
>
> Apologies if this is an often-asked or misguided question; if so, I'd
> appreciate pointers to the appropriate documentation or code so I can
> figure it out on my own.
>
> Thanks!
> -k7
Building Nutch 0.7.x
Posted by Daniel López <D....@uib.es>.
Hi there,
I'm trying to build Nutch 0.7.x from the sources and I'm running into
some problems:
MP3Parser.java gives me a compile-time error:

    ParseException is not compatible with throws clause in Parser.getParse(Content)
    Nutch-0.7/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java, line 44
If I check the Parser interface, it is indeed true that it does not
declare any exception to be thrown, while the MP3Parser class does
declare a ParseException in its throws clause.
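The mismatch boils down to something like the following sketch
(simplified stand-in types, not the actual Nutch sources); it
intentionally does not compile, because Java forbids an implementing
method from declaring a checked exception that the interface method
does not:

    class Content {}
    class Parse {}
    class ParseException extends Exception {}

    interface Parser {
        // the interface method declares no checked exceptions
        Parse getParse(Content content);
    }

    class MP3Parser implements Parser {
        // compile error: an implementing method may not declare a checked
        // exception that the interface method does not declare
        public Parse getParse(Content content) throws ParseException {
            return new Parse();
        }
    }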
org.apache.nutch.parse.rtf.RTFParseFactory has exactly the same error.
Both classes also try to create a ParseData object using a constructor
that does not exist (they are missing the status argument).
TestMP3Parser and TestRTFParser try to use a method
Protocol.getContent(String) that does not exist (one would guess it is
now getOutput(String)).
I tried both the packaged version from SourceForge and the branch-0.7
checkout from Subversion, and both have these compile errors... am I
missing something? Are those parsers no longer in use?
Cheers!
D.
P.S.: I found some WinNT scripts (.bat files) on the net to run Nutch on
Windows and updated them to work with branch 0.7, in case anyone is
interested. I'm still getting some errors when running some commands
(from the Nutch internals, not the batch files themselves), which is why
I wanted to build Nutch from the sources.
--
-------------------------------------------
Daniel Lopez Janariz (D.Lopez@uib.es)
Web Services
Centre for Information and Technology
Balearic Islands University
(SPAIN)
-------------------------------------------
Re: Limiting crawl to specific list of URLS
Posted by Lukas Vlcek <lu...@gmail.com>.
Kevvin,
As for occasionally adding new URLs, you can use the inject tool (search
the Nutch archives for "inject tool"). The newly added (injected) URLs
will then be crawled during the next crawl cycle.
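For example, with a 0.8-style layout (the paths here are illustrative,
and the exact arguments differ between Nutch versions, so check the
usage printed by bin/nutch inject):

    # inject new seed URLs into an existing crawl database
    bin/nutch inject crawl/crawldb new-urls/

where new-urls/ is a directory containing a plain text file with one URL
per line. The injected URLs are picked up by the next
generate/fetch/updatedb cycle.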
Regards,
Lukas
On 12/3/06, Fadzi Ushewokunze <de...@butterflycluster.com> wrote:
>
> hi k7,
>
> Add the URLs you want to crawl into a seed file in a folder called urls/.
> Then, in conf/regex-urlfilter.txt, add regular expressions for the URL
> patterns you want included or excluded.
>
> Hope this answers your question.
>
> Fadzi