Posted to user@nutch.apache.org by foobar3001 <fo...@yahoo.com> on 2008/05/23 04:46:16 UTC

Problems with indexing sub-section of a site

Hello!

In short:

Is it possible to tell Nutch to follow the links through one larger name
space, but only index (add to its database) the content of links that are in
a sub-name space of that?

The background:

I have started to experiment with crawling my blog with Nutch. The problem
is that this blog doesn't have its own domain. Instead, it is hosted on a
larger site, which also hosts discussion forums and other people's blogs.

My URL there is "http://www.geekzone.co.nz/foobar", so naturally I thought
that adding something in the crawl-urlfilter.txt file would help. Something
like this:

          +^http://([a-z0-9]*\.)*geekzone.co.nz/foobar

But look at the bottom of that page: The navigation links to the other pages
in my blog - or to the 'next' page - actually lead out of my namespace. Thus,
they are not being picked up anymore and Nutch never sees the additional
links that I have on those other pages.

Since eventually I would like this to be a bit more generic (I don't want
anything specific for my blog, that's just a test case), I thought that
maybe I have to open it up to the root URL, making the filter something like
this:

          +^http://([a-z0-9]*\.)*geekzone.co.nz

But then it picks up a ton of other stuff that I am not interested in having
in my database.

So, now I'm wondering whether it is possible to tell Nutch to follow links
through one namespace, but only add those pages into its index database that
are in a specific sub-namespace of the first one?

Thank you very much...
-- 
View this message in context: http://www.nabble.com/Problems-with-indexing-sub-section-of-a-site-tp17417650p17417650.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Problems with indexing sub-section of a site

Posted by foobar3001 <fo...@yahoo.com>.


Eric J. Christeson-2 wrote:
> 
> On Thu, May 22, 2008 at 07:46:16PM -0700, foobar3001 wrote:
> Did a quick scan of the page in question, and I noticed the urls are of
> this form:
> 	http://www.geekzone.co.nz/blog.asp?blogid=207
> 
> Could you filter like 
> 
> 	+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207
> 

Hello!

Thank you very much for the reply. Yes, I had noticed that as well,
but filtering site-specific URLs like that was what I wanted to avoid.
I'm trying to find a generic solution, not something that's specific
to this (or any other site).

Basically, tell the Nutch crawler to work for a certain depth through
non-specified-domain links to see if it comes back to pages belonging
to the specified domain again.
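
To make the idea concrete, here is a minimal sketch in Python. It is not
Nutch configuration and not a Nutch API; the link graph, the URLs under
/foobar/, /pager and /forums, and the names crawl, in_target and max_detour
are all made up for illustration. It follows links anywhere, indexes only
pages in the target namespace, and gives up on a path after max_detour
consecutive hops outside that namespace.

```python
import re
from collections import deque

def crawl(seed, links, in_target, max_detour):
    """Breadth-first walk: follow links anywhere, index only target pages.

    links      -- dict of URL -> outgoing URLs (toy stand-in for fetching)
    in_target  -- predicate, True for pages that belong in the index
    max_detour -- consecutive non-target hops tolerated before giving up
    """
    indexed = []
    best = {seed: 0}            # smallest detour count seen per URL
    queue = deque([(seed, 0)])
    while queue:
        url, detour = queue.popleft()
        if in_target(url):
            indexed.append(url)
        for nxt in links.get(url, []):
            # Reaching a target page resets the detour counter.
            nxt_detour = 0 if in_target(nxt) else detour + 1
            if nxt_detour > max_detour:
                continue
            if nxt in best and best[nxt] <= nxt_detour:
                continue
            best[nxt] = nxt_detour
            queue.append((nxt, nxt_detour))
    return indexed

# Hypothetical site layout, for illustration only.
BASE = "http://www.geekzone.co.nz"
SEED = BASE + "/foobar"
LINKS = {
    SEED: [BASE + "/pager?p=2", BASE + "/forums"],
    BASE + "/pager?p=2": [BASE + "/foobar/post-2"],
    BASE + "/forums": [BASE + "/forums/a"],
    BASE + "/forums/a": [BASE + "/forums/b"],
    BASE + "/forums/b": [BASE + "/foobar/post-9"],
}

def in_target(url):
    return re.search(r"^http://www\.geekzone\.co\.nz/foobar", url) is not None

print(crawl(SEED, LINKS, in_target, max_detour=1))
```

Note that the detour pages are still fetched in this sketch; they are only
kept out of the index. A larger max_detour finds more target pages but also
fetches more unrelated ones.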



Re: Problems with indexing sub-section of a site

Posted by "Eric J. Christeson" <Er...@ndsu.edu>.
On Thu, May 22, 2008 at 07:46:16PM -0700, foobar3001 wrote:
> 
> Hello!
> 
> In short:
> 
> Is it possible to tell Nutch to follow the links through one larger name
> space, but only index (add to its database) the content of links that are in
> a sub-name space of that?
> 
> The background:
> 
> I have started to experiment with crawling my blog with Nutch. The problem
> is that this blog doesn't have its own domain. Instead, it is hosted on a
> larger site, which also hosts discussion forums and other people's blogs.
> 
> My URL there is "http://www.geekzone.co.nz/foobar", so naturally I thought
> that adding something in the crawl-urlfilter.txt file would help. Something
> like this:
> 
>           +^http://([a-z0-9]*\.)*geekzone.co.nz/foobar
> 
> But look at the bottom of that page: The navigation links to the other pages
> in my blog - or to the 'next' page - actually lead out of my namespace. Thus,
> they are not being picked up anymore and Nutch never sees the additional
> links that I have on those other pages.
> 
> Since eventually I would like this to be a bit more generic (I don't want
> anything specific for my blog, that's just a test case), I thought that
> maybe I have to open it up to the root URL, making the filter something like
> this:
> 
>           +^http://([a-z0-9]*\.)*geekzone.co.nz
> 
> But then it picks up a ton of other stuff that I am not interested in having
> in my database.
> 
> So, now I'm wondering whether it is possible to tell Nutch to follow links
> through one namespace, but only add those pages into its index database that
> are in a specific sub-namespace of the first one?

Did a quick scan of the page in question, and I noticed the urls are of
this form:
	http://www.geekzone.co.nz/blog.asp?blogid=207

Could you filter like 

	+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207

You'll have to comment out the default rule that skips URLs containing
"?" (the -[?*!@=] line), or put this rule before it.
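
Why the order matters can be sketched in a few lines of Python. This is
only a model of the filter file, not Nutch code: it assumes rules are tried
top to bottom, the first pattern found in the URL decides, and a URL that
matches nothing is rejected. It also assumes the stock skip rule is
-[?*!@=]. The function name first_match_filter is made up.

```python
import re

def first_match_filter(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first match wins; a URL matching no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

URL = "http://www.geekzone.co.nz/blog.asp?blogid=207"
SKIP_QUERIES = r"-[?*!@=]"   # assumed default rule that rejects query URLs
ALLOW_BLOG = r"+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207"

# Skip rule first: the '?' in the URL matches it, so the URL is rejected.
print(first_match_filter([SKIP_QUERIES, ALLOW_BLOG], URL))
# Allow rule first: the blog URL is accepted before the skip rule is tried.
print(first_match_filter([ALLOW_BLOG, SKIP_QUERIES], URL))
```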

Maybe there's something I'm missing, though.

Eric

-- 
Eric J. Christeson                      <Er...@ndsu.edu>
Information Technology Services         (701) 231-8693 (Voice)
Room 242C, IACC Building              
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs which
are copies of the communication structures of these organizations.  (For
example, if you have four groups working on a compiler, you'll get a
4-pass compiler) - Conway's Law