Posted to user@nutch.apache.org by Dima Gritsenko <di...@ekreative.com> on 2006/09/04 12:06:22 UTC

adding new URLs to nutch index

Hi, 

We are indexing DMOZ, and we want to add two other URLs for indexing, but we seem to have a problem searching for those two newly added URLs (no results are returned).
Here's what we do to add the new URLs to the Nutch index:
1) Created a dir /url with a "url" file that contains these two URLs:
    http://www.newsvine.com/_feeds/rss2/index
    http://www.technorati.com/blogs/

2) Then the following command is run (it should add our extra URLs to the Nutch DB/index):
    bin/nutch inject crawl/crawldb urls

3) Then start the recrawl:
    bin/recrawl /home/dima/workspace/hapool/ /usr/share/nutch-0.8/crawl/ 3 0
 
We are also using the index-url-category plugin, which assigns URLs to different categories for future filtered search.
Here's what we do:

Add patterns used in regex-urlfilter.txt

# accept anything else
+^http:\/\/www\.technorati\.com\/blogs.*
+.*rss.*

-.

Add patterns used in crawl-urlfilter.txt

# accept hosts in MY.DOMAIN.NAME
+^http:\/\/www\.technorati\.com\/blogs.*
+.*rss.*


# skip everything else
-.
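
As a rough sanity check outside Nutch, the filter rules above can be approximated with grep. The sketch below assumes the filter's first-match semantics (the first +/- rule that matches decides accept or reject) and uses grep's ERE syntax, which does not need the escaped slashes; the function name is illustrative only:

```shell
#!/bin/sh
# Hypothetical stand-in for the urlfilter: rules are tried top to bottom,
# and the first matching rule decides accept (+) or reject (-).
passes_filter() {
  url="$1"
  echo "$url" | grep -qE '^http://www\.technorati\.com/blogs' && { echo accept; return; }
  echo "$url" | grep -qE 'rss' && { echo accept; return; }
  echo reject  # final catch-all rule: -.
}

passes_filter "http://www.technorati.com/blogs/"          # accept
passes_filter "http://www.newsvine.com/_feeds/rss2/index" # accept
passes_filter "http://dmoz.example/some-page"             # reject
```

Note that the catch-all `-.` also rejects everything outside the two patterns, which is what makes ordering matter.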


Patterns used in the index-url-category plugin

rules.properties file

# News
http://newsrss.bbc.co.uk/rss/*=news
http://www.newsvine.com/*=news
.*rss.*=news
.*\.xml=news

# Blogs
.*technorati\.com\/blogs.*=blogs

# Web
.*=web
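
To see which category each of the two new URLs would fall into, here is a quick grep-based simulation of a subset of the rules above. It assumes the plugin applies rules in file order and takes the first match; the function name and the simplified rule set are illustrative only:

```shell
#!/bin/sh
# Hypothetical first-match categorizer over a subset of the rules above.
categorize() {
  url="$1"
  echo "$url" | grep -qE 'rss'                   && { echo news;  return; }
  echo "$url" | grep -qE '\.xml$'                && { echo news;  return; }
  echo "$url" | grep -qE 'technorati\.com/blogs' && { echo blogs; return; }
  echo web  # catch-all rule: .*=web
}

categorize "http://www.newsvine.com/_feeds/rss2/index"  # news (matches .*rss.*)
categorize "http://www.technorati.com/blogs/"           # blogs
categorize "http://foo.example/index.html"              # web
```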

Thank you. 
Dima. 



Re: adding new URLs to nutch index

Posted by Dima Gritsenko <di...@ekreative.com>.
Thank you, Vishal.
This part is working well now. Still figuring out why the URLs have not been
properly categorized, though.

Dima.


RE: adding new URLs to nutch index

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Dima,

  Which version of Nutch are you using? From 0.8 onwards, the name of
the urls file has to be urls.txt, and its parent directory has to be passed
to inject. For example, if your urls.txt is in a dir called NewUrls, then
your inject command would be:

bin/nutch inject crawl/crawldb NewUrls

Also, check your crawl-urlfilter.txt to make sure that these new URLs
won't be filtered.

Regards,

-vishal.
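
Putting the advice above into concrete commands, a minimal sketch of the corrected setup (NewUrls is the example directory name from the reply; the inject call is shown commented because it needs a Nutch 0.8 install):

```shell
#!/bin/sh
# Under Nutch 0.8+, the seed file should be named urls.txt, and inject
# takes its parent directory, not the file itself.
mkdir -p NewUrls
cat > NewUrls/urls.txt <<'EOF'
http://www.newsvine.com/_feeds/rss2/index
http://www.technorati.com/blogs/
EOF

# Then inject the directory into the crawldb:
#   bin/nutch inject crawl/crawldb NewUrls
```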
