Posted to user@nutch.apache.org by Fabian López <fa...@syameses.com> on 2007/08/14 14:11:52 UTC

UBUNTU total hits 0

Hi,
after following the tutorial of Nutch 0.8, when I try to search with

bin/nutch org.apache.nutch.searcher.NutchBean apache

I receive "Total Hits:0"

I have followed all the steps:


   1. Create a directory with a flat file of root urls. For example, to
   crawl the nutch site you might start with a file named urls/nutch
   containing the url of just the Nutch home page. All other Nutch pages
   should be reachable from this page. The urls/nutch file would thus
   contain:

   http://lucene.apache.org/nutch/

   2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
   with the name of the domain you wish to crawl. For example, if you
   wished to limit the crawl to the apache.org domain, the line should
   read:

   +^http://([a-z0-9]*\.)*apache.org/

   This will include any url in the domain apache.org.

   3. Edit the file conf/nutch-site.xml, insert at minimum the following
   properties into it and edit in proper values for the properties....
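
A quick way to sanity-check the filter line from step 2 against the seed
URL, outside of Nutch. This is only a rough sketch: it assumes grep -E
treats this particular pattern the same way Nutch's regex filter does, and
the function name url_passes_filter is made up for illustration.

```shell
# Hypothetical helper: succeed if a URL matches the include pattern.
# The leading "+" of the filter line is the include marker in
# crawl-urlfilter.txt, not part of the regex, so it is left out here.
url_passes_filter() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*apache.org/'
}

# Check the seed URL from urls/nutch.
url_passes_filter "http://lucene.apache.org/nutch/" && echo "seed URL accepted"
```

If the seed URL were rejected here, nothing would be fetched and the index
would stay empty, so this is a cheap thing to rule out first.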

Then I executed:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

The only problem I noticed is that during fetching there is a
java.lang.NullPointerException.
My questions are:

1.- Is this the cause of the problem? How can I solve it?
2.- Is this also the reason why I always get an HTTP Status 500 at
http://localhost:8080?
No Context configured to process this request - HTTP Status 500
<http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09150.html>


Thanks a lot
Fabian

Re: UBUNTU total hits 0

Posted by Martin Kuen <ma...@gmail.com>.
Hi Fabian,

sorry, but I can only "reply" with a bunch of questions . . .

On 8/14/07, Fabian López <fa...@syameses.com> wrote:
>
> Hi,
> after following the tutorial of Nutch 0.8, when I try to search with
>
> bin/nutch org.apache.nutch.searcher.NutchBean apache
>
> I receive "Total Hits:0"
>
> I have followed all the steps:
>
>
>    1. Create a directory with a flat file of root urls. For example, to
>    crawl the nutch site you might start with a file named urls/nutch
>    containing the url of just the Nutch home page. All other Nutch pages
>    should be reachable from this page. The urls/nutch file would thus
>    contain:
>
>    http://lucene.apache.org/nutch/
>
>    2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
>    with the name of the domain you wish to crawl. For example, if you
>    wished to limit the crawl to the apache.org domain, the line should
>    read:
>
>    +^http://([a-z0-9]*\.)*apache.org/
>
>    This will include any url in the domain apache.org.
>
>    3. Edit the file conf/nutch-site.xml, insert at minimum the following
>    properties into it and edit in proper values for the properties....
>
> Then I executed:
>
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
> The only problem I noticed is that during fetching there is a
> java.lang.NullPointerException.
> My questions are:
>
> 1.- Is this the cause of the problem? How can I solve it?


Can you be a little more specific about the NPE? What is its
stack trace? Did you have a look at hadoop.log (located in
"path_to_nutch/logs/")? You can probably find a hint there . . .
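
One way to pull the exception and its stack trace out of the log, as a
minimal sketch. The helper name show_npe is made up for illustration, and
the log path in the usage comment is an assumption; adjust it to wherever
hadoop.log lives in your install.

```shell
# Hypothetical helper: print every line mentioning NullPointerException,
# with its line number, plus the lines that follow it (normally the
# stack trace).
# Usage: show_npe <logfile> [context_lines]
show_npe() {
    logfile="$1"
    context="${2:-20}"   # number of lines to show after each match
    grep -n -A "$context" 'NullPointerException' "$logfile"
}

# Example call, run from the Nutch install directory (assumed path):
# show_npe logs/hadoop.log
```

The first "at ..." lines below the exception usually name the class where
it was thrown, which is the part worth posting to the list.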

> 2.- Is this also the reason why I always get an HTTP Status 500 at
> http://localhost:8080?
> No Context configured to process this request - HTTP Status 500
> <http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09150.html>


I don't think so . . . these two errors are not related to each other. The
"crawl" job has no dependency on Tomcat. Did you use the Tomcat package
from the Ubuntu repository? Maybe try things out with a version downloaded
from Apache.
I tried out Nutch on Ubuntu as well (with Tomcat from the Ubuntu repository)
and ran into trouble too . . . , but it was too long ago to remember the
details.

> Thanks a lot
> Fabian
>

hope it helps at least a little bit,

martin