Posted to user@nutch.apache.org by Imam Nur Ramadhany <ra...@yahoo.com> on 2009/01/09 01:38:54 UTC

Null Indexing

I'm new to Nutch. I am trying to deploy nutch-0.9 but still have a problem: when I search, it returns 0 hits. I have configured the crawl folder in the webapp and crawled my localhost (can that be done?).
I use Windows with cygwin, Tomcat6.0, and jdk1.6.0_10. According to crawl.log, no errors occur during the crawl, but it returns null when indexing.

In hadoop.log there are some FATAL and WARN messages like these:

FATAL api.RobotRulesParser - Agent we advertise ('Imam Spider') not listed first in 'http.robots.agents' property!
.
WARN  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default 
.
WARN  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
.
WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Are these related to the problem?

Regards,


Ramadhany

Re: AW: Null Indexing

Posted by MEHALA N <me...@gmail.com>.

Hi,
I am getting the following error while running the crawler with
bin/nutch crawl urls -dir crawl_NEW1 -depth 3 -topN 50


Dedup: adding indexes in: crawl_NEW1/indexes
Exception in thread "main" java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
       at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
       at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Can anyone help me resolve this problem?
-N.Mehala.

Re: AW: Null Indexing

Posted by Cisek <fa...@mailinator.com>.

I had the same little big problem - everything seemed OK:

- bin/nutch org.apache.nutch.searcher.NutchBean <search query> (in my
case the search query was "apache") in cygwin returned 62 total hits on the
crawled "+^http://([a-z0-9]*\.)*apache.org/"

- Nutch in Tomcat webapp after deploy seemed fine (no errors)

- I had NOT created a new xml file named nutch-0.9.xml which contains
<Context path="/nutch-0.9/" debug="5" privileged="true"
docBase="C:\nutch-0.9"/> and NOT put it in
C:\Tomcat6.0\conf\Catalina\localhost like Ramadhany had

- but still got Hits 0-0 (out of about 0 total matching pages): in
Tomcat-Nutch web interface.

... but I have solved it in my case:

- I had forgotten to configure searcher.dir in nutch-site.xml at
C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes, as described under
"Set Your Searcher Directory" in
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

- and now it works fine - Tomcat-Nutch interface returns 62 matching pages
:)
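
For reference, a searcher.dir entry in the webapp's nutch-site.xml could look roughly like the sketch below. The value C:/nutch-0.9/crawl is an assumption made for illustration (the actual path is not given in this thread); it should point at the directory the crawl wrote to, that is, the one passed to -dir.

```xml
<?xml version="1.0"?>
<!-- C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes\nutch-site.xml -->
<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- Assumed path, for illustration only: use your own crawl directory -->
    <value>C:/nutch-0.9/crawl</value>
  </property>
</configuration>
```

Tomcat probably needs a restart (or the webapp a reload) before the new value is picked up.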


-- 
View this message in context: http://www.nabble.com/Null-Indexing-tp21364166p25531221.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: AW: Null Indexing

Posted by Imam Nur Ramadhany <ra...@yahoo.com>.

Hello again everyone,

My detailed configuration is just as described at
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows. I'm new to
Tomcat and Java, so I just followed the instructions.

I extracted the release to C:\nutch-0.9, made a directory
named urls with a file also named urls (without an extension), then added the URLs
to crawl-urlfilter.txt (C:\nutch-0.9\conf\crawl-urlfilter.txt). I have also
crawled a site (http://localhost/). For the
web interface search I uploaded the Nutch WAR file and created a new xml file
named nutch-0.9.xml, which contains <Context path="/nutch-0.9/"
debug="5" privileged="true" docBase="C:\nutch-0.9" />,
and put it in C:\Tomcat6.0\conf\Catalina\localhost. I think that is where
my problem is. Are the path and docBase correct? When I open http://localhost:8080/nutch-0.9/
there is a welcome page, but when I enter a query and click search it does not
return any hits (Hits 0-0 (out of about 0 total matching pages):). I have also
configured searcher.dir in nutch-site.xml at C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes
anyway.

Then, following Koch Martina's suggestion, I tried searching
directly from the command line in cygwin with the command:
bin/nutch org.apache.nutch.searcher.NutchBean <search query>
It works.
I'm still working on nutch-0.9.xml to make the webapp work, trying various
path and docBase values. But it would be helpful if you
have any other suggestions.

Thanks in advance,
Ramadhany



Re: AW: Null Indexing

Posted by Imam Nur Ramadhany <ra...@yahoo.com>.

Thanks for your info, Martina.
It works from the command line but not when using the webapp (localhost:8080/nutch-0.9).
Is it enough to only deploy the WAR file using the Tomcat manager,
or should we add some other file to catalina_home?





AW: Null Indexing

Posted by Koch Martina <Ko...@huberverlag.de>.

Hi Ramadhany,

The warnings and fatal messages you see in the log have nothing to do with getting 0 results when searching.
The fatal message can be eliminated by setting the property "http.robots.agents" in nutch-site.xml to "Imam Spider,*".
The urlnormalizer warn messages just inform you that you have not specified a dedicated urlnormalizer for a certain scope, so the default urlnormalizer is used. If you need more information on this, look at URLNormalizers.java (package org.apache.nutch.net).
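
As a rough sketch, that setting could look like this in nutch-site.xml. It assumes the agent name is 'Imam Spider', as the log message reports, and shows only this one property inside the enclosing configuration element.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: sketch only, agent name taken from the log message -->
<configuration>
  <property>
    <name>http.robots.agents</name>
    <!-- The advertised agent name comes first, followed by the wildcard -->
    <value>Imam Spider,*</value>
  </property>
</configuration>
```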

To narrow down your searching problems, please provide some more details on your configuration.
Did you check the content of your index using Luke (http://www.getopt.org/luke/) to make sure that the pages and content you are expecting in the index are really in there?
Did you try a search directly from the command line in cygwin with the command:
bin/nutch org.apache.nutch.searcher.NutchBean <search query>

Kind regards,
Martina
