Posted to user@nutch.apache.org by Cisek <fa...@mailinator.com> on 2009/09/23 19:14:00 UTC

Re: AW: Null Indexing

I had the same "little big" problem. Everything seemed OK:

- bin/nutch org.apache.nutch.searcher.NutchBean <search query> [in my
case the search query was "apache"] run in cygwin returned 62 total hits on
the crawled "+^http://([a-z0-9]*\.)*apache.org/"

- Nutch in Tomcat webapp after deploy seemed fine (no errors)

- I had NOT created a new xml file named nutch-0.9.xml which contains
<Context path="/nutch-0.9/" debug="5" privileged="true"
docBase="C:\nutch-0.9"/> and NOT put it in
C:\Tomcat6.0\conf\Catalina\localhost like Ramadhany had

- but still got Hits 0-0 (out of about 0 total matching pages): in
Tomcat-Nutch web interface.

... but I have solved it in my case:

- I forgot to configure searcher.dir in nutch-site.xml at
C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes, as described in
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows under "Set Your
Searcher Directory"

- and now it works fine: the Tomcat-Nutch interface returns 62 matching pages
:)
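
For reference, a minimal sketch of the searcher.dir entry in that
nutch-site.xml. The value C:\nutch-0.9\crawl is only an assumption about
where the crawl output lives; it should point at whatever directory the crawl
command wrote to on your machine.

```xml
<!-- goes inside <configuration> in WEB-INF\classes\nutch-site.xml -->
<property>
  <name>searcher.dir</name>
  <!-- assumed path: replace with your own crawl output directory -->
  <value>C:\nutch-0.9\crawl</value>
</property>
```

Tomcat will probably need a restart (or the webapp a reload) before the
change is picked up.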


Imam Nur Ramadhany wrote:
> 
> Hello again everyone,
> 
> My detailed configuration is just what
> http://wiki.apache.org/nutch/GettingNutchRunningWithWindows describes. I'm
> new to Tomcat and Java, so I just followed the instructions.
> 
> I extracted the release to C:\nutch-0.9, made a directory named urls with a
> file also named urls (without extension), then added the URLs to
> crawl-urlfilter.txt (C:\nutch-0.9\conf\crawl-urlfilter.txt). I have also
> crawled a site (http://localhost/). For the web interface search I uploaded
> the Nutch WAR file. I also created a new xml file named nutch-0.9.xml which
> contains <Context path="/nutch-0.9/" debug="5" privileged="true"
> docBase="C:\nutch-0.9" /> and put it in
> C:\Tomcat6.0\conf\Catalina\localhost. I think that is where my problems
> are. Are the path and docBase correct? When I open
> http://localhost:8080/nutch-0.9/ there is a welcome page, but when I enter
> a query and click search, no hits are returned (Hits 0-0 (out of about 0
> total matching pages):). I have also configured searcher.dir in
> nutch-site.xml at C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes anyway.
> 
> Then, following Koch Martina's suggestion, I tried to search directly from
> the command line in cygwin with the command:
> bin/nutch org.apache.nutch.searcher.NutchBean <search query>
> It works.
> I'm still working on nutch-0.9.xml to make the webapp work, trying some
> path and docBase values. But it would be helpful if you have any other
> suggestions.
> 
> Thanks in advance,
> Ramadhany
> 
> 
> 
> ________________________________
> From: Imam Nur Ramadhany <ra...@yahoo.com>
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, January 13, 2009 7:27:21 AM
> Subject: Re: AW: Null Indexing
> 
> Thanks for your info, Martina.
> It works from the command line, but it doesn't when using the webapp
> (localhost:8080/nutch-0.9).
> Is it enough to only deploy the WAR file using the Tomcat manager,
> or should we add some other file to catalina_home?
> 
> 
> 
> 
> 
> ________________________________
> From: Koch Martina <Ko...@huberverlag.de>
> To: "nutch-user@lucene.apache.org" <nu...@lucene.apache.org>
> Sent: Friday, January 9, 2009 2:57:24 PM
> Subject: AW: Null Indexing
> 
> Hi Ramadhany,
> 
> The warnings and fatals you mention seeing in the log have nothing to do
> with getting 0 results when searching.
> The fatal message can be eliminated by setting the property
> "http.robots.agents" in the nutch-site.xml to "Imam Spider,*".
> The urlnormalizer warn messages just inform you that you have not
> specified a dedicated urlnormalizer for a certain scope so that the
> default urlnormalizer is used. If you need more information on this, look
> at URLNormalizers.java (package org.apache.nutch.net).
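> 
> A sketch of how that property might look in nutch-site.xml, using the
> value quoted above. It assumes the file already has a <configuration>
> root element and that 'Imam Spider' is the agent name being advertised:
> 
> ```xml
> <!-- list the advertised agent name first, then the wildcard -->
> <property>
>   <name>http.robots.agents</name>
>   <value>Imam Spider,*</value>
> </property>
> ```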
> 
> To narrow down your searching problems, please provide some more details
> on your configuration.
> Did you check the content of your index using Luke
> (http://www.getopt.org/luke/) to make sure that the pages and content you
> are expecting in the index are really in there?
> Did you try a search directly from the command line in cygwin by the
> command:
> bin/nutch org.apache.nutch.searcher.NutchBean <search query>
> 
> Kind regards,
> Martina
> 
> -----Original Message-----
> From: Imam Nur Ramadhany [mailto:ramadhanyovski@yahoo.com] 
> Sent: 09 January 2009 01:39
> To: nutch-user@lucene.apache.org
> Subject: Null Indexing
> 
> I'm new to Nutch. I am trying to deploy nutch-0.9 but am still having some
> problems. When I try to search, it returns 0 hits. I have configured the
> crawl folder in the webapp and crawled my localhost (can that be done?).
> I use Windows with cygwin, Tomcat6.0, and jdk1.6.0_10. According to
> crawl.log no errors occur during the crawl, but it returns null
> when indexing.
> 
> In hadoop.log there are some FATAL and WARN entries like these:
> 
> FATAL api.RobotRulesParser - Agent we advertise ('Imam Spider') not listed first in 'http.robots.agents' property!
> .
> WARN  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> .
> WARN  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> .
> WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 
> Are they related to this problem?
> 
> Regards,
> 
> Ramadhany
> 
> 


Re: AW: Null Indexing

Posted by MEHALA N <me...@gmail.com>.

Hi,
I am getting the following error while running the crawler with
bin/nutch crawl urls -dir crawl_NEW1 -depth 3 -topN 50


Dedup: adding indexes in: crawl_NEW1/indexes
Exception in thread "main" java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
       at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
       at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Can anyone help me resolve this problem?
-N.Mehala.
