Posted to user@nutch.apache.org by Jefferson <je...@msn.com> on 2011/06/16 16:03:48 UTC

Problem with Nutch Search

Hi,
I'm testing Nutch and followed the Nutch tutorial, but I've run into a
problem. I ran the bin/nutch crawl command on 6 plain-text sites that contain
only about 400 lines of text; so far, so good. When I search with Nutch, it
seems to cover only about the first 50 lines and ignores the rest of the
text. For example, if I search for "church" and that word appears beyond the
first 50 lines of text, it returns 0 results.
Does anyone have a solution for this?

--
View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-Nutch-Search-tp3072077p3072077.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Problem with Nutch Search

Posted by Jefferson <je...@msn.com>.
I use version 1.2 and increased http.content.limit, to no avail.
I analyzed the crawldb and the page content is OK.
In hadoop.log I found the following warnings:
WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments.
Applications should implement Tool for the same.
WARN regex.RegexURLNormalizer - can not find rules for scope 'inject', using
default
WARN util.NativeCodeLoader - Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable

The problem persists. Any suggestions?


Re: Problem with Nutch Search

Posted by lewis john mcgibbney <le...@gmail.com>.
Off the top of my head, one property springs to mind, which you may or may
not have configured in nutch-site.xml:

http.content.limit

However I think that this is not the source of the problem.
I would advise you to have a look at your hadoop log file for any obvious
warnings. How do you know that the search covers only about the first 50
lines of the text? Have you looked at a dump of the crawldb to see what
content the database is aware of?
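
For reference, a minimal sketch of how the http.content.limit property
mentioned above is typically overridden in conf/nutch-site.xml. The value
shown is only an illustrative assumption, not a recommendation from this
thread: the default in nutch-default.xml is 65536 bytes, and a negative value
disables truncation of fetched content. As noted above, this property may not
be the cause of the problem.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default from nutch-default.xml (65536 bytes). -->
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value: a negative limit disables truncation. -->
    <value>-1</value>
  </property>
</configuration>
```

Pages fetched before the change keep their truncated content, so a re-crawl
and re-index would be needed before the new limit shows up in search results.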

Without answers to some of the above, it is hard to separate errors in Nutch
from the legacy architecture of versions before Nutch 1.3.





-- 
*Lewis*