Posted to user@nutch.apache.org by Jefferson <je...@msn.com> on 2011/06/24 16:40:27 UTC

Problem in search

My problem is with search.
I crawled the page http://en.wikipedia.org/wiki/Albert_Einstein
When I open http://localhost:8080/nutch-1.1/
and type <Adolf Hitler>, it returns a result, which is OK.
When I type <phenomena>, it returns 0 results, which is not OK.

My config files and logs are attached.
Thanks

http://lucene.472066.n3.nabble.com/file/n3104461/nutch-site.xml
nutch-site.xml 
http://lucene.472066.n3.nabble.com/file/n3104461/nutch-default.xml
nutch-default.xml 
http://lucene.472066.n3.nabble.com/file/n3104461/hadoop.log hadoop.log 
http://lucene.472066.n3.nabble.com/file/n3104461/crawl.log crawl.log 

--
View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3104461.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Problem in search

Posted by Markus Jelsma <ma...@openindex.io>.
No idea. Perhaps try to dig in the code yourself or try your luck on the 
Lucene ML. 

> Hello,
> Does anyone know how to solve this problem?

Re: Problem in search

Posted by Joey <ma...@gmail.com>.
Hi Jefferson,

If you use the lucene-summary plugin, the variable 
"summary.lucene.fragments.num" defined in nutch-default.xml limits the 
number of "best" fragments displayed in the summary. The default value is 1.
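
As a rough sketch of how that limit might be raised, assuming the property name is exactly as given above and that it can be overridden in conf/nutch-site.xml like other Nutch properties. The value 3 is purely illustrative; check that the property is present in your own nutch-default.xml before relying on it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: property name taken from this message, not verified
       against a particular Nutch release. -->
  <property>
    <name>summary.lucene.fragments.num</name>
    <!-- Illustrative value: allow up to 3 "best" fragments per summary. -->
    <value>3</value>
  </property>
</configuration>
```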

Regards,
Joey

On 06/29/2011 09:17 PM, Jefferson wrote:
> Hello,
> Does anyone know how to solve this problem?



Re: Problem in search

Posted by Jefferson <je...@msn.com>.
Hello,
Does anyone know how to solve this problem?


Re: Problem in search

Posted by Jefferson <je...@msn.com>.
Hi Markus
Is there any limitation in Lucene?
If so, does anyone know how to remove this limitation?

My configuration files are attached in my first post.
Could someone please try?


Re: Problem in search

Posted by Markus Jelsma <ma...@openindex.io>.
I don't know about the old highlighter in Nutch, but there may be a hardcoded 
setting limiting the number of chars the highlighter analyzes. Solr has this; 
I don't know if it's in Lucene as well. 

On Monday 27 June 2011 15:09:29 Jefferson wrote:
> Hi lewis,
> 
> My concern is that Nutch should return the passage that contains the
> searched keywords, just like Google does.
> For the first half of the page, Nutch can return the parts that contain
> the searched keywords.
> But from the middle of the page down, it does not return the part that
> contains the keywords; it returns the beginning of the page text. I can see
> that it finds the word in the text, but it cannot return the snippet of
> text where the word occurs. I need that snippet because I'm doing work for
> college and need Nutch to work the same way Google does.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Problem in search

Posted by Jefferson <je...@msn.com>.
Hi lewis,

My concern is that Nutch should return the passage that contains the
searched keywords, just like Google does.
For the first half of the page, Nutch can return the parts that contain
the searched keywords.
But from the middle of the page down, it does not return the part that
contains the keywords; it returns the beginning of the page text. I can see
that it finds the word in the text, but it cannot return the snippet of text
where the word occurs. I need that snippet because I'm doing work for college
and need Nutch to work the same way Google does.


Re: Problem in search

Posted by lewis john mcgibbney <le...@gmail.com>.
I see within your nutch-site file that you have set an http.content.limit
value of 340,671. Is there any reason for this value? I'm assuming you are
not indexing this page merely so you can search for the term phenomena, and
that there is other textual content within the page that you are interested
in... would this assumption be right?

As Markus explained, the page has an HTTP content length of some 600,000, and
looking at where the first occurrence of the term phenomena is, it is
located roughly halfway through the page.

When crawling large sites such as Wikipedia (which we all know contains
large HTTP content within its webpages), I have found that a safeguard
measure to ensure we get all page content is to set http.content.limit
to a negative value, e.g. -1. This way we are guaranteed to get all page
content. Another useful and widely used tool is Luke [1], which will
enable you to search your Lucene index and confirm whether or not Nutch has
fetched and sent the content you wish to be stored within your index.

[1] http://code.google.com/p/luke/
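
A minimal sketch of the override described above, assuming it is placed in conf/nutch-site.xml (the file whose values take precedence over nutch-default.xml). The property name http.content.limit and the value -1 come from this thread; the comments are paraphrased rather than copied from the Nutch distribution.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the 65536-byte default mentioned in this thread. -->
  <property>
    <name>http.content.limit</name>
    <!-- A negative value means fetched HTTP content is not truncated. -->
    <value>-1</value>
  </property>
</configuration>
```

Because truncation happens when the page is fetched, the page should be re-crawled and re-indexed after this change before text beyond the old limit can be found.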

On Sat, Jun 25, 2011 at 7:42 AM, Jefferson <je...@msn.com> wrote:

> The problem is that it returns the beginning of the text of the page.
> The correct behavior would be to return the passage in which the word
> <phenomena> is found.
> Sorry for my English...
>
>
> Jefferson



-- 
*Lewis*

Re: Problem in search

Posted by Jefferson <je...@msn.com>.
The problem is that it returns the beginning of the text of the page.
The correct behavior would be to return the passage in which the word
<phenomena> is found.
Sorry for my English...


Jefferson


Re: Problem in search

Posted by lewis john mcgibbney <le...@gmail.com>.
Can you expand on this? I am not understanding your description of the
problem.

On Fri, Jun 24, 2011 at 12:52 PM, Jefferson <je...@msn.com> wrote:

> Done.
> Now I have another problem:
> I type <phenomena> and it returns this:
> -
> Albert Einstein - Wikipedia, the free encyclopedia Albert Einstein From
> Wikipedia, the free encyclopedia Jump ...
> -
> What might be happening? Thanks for the help.
>
> My configuration files are below:
>
> http://lucene.472066.n3.nabble.com/file/n3105976/nutch-default.txt
> nutch-default.txt
> http://lucene.472066.n3.nabble.com/file/n3105976/nutch-site.txt
> nutch-site.txt
>



-- 
*Lewis*

Re: Problem in search

Posted by Jefferson <je...@msn.com>.
Done.
Now I have another problem:
I type <phenomena> and it returns this:
-
Albert Einstein - Wikipedia, the free encyclopedia Albert Einstein From
Wikipedia, the free encyclopedia Jump ...
-
What might be happening? Thanks for the help.

My configuration files are below:

http://lucene.472066.n3.nabble.com/file/n3105976/nutch-default.txt
nutch-default.txt 
http://lucene.472066.n3.nabble.com/file/n3105976/nutch-site.txt
nutch-site.txt 



Re: Problem in search

Posted by Markus Jelsma <ma...@openindex.io>.
That might be it. The page says Content-Length: 340671

> Hi Jefferson,
> 
> I cannot access either your nutch-site or nutch-default, but I see in your
> log that your http.content.limit is: INFO http.Http - http.content.limit = 65536
> 
> It is a fairly large page, so maybe this is the cause. I'm sorry, I
> don't have access to my Linux worktop so I can't test it myself. Can you
> please advise whether this has been accounted for in your nutch-site?
> Anything over the default 65536 limit is truncated, so you may not be able
> to search for it.
> 
> Further to this, it seems that the hadoop.log does not show any erratic
> behaviour.
> 
> On Fri, Jun 24, 2011 at 7:40 AM, Jefferson <je...@msn.com> wrote:
> > My problem is with search.
> > I crawled the page http://en.wikipedia.org/wiki/Albert_Einstein
> > When I open http://localhost:8080/nutch-1.1/
> > and type <Adolf Hitler>, it returns a result, which is OK.
> > When I type <phenomena>, it returns 0 results, which is not OK.
> > 
> > My config files and logs are attached.
> > Thanks
> > 
> > http://lucene.472066.n3.nabble.com/file/n3104461/nutch-site.xml
> > nutch-site.xml
> > http://lucene.472066.n3.nabble.com/file/n3104461/nutch-default.xml
> > nutch-default.xml
> > http://lucene.472066.n3.nabble.com/file/n3104461/hadoop.log hadoop.log
> > http://lucene.472066.n3.nabble.com/file/n3104461/crawl.log crawl.log

Re: Problem in search

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Jefferson,

I cannot access either your nutch-site or nutch-default, but I see in your
log that your http.content.limit is: INFO http.Http - http.content.limit = 65536

It is a fairly large page, so maybe this is the cause. I'm sorry, I don't
have access to my Linux worktop so I can't test it myself. Can you please
advise whether this has been accounted for in your nutch-site? Anything over
the default 65536 limit is truncated, so you may not be able to search for it.

Further to this, it seems that the hadoop.log does not show any erratic
behaviour.

On Fri, Jun 24, 2011 at 7:40 AM, Jefferson <je...@msn.com> wrote:

> My problem is with search.
> I crawled the page http://en.wikipedia.org/wiki/Albert_Einstein
> When I open http://localhost:8080/nutch-1.1/
> and type <Adolf Hitler>, it returns a result, which is OK.
> When I type <phenomena>, it returns 0 results, which is not OK.
>
> My config files and logs are attached.
> Thanks
>
> http://lucene.472066.n3.nabble.com/file/n3104461/nutch-site.xml
> nutch-site.xml
> http://lucene.472066.n3.nabble.com/file/n3104461/nutch-default.xml
> nutch-default.xml
> http://lucene.472066.n3.nabble.com/file/n3104461/hadoop.log hadoop.log
> http://lucene.472066.n3.nabble.com/file/n3104461/crawl.log crawl.log



-- 
*Lewis*