Posted to java-user@lucene.apache.org by Pierre VANNIER <th...@gmail.com> on 2005/02/15 09:39:48 UTC

Lucene "cuts" the search results ?

Hi all,

I'm quite new to Lucene, but I bought "Lucene in Action" and I'm
trying to customize a few examples taken from it.

I have this sample JSP code (bad JSP, because I'm also a JSP newbie :-)).

Here's the code:
------------------------------------------------------------------------------------------------

.....html head body ....
<%
long start = new Date().getTime();
Iterator myIterator = vIndexDir.iterator();

while(myIterator.hasNext())
{
     // Read the index path once: calling next() a second time further down
     // would skip an index and can throw NoSuchElementException.
     String indexDir = (String) myIterator.next();
     IndexSearcher searcher = new IndexSearcher(indexDir);
     Query query = new TermQuery(new Term("introduction", queryString));
     Hits hits = searcher.search(query);
     QueryScorer  scorer = new QueryScorer(query);
     Highlighter highlighter = new Highlighter(scorer);
     %>
<table width="70%" cellpadding="2" cellspacing="2">
     <%
      out.println("<tr><td><hr><br/>NUMBER OF MATCHING NEWS FOR \"" + indexDir
          + "\" -->" + hits.length() + "    </td></tr>");
     for (int i = 0; i < hits.length(); i++)
     {
         String introduction = hits.doc(i).get("introduction");
         TokenStream stream = new SimpleAnalyzer().tokenStream("introduction",
                 new StringReader(introduction));
         String fragment = highlighter.getBestFragment(stream, introduction);
         String pubDate = hits.doc(i).get("pubDate").substring(0,
                 hits.doc(i).get("pubDate").length() - 13);
         String link = hits.doc(i).get("link");
         float score =  hits.score(i);
         String title = hits.doc(i).get("title");
         %>
         <tr>
              <td>
              Scoring : <b><%=score%></b><br/>
              <%=pubDate +
              " <a href=\"#\"  onClick=\"window.open('" +
              link + "', 'news', 'width=760,height=600')\">" +
              title +
              "</a>"
              %>
              <br/>
              <%= fragment%>
              <br/><br/>
              </td>
              </tr>
     <%}%>
         </table>
<%
    searcher.close(); // release this index's files before moving to the next one
    }
long end = new Date().getTime();
long interval  = end - start;
%>
<br><br><div align="right"><b>System time for query : <%= interval%> 
milliseconds</b></div>

</body>
</html>

-------------------------------------------------------------------------------

The output is all right, but at the end of the result page the last
"hit" is cut off. For example:

Scoring : 0.9210043
Fri, 28 Jan 2005

---------------------------------------------

I'm running all this in Tomcat 5.0.28 with the latest nightly build of
Lucene.

Could it be a caching problem? Could this come from JSP or from Lucene?

Thanks, and I apologise for my poor English ;-)


Pierre VANNIER


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Lucene "cuts" the search results ?

Posted by Doug Cutting <cu...@apache.org>.
markharw00d wrote:
> The highlighter uses a number of "pluggable" services, one of which is the
> choice of "Fragmenter" implementation. This interface is for classes which
> decide the boundaries where to cut the original text into snippets. The
> default implementation used simply breaks up text into evenly sized chunks.
> A more intelligent implementation could be made to detect sentence
> boundaries.

Also note that paragraph boundaries alone would help a lot and are 
easier to reliably detect.
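
As a rough illustration of the paragraph idea, here is a small sketch that treats one or more blank lines as a paragraph boundary and returns the paragraph holding a search term. It works on plain strings only and does not use the Lucene highlighter; the class and method names (ParagraphSplit, paragraphs, paragraphContaining) are invented for this example.

```java
// Illustration only: plain-string paragraph detection, not Lucene code.
final class ParagraphSplit {

    // Splits text wherever two or more line breaks follow each other,
    // optionally with spaces or tabs on the otherwise empty lines.
    static String[] paragraphs(String text) {
        return text.split("(?:\\r?\\n[ \\t]*){2,}");
    }

    // Returns the first paragraph that contains the term, or null if none does.
    static String paragraphContaining(String text, String term) {
        for (String paragraph : paragraphs(text)) {
            if (paragraph.contains(term)) {
                return paragraph;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "Line one.\nLine two.\n\nSecond mentions Lucene.\n\n\nThird.";
        System.out.println(paragraphContaining(text, "Lucene"));
    }
}
```

A single line break stays inside a paragraph here, so only blank lines count as boundaries; whether that matches the indexed text depends on how the documents were written.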

Doug



Re: Lucene "cuts" the search results ?

Posted by markharw00d <ma...@yahoo.co.uk>.
Hi Pierre,
Here's the response I gave the last time this question was raised:

The highlighter uses a number of "pluggable" services, one of which is the
choice of "Fragmenter" implementation. This interface is for classes which
decide the boundaries where to cut the original text into snippets. The
default implementation used simply breaks up text into evenly sized chunks.
A more intelligent implementation could be made to detect sentence
boundaries.

What you are asking for requires that the Fragmenter know where the
upcoming query matches are and decide on fragment boundaries with this in
mind. To have this foresight would require a preliminary pass over the
TokenStream to identify the match points before calling the highlighter.

This Fragmenter implementation does not exist, but it does not sound
unachievable. I would suggest that some knowledge of sentence boundaries
would probably help here too. I don't have any plans to write such a
Fragmenter now, but this is how it could be done.
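
To make the two-pass idea concrete, here is a minimal sketch on plain strings: the first pass locates the match, the second widens it to the enclosing sentence. This is not a Lucene Fragmenter and does not use the TokenStream; the names (MatchAwareFragment, firstMatch, sentenceAround) are hypothetical, and the sentence detection is deliberately naive (it treats every '.', '!' or '?' as a sentence end, so abbreviations will fool it).

```java
// Illustration only: a match-aware fragment picker on plain strings.
// It assumes a case-sensitive, exact-substring match.
final class MatchAwareFragment {

    // Pass 1: offset of the first occurrence of the term, or -1 if absent.
    static int firstMatch(String text, String term) {
        return text.indexOf(term);
    }

    // Naive sentence-end test (an assumption of this sketch).
    static boolean isTerminator(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    // Pass 2: widen the match to the sentence that surrounds it.
    static String sentenceAround(String text, String term) {
        int match = firstMatch(text, term);
        if (match < 0) {
            return null;
        }
        // Walk back to just after the previous sentence terminator.
        int start = match;
        while (start > 0 && !isTerminator(text.charAt(start - 1))) {
            start--;
        }
        // Walk forward from the end of the match to the next terminator.
        int end = match + term.length();
        while (end < text.length() && !isTerminator(text.charAt(end))) {
            end++;
        }
        if (end < text.length()) {
            end++; // keep the terminator itself
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String text = "The committee met on Friday. Lucene was discussed at length. "
                + "Nobody objected.";
        System.out.println(sentenceAround(text, "discussed"));
    }
}
```

Because the fragment boundaries are chosen after the match position is known, the match always comes with its own sentence as context, at the cost of fragments whose length varies with the text.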

Hope this helps,
Cheers,
Mark





Re: Lucene "cuts" the search results ?

Posted by Pierre VANNIER <th...@gmail.com>.
Thanks for the reply, Daniel.

But is there anything I can do to prevent this from happening?

Regards

Daniel Naber wrote:

> On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:
>
>>         String fragment = highlighter.getBestFragment(stream,
>> introduction);
>
> The highlighter breaks up text into same-size chunks (100 characters by
> default). If the matching term now appears just at the end or at the start
> of such a chunk you'll get no context and it looks as if text was cut off.
>
> Regards
>  Daniel




Re: Lucene "cuts" the search results ?

Posted by Daniel Naber <da...@t-online.de>.
On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:

>          String fragment = highlighter.getBestFragment(stream,
> introduction);

The highlighter breaks up text into same-size chunks (100 characters by
default). If the matching term happens to fall right at the start or the end
of such a chunk, you get no surrounding context and it looks as if the text
was cut off.
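
The effect is easy to reproduce outside Lucene. The sketch below is a simplification on plain strings, not the highlighter's actual code (the class FixedSizeChunks and its methods are made up for the illustration, and the chunk size of 29 is chosen only to force the match onto a chunk boundary):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: fixed-size chunking of a plain string.
final class FixedSizeChunks {

    // Splits text into consecutive chunks of at most `size` characters.
    static List<String> split(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += size) {
            chunks.add(text.substring(start, Math.min(text.length(), start + size)));
        }
        return chunks;
    }

    // Returns the first chunk that contains the term, or null if none does.
    static String chunkContaining(String text, String term, int size) {
        for (String chunk : split(text, size)) {
            if (chunk.contains(term)) {
                return chunk;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "The committee met on Friday. Lucene was discussed at length.";
        // The match lands at the very start of the second chunk, so the
        // fragment has no leading context and stops in the middle of a word.
        System.out.println(chunkContaining(text, "Lucene", 29));
        // prints "Lucene was discussed at lengt"
    }
}
```

In the highlighter itself, one thing worth trying is a larger fragment size, for example highlighter.setTextFragmenter(new SimpleFragmenter(200)), assuming your build of the highlighter offers that setter and constructor. It makes a match at a chunk edge less likely but does not rule it out.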

Regards
 Daniel
