Posted to java-user@lucene.apache.org by Stelios Eliakis <el...@gmail.com> on 2006/09/23 11:39:27 UTC

highlighting

Hi,
I'm new to Lucene and I'm interested in highlighting.
I want to extract the best fragment (passage) from a text file.
When I use the following code I get the first fragment that contains my
query. Nevertheless, the JavaDoc says that the function getBestFragment
returns the best fragment. Am I doing something wrong?

    QueryScorer scorer = new QueryScorer(query);

    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class=\"highlight\">", "</span>");

    Highlighter highlighter = new Highlighter(formatter, scorer);

    Fragmenter fragmenter = new SimpleFragmenter(50);

    QueryScorer fragmentScore=(QueryScorer) highlighter.getFragmentScorer();

    TokenStream tokenStream = new StandardAnalyzer().tokenStream("contents", new StringReader(text));

    String result = highlighter.getBestFragment(tokenStream,text);

    System.out.println(result);


Thanks in advance

-- 
Stelios Eliakis

Re: searching in social networks

Posted by mark harwood <ma...@yahoo.co.uk>.
Finding the connected elements which make up the neighbourhood is just straightforward lookups of connected IDs on the graph. This can be done using either a database or Lucene - your choice, although I suspect the database is the better choice given the structured nature of the data and any potential volatility in connections. Once you have the IDs which define "the neighbourhood" of nodes you want to search, these IDs can be built into a Lucene filter very fast (see org.apache.lucene.search.TermsFilter.addTerm in the "contrib\queries" section).
Using this class I've found Lucene capable of searching neighbourhoods of thousands of nodes very quickly.

The biggest problem you are likely to face is shortlisting the nodes you want to search when "3 degrees of connectivity" leads you at step 1 or 2 to a highly connected node, exploding the list of IDs under consideration.
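The shortlisting step can be sketched in plain Java. This is only an illustration under stated assumptions: the graph is held in memory as an adjacency map (in practice it would more likely be looked up in a database), and the class name, method name and the maxIds cap are all hypothetical. The cap is one possible guard against the highly connected node problem described above. The returned IDs are what would then be fed to TermsFilter.addTerm; no Lucene calls are made here.

```java
import java.util.*;

public class Neighbourhood {
    /**
     * Breadth-first walk that collects every node ID reachable from start
     * in at most maxDegrees hops. Returns early once maxIds IDs have been
     * collected, so one highly connected node cannot explode the result.
     */
    public static Set<String> within(Map<String, List<String>> graph,
                                     String start, int maxDegrees, int maxIds) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        List<String> frontier = Collections.singletonList(start);
        for (int depth = 0; depth < maxDegrees && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String id : frontier) {
                for (String neighbour : graph.getOrDefault(id, Collections.emptyList())) {
                    if (seen.size() >= maxIds) {
                        return seen; // cap reached, stop expanding
                    }
                    if (seen.add(neighbour)) {
                        next.add(neighbour); // first visit: expand it in the next round
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: a -> b, c; b -> d; d -> e; e -> f
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("d"));
        graph.put("d", Arrays.asList("e"));
        graph.put("e", Arrays.asList("f"));
        System.out.println(within(graph, "a", 3, 1000)); // [a, b, c, d, e]
    }
}
```

Node "f" is left out because it is four hops from "a". For an undirected network, each connection would need to appear in the adjacency lists of both endpoints.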

Cheers
Mark


----- Original Message ----
From: Sharad Agarwal <sh...@aol.com>
To: java-user@lucene.apache.org
Sent: Monday, 25 September, 2006 10:50:03 AM
Subject: searching in social networks

I am using Lucene for simple flat searches. Now I have a requirement to
do searches based on an object's connectivity with other objects, the
way searches are done in "social networks". Let's say I want to
run a query against only those objects which are within 3 degrees of
connectivity to a given object.

Has anybody tried this kind of feature with Lucene? Any pointers will
be appreciated.

thanks
sharad



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org







searching in social networks

Posted by Sharad Agarwal <sh...@aol.com>.
I am using Lucene for simple flat searches. Now I have a requirement to
do searches based on an object's connectivity with other objects, the
way searches are done in "social networks". Let's say I want to
run a query against only those objects which are within 3 degrees of
connectivity to a given object.

Has anybody tried this kind of feature with Lucene? Any pointers will
be appreciated.

thanks
sharad





Re: highlighting

Posted by govind bhardwaj <go...@gmail.com>.
Hi Sabeer,

I used Lucene 3.3.0 to test your code. (I doubt that Lucene 4.0 has been
released, as version 3.3.0 was only released recently, in July.)

In the second case there is no output because the match is exact: sourceText
contains "transplantation", not "transplant". One could try modifying the
query to "transplant*" as I did, but I got an error like this:

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/lucene/index/memory/MemoryIndex
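A NoClassDefFoundError generally means the named class was not found on the runtime classpath. As far as I know, in the 3.x distributions MemoryIndex ships in the separate contrib "memory" jar rather than in the core or highlighter jar, so adding that jar is the likely fix; treat that as an assumption to verify against your own download. A small, Lucene-independent check such as the following (class and method names are made up for illustration) can confirm whether the class is visible before running the highlighter:

```java
public class ClasspathCheck {
    /** Returns true if the named class can be found by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // 'false' = locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.lucene.index.memory.MemoryIndex";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```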

Also, regarding highlighting and regular expressions, I found this bug (I'm
not sure whether it relates exactly to the problem you've asked about):
http://exist.2174344.n4.nabble.com/exist-Bugs-3038780-match-highlighting-for-lucene-wildcard-and-regex-search-td2317647.html
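Separately from that bug report, the exact-match behaviour described earlier can be illustrated without Lucene. The sketch below is a toy model, not the Highlighter's implementation: it lower-cases the text and splits it on non-letters (roughly what a letter-based analyzer would produce), then marks a token only when it equals the query term, or starts with it when the term ends in "*". All names are hypothetical. Unlike getBestFragments, it returns the unmarked text when nothing matches.

```java
import java.util.*;

public class TermMatch {
    /** Lower-cases the text and splits it on anything that is not a letter a-z. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Wraps matching tokens in '*'. A trailing '*' on queryTerm switches
     * from exact token equality to prefix matching.
     */
    public static String highlight(String text, String queryTerm) {
        boolean prefix = queryTerm.endsWith("*");
        String term = prefix ? queryTerm.substring(0, queryTerm.length() - 1) : queryTerm;
        StringBuilder out = new StringBuilder();
        for (String token : tokenize(text)) {
            boolean hit = prefix ? token.startsWith(term) : token.equals(term);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(hit ? "*" + token + "*" : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "for liver transplantation.";
        System.out.println(highlight(text, "transplant"));  // for liver transplantation
        System.out.println(highlight(text, "transplant*")); // for liver *transplantation*
    }
}
```

The token "transplantation" is never equal to the term "transplant", which is why the exact query marks nothing in the second text.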

Pretty much helpless after this :(

Govind

On Mon, Jul 18, 2011 at 4:50 PM, Sabeer Hussain <sh...@del.aithent.com> wrote:

> I am using Lucene 4.0 and trying to use its highlighting feature. I am not
> getting the desired result due to some mistake that I am not able to
> identify. My source code looks like
>
> String sourceText  = "liver disease kidney transplant";
> String termString ="\"transplant\"";
>
> SimpleAnalyzer simpleAnalyzer = new SimpleAnalyzer(Version.LUCENE_40);
> Query query = new QueryParser(Version.LUCENE_40,"contents",
> simpleAnalyzer).parse(termString);
>
> TokenStream tokenStream = simpleAnalyzer.tokenStream("contents", new
> StringReader(sourceText));
> QueryScorer scorer = new QueryScorer(query,"contents");
> scorer.setExpandMultiTermQuery(true);
> Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>
> SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter( "*",
> "*") ;
> Highlighter highlighter = new Highlighter(simpleHTMLFormatter, scorer );
> highlighter.setTextFragmenter(fragmenter);
> highlighter.setMaxDocCharsToAnalyze(10000);
> String resultString =
> highlighter.getBestFragments(tokenStream,sourceText,1000, "...");
> System.out.println("Source Text1 = "+sourceText);
> System.out.println("Result Text1 = "+resultString);
>
> sourceText = "for liver transplantation.";
> tokenStream = simpleAnalyzer.tokenStream("contents", new
> StringReader(sourceText));
> resultString = highlighter.getBestFragments(tokenStream,sourceText,1000,
> "...");
>
> System.out.println("Source Text2 = "+sourceText);
> System.out.println("Result Text2 = "+resultString);
>
> For the first text, I am getting the result properly but not for the second
> one
>
> Source Text1 = liver disease kidney transplant
> Result Text1 = liver disease kidney *transplant*
>
> Source Text2 = for liver transplantation.
> Result Text2 =
>
> For the second one I am expecting a result like
> for liver *transplant*ation
>
> or
> for liver *transplantation*
>
> What is wrong with my code?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/highlighting-tp542569p3178841.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>


-- 
No trees were harmed in the creation of this message, but several thousand
electrons were mildly inconvenienced.

Re: highlighting

Posted by Sabeer Hussain <sh...@del.aithent.com>.
I am using Lucene 4.0 and trying to use its highlighting feature. I am not
getting the desired result due to some mistake that I am not able to
identify. My source code looks like 

String sourceText  = "liver disease kidney transplant";
String termString ="\"transplant\"";
			
SimpleAnalyzer simpleAnalyzer = new SimpleAnalyzer(Version.LUCENE_40);
Query query = new QueryParser(Version.LUCENE_40,"contents",
simpleAnalyzer).parse(termString);

TokenStream tokenStream = simpleAnalyzer.tokenStream("contents", new
StringReader(sourceText));
QueryScorer scorer = new QueryScorer(query,"contents");
scorer.setExpandMultiTermQuery(true);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);

SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter( "*", 
"*") ;
Highlighter highlighter = new Highlighter(simpleHTMLFormatter, scorer );
highlighter.setTextFragmenter(fragmenter);
highlighter.setMaxDocCharsToAnalyze(10000);
String resultString =
highlighter.getBestFragments(tokenStream,sourceText,1000, "...");
System.out.println("Source Text1 = "+sourceText);
System.out.println("Result Text1 = "+resultString);
			
sourceText = "for liver transplantation.";
tokenStream = simpleAnalyzer.tokenStream("contents", new
StringReader(sourceText));
resultString = highlighter.getBestFragments(tokenStream,sourceText,1000,
"...");

System.out.println("Source Text2 = "+sourceText);
System.out.println("Result Text2 = "+resultString);
			
For the first text, I am getting the result properly but not for the second
one

Source Text1 = liver disease kidney transplant
Result Text1 = liver disease kidney *transplant*
			
Source Text2 = for liver transplantation.
Result Text2 = 

For the second one I am expecting a result like
for liver *transplant*ation

or
for liver *transplantation*

What is wrong with my code?



--
View this message in context: http://lucene.472066.n3.nabble.com/highlighting-tp542569p3178841.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: highlighting

Posted by Doron Cohen <DO...@il.ibm.com>.
See below...

"Stelios Eliakis" <el...@gmail.com> wrote on 25/09/2006 15:48:10:
> You are right!
> 1) As far as Example 1 is concerned, I don't want these 2 fragments to have
> the same score. Do you know how I could do this?

This behavior is not configurable, as far as I can understand, at least not
without changing the code of the QueryScorer (that you are using). I will
raise this question in the developers mailing list.

>
> 2) Furthermore, if I try to get the fragment score:
>
> Scorer fragmentScore = highlighter.getFragmentScorer();
> float fragmentScoreFloat = fragmentScore.getFragmentScore();
>
> I get 0.0. Why?
>

The QueryScorer maintains the score of the 'currently handled fragment' as
it processes text fragments serially. Each time a new fragment starts, the
score maintained for it by the scorer is reset to zero. So it really
depends on how and when you call this API. If you invoke it after the call to
getBestFragments*(), it reflects the last processed fragment, so it
could be 0 or not. This is consistent with the javadoc:
  /** Called when the highlighter has no more tokens for
   * the current fragment - the scorer returns
   * the weighting it has derived for the most
   * recent fragment, typically based on the tokens
   * passed to getTokenScore().
   **/
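To make the reset behaviour concrete, here is a toy model in plain Java. It is not Lucene's QueryScorer: the class ToyScorer, its methods startFragment and addToken, and the rule of adding 1 per matching token are all assumptions chosen only to show why the value read after processing depends on the last fragment.

```java
import java.util.*;

public class FragmentScoreDemo {
    /** Toy stand-in for a fragment scorer. This is NOT Lucene's QueryScorer. */
    public static class ToyScorer {
        private final Set<String> queryTerms;
        private float current;

        public ToyScorer(Set<String> queryTerms) {
            this.queryTerms = queryTerms;
        }

        /** Called at the start of every fragment: the running score is reset. */
        public void startFragment() {
            current = 0f;
        }

        /** Adds 1 to the running score when the token is a query term. */
        public void addToken(String token) {
            if (queryTerms.contains(token)) {
                current += 1f;
            }
        }

        /** Score of the fragment processed most recently. */
        public float getFragmentScore() {
            return current;
        }
    }

    /** Feeds fragments to the scorer in order, recording each score as its fragment ends. */
    public static float[] scoreAll(ToyScorer scorer, List<String> fragments) {
        float[] scores = new float[fragments.size()];
        for (int i = 0; i < fragments.size(); i++) {
            scorer.startFragment();
            for (String token : fragments.get(i).split("\\s+")) {
                scorer.addToken(token);
            }
            scores[i] = scorer.getFragmentScore();
        }
        return scores;
    }

    public static void main(String[] args) {
        ToyScorer scorer = new ToyScorer(Collections.singleton("xy"));
        float[] scores = scoreAll(scorer, Arrays.asList("aa xy bb", "cc dd ee"));
        System.out.println(Arrays.toString(scores));   // [1.0, 0.0]
        System.out.println(scorer.getFragmentScore()); // 0.0, the last fragment only
    }
}
```

The first fragment scored 1.0, yet reading the scorer after the loop gives 0.0 because the last fragment had no match. Per-fragment scores have to be captured as each fragment ends.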

> 3) Moreover, for some docs Lucene doesn't return any fragment even though the
> query exists in the document. Why? :)

I can't see how this happens... Do you have a sample - doc text and query -
that demonstrates this behavior?

>
> Thanks in advance
> Stelios Eliakis
>
>
> On 9/26/06, Doron Cohen <DO...@il.ibm.com> wrote:
> >
> >
> > "Stelios Eliakis" <el...@gmail.com> wrote on 23/09/2006 02:39:27:
> > > I want to extract the best fragment (passage) from a text file.
> > > When I use the following code I get the first fragment that contains my
> > > query. Nevertheless, the JavaDoc says that the function getBestFragment
> > > returns the best fragment. Am I doing something wrong?
> >
> > That code seems fine to me.
> >
> > A possible explanation (which I think might be the case here, but I'm not
> > sure) is that getBestFragment*() only accumulates fragment scores for
> > matches of "unique terms" in the fragment.
> >
> > Example 1: query = "xy", and the term "xy" appears once in an early
> > fragment but 3 times in a later fragment. In this case both fragments
> > would be scored equally, and hence the early fragment would be selected
> > "best" just because of how the sorting works.
> >
> > Example 2: query = "xy zw", and the early fragment contains "xy" but a
> > later fragment contains both "xy" and "zw". In this case the later
> > fragment would be selected "best".
> >
> > Does this explain what you see in your program?
> >
> >
> >
> >
>
>
> --
> Stelios Eliakis




Re: highlighting

Posted by Stelios Eliakis <el...@gmail.com>.
You are right!
1) As far as Example 1 is concerned, I don't want these 2 fragments to have
the same score. Do you know how I could do this?

2) Furthermore, if I try to get the fragment score:

Scorer fragmentScore = highlighter.getFragmentScorer();
float fragmentScoreFloat = fragmentScore.getFragmentScore();

I get 0.0. Why?

3) Moreover, for some docs Lucene doesn't return any fragment even though the
query exists in the document. Why? :)

Thanks in advance
Stelios Eliakis


On 9/26/06, Doron Cohen <DO...@il.ibm.com> wrote:
>
>
> "Stelios Eliakis" <el...@gmail.com> wrote on 23/09/2006 02:39:27:
> > I want to extract the best fragment (passage) from a text file.
> > When I use the following code I get the first fragment that contains my
> > query. Nevertheless, the JavaDoc says that the function getBestFragment
> > returns the best fragment. Am I doing something wrong?
>
> That code seems fine to me.
>
> A possible explanation (which I think might be the case here, but I'm not
> sure) is that getBestFragment*() only accumulates fragment scores for
> matches of "unique terms" in the fragment.
>
> Example 1: query = "xy", and the term "xy" appears once in an early
> fragment but 3 times in a later fragment. In this case both fragments
> would be scored equally, and hence the early fragment would be selected
> "best" just because of how the sorting works.
>
> Example 2: query = "xy zw", and the early fragment contains "xy" but a
> later fragment contains both "xy" and "zw". In this case the later
> fragment would be selected "best".
>
> Does this explain what you see in your program?
>
>
>
>


-- 
Stelios Eliakis

Re: highlighting

Posted by Doron Cohen <DO...@il.ibm.com>.
"Stelios Eliakis" <el...@gmail.com> wrote on 23/09/2006 02:39:27:
> I want to extract the best fragment (passage) from a text file.
> When I use the following code I get the first fragment that contains my
> query. Nevertheless, the JavaDoc says that the function getBestFragment
> returns the best fragment. Am I doing something wrong?

That code seems fine to me.

A possible explanation (which I think might be the case here, but I'm not sure)
is that getBestFragment*() only accumulates fragment scores for matches of
"unique terms" in the fragment.

Example 1: query = "xy", and the term "xy" appears once in an early
fragment but 3 times in a later fragment. In this case both fragments would
be scored equally, and hence the early fragment would be selected "best"
just because of how the sorting works.

Example 2: query = "xy zw", and the early fragment contains "xy" but a
later fragment contains both "xy" and "zw". In this case the later fragment
would be selected "best".
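The two examples can be reproduced with a small toy model in plain Java. It only mirrors the behaviour as described in this message (score = number of distinct query terms found in the fragment, earliest fragment wins a tie); it is not taken from Lucene's source, and the class and method names are hypothetical.

```java
import java.util.*;

public class UniqueTermScoring {
    /** Score = number of distinct query terms present in the fragment. */
    public static int score(String fragment, Set<String> queryTerms) {
        Set<String> found = new HashSet<>();
        for (String token : fragment.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (queryTerms.contains(token)) {
                found.add(token); // repeated occurrences add nothing
            }
        }
        return found.size();
    }

    /** Index of the best fragment, or -1 if none matches. Strict '>' keeps the earliest on ties. */
    public static int bestIndex(List<String> fragments, Set<String> queryTerms) {
        int best = -1;
        int bestScore = 0;
        for (int i = 0; i < fragments.size(); i++) {
            int s = score(fragments.get(i), queryTerms);
            if (s > bestScore) {
                best = i;
                bestScore = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Example 1: one occurrence early, three occurrences later; both score 1
        System.out.println(bestIndex(
                Arrays.asList("aa xy bb", "xy cc xy xy"),
                new HashSet<>(Arrays.asList("xy"))));        // 0
        // Example 2: the later fragment contains both query terms
        System.out.println(bestIndex(
                Arrays.asList("aa xy bb", "xy zw cc"),
                new HashSet<>(Arrays.asList("xy", "zw"))));  // 1
    }
}
```

Counting every occurrence instead (a plain counter in place of the set) would let the later fragment win in Example 1, which is the behaviour asked for elsewhere in this thread.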

Does this explain what you see in your program?

