Posted to java-user@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2009/05/21 21:09:11 UTC

Re: Phrase Highlighting

On Thu, Apr 30, 2009 at 5:16 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Thu, Apr 30, 2009 at 12:15 AM, Max Lynch <ih...@gmail.com> wrote:
> > You should switch to the SpanScorer (in o.a.l.search.highlighter).
> >> That fragment scorer should only match true phrase matches.
> >>
> >> Mike
> >>
> >
> > Thanks Mike.  I gave it a try and it wasn't working how I expected.  I am
> > using pylucene right now so I can ask them if the implementation is
> > different.  I'm messing around with the lucene unit tests to see exactly
> how
> > the scorer should work.
>
> Can you give more details on what's not working right?


Sorry, the following code is in Python, but I can hack a Java version together
if necessary. HighlighterSpanScorer is the SpanScorer from the highlight
package, just renamed to avoid a conflict with the other SpanScorer class.

Here is what happens if I use a SpanScorer instead, allocated like so:

            analyzer = StandardAnalyzer([])
            tokenStream = analyzer.tokenStream("contents",
                lucene.StringReader(text))
            ctokenStream = lucene.CachingTokenFilter(tokenStream)
            highlighter = lucene.Highlighter(formatter,
                lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
            ctokenStream.reset()

            result = highlighter.getBestFragments(ctokenStream, text,
                    2, "...")

My highlighter is still breaking up words inside of a span.  For example,
if I search for "John Smith", instead of the highlighter being called once
for the whole "John Smith", it gets called for "John" and then for "Smith".



>
> > In the mean time, If I am interested in finding out exactly how many
> times a
> > term was found in a document, what is the best way to go about this?  The
> > way I am doing it right now is using a highlighter and just incrementing
> > counters when a word is found that I'm interested.  I just came across
> > FieldSortedTermVectorMapper that could do something similar.  Is
> > FieldSortedTermVectorMapper something I could use for this?  Is there a
> > better option?
>
>
> Is it really just single terms you need to measure?  (eg, not "how
> many times did phrase XYZ occur in the doc").  If so, then getting the
> term vectors and locating your term in there, should work.  This is
> probably OK if you just do it for each of the hits on the page (like
> 10 hits), but will be way too slow if you try to do it for say all
> docs that matched the query.
>

I see how the term vector might be used.  What I can't tell is whether there
is a way for me to do a span check on the words as easily as the highlighter
does.
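For what it's worth, here is a toy sketch (plain Python, no Lucene; the function name and inputs are made up for illustration) of the kind of position check a term-vector approach would need in order to count a two-word phrase, assuming you have already pulled each term's position list out of the term vector:

```python
def count_phrase(first_positions, second_positions):
    # Count occurrences of the two-word phrase: every position of the
    # second term that immediately follows a position of the first term.
    first = set(first_positions)
    return sum(1 for p in second_positions if p - 1 in first)

# "john smith met john smith jones": john at 0, 3; smith at 1, 4
print(count_phrase([0, 3], [1, 4]))  # -> 2
```

The same adjacency walk is essentially what a span query does for you inside Lucene, which is why it gets harder to replicate by hand for longer phrases or slop.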


-max

Re: Phrase Highlighting

Posted by Max Lynch <ih...@gmail.com>.
On Wed, Jun 3, 2009 at 7:34 PM, Mark Miller <ma...@gmail.com> wrote:

> Max Lynch wrote:
>
>>>> Well what happens is if I use a SpanScorer instead, and allocate it like
>>>> such:
>>>>
>>>>           analyzer = StandardAnalyzer([])
>>>>           tokenStream = analyzer.tokenStream("contents",
>>>>               lucene.StringReader(text))
>>>>           ctokenStream = lucene.CachingTokenFilter(tokenStream)
>>>>           highlighter = lucene.Highlighter(formatter,
>>>>               lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
>>>>           ctokenStream.reset()
>>>>
>>>>           result = highlighter.getBestFragments(ctokenStream, text,
>>>>                   2, "...")
>>>>
>>>> My highlighter is still breaking up words inside of a span.  For example,
>>>> if I search for "John Smith", instead of the highlighter being called for
>>>> the whole "John Smith", it gets called for "John" and then "Smith".
>>> I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
>>> which is the default used by Highlighter) to ensure that each fragment
>>> contains a full match for the query.  EG something like this (copied
>>> from LIA 2nd edition):
>>>
>>>   TermQuery query = new TermQuery(new Term("field", "fox"));
>>>
>>>   TokenStream tokenStream =
>>>       new SimpleAnalyzer().tokenStream("field",
>>>           new StringReader(text));
>>>
>>>   SpanScorer scorer = new SpanScorer(query, "field",
>>>                                      new
>>> CachingTokenFilter(tokenStream));
>>>   Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>>>   Highlighter highlighter = new Highlighter(scorer);
>>>   highlighter.setTextFragmenter(fragmenter);
>>>
>>>
>>
>>
>>
>> Okay, I hacked something up in Java that illustrates my issue.
>>
>>
>> import org.apache.lucene.search.*;
>> import org.apache.lucene.analysis.*;
>> import org.apache.lucene.document.*;
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.queryParser.QueryParser;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.RAMDirectory;
>> import org.apache.lucene.search.highlight.*;
>> import org.apache.lucene.search.spans.SpanTermQuery;
>> import java.io.Reader;
>> import java.io.StringReader;
>>
>> public class PhraseTest {
>>     private IndexSearcher searcher;
>>     private RAMDirectory directory;
>>
>>     public PhraseTest() throws Exception {
>>         directory = new RAMDirectory();
>>
>>         Analyzer analyzer = new Analyzer() {
>>             public TokenStream tokenStream(String fieldName, Reader reader) {
>>                 return new WhitespaceTokenizer(reader);
>>             }
>>
>>             public int getPositionIncrementGap(String fieldName) {
>>                 return 100;
>>             }
>>         };
>>
>>         IndexWriter writer = new IndexWriter(directory, analyzer, true,
>>                 IndexWriter.MaxFieldLength.LIMITED);
>>
>>         Document doc = new Document();
>>         String text = "Jimbo John is his name";
>>         doc.add(new Field("contents", text, Field.Store.YES,
>>                 Field.Index.ANALYZED));
>>         writer.addDocument(doc);
>>
>>         writer.optimize();
>>         writer.close();
>>
>>         searcher = new IndexSearcher(directory);
>>
>>         // Try a phrase query
>>         PhraseQuery phraseQuery = new PhraseQuery();
>>         phraseQuery.add(new Term("contents", "Jimbo"));
>>         phraseQuery.add(new Term("contents", "John"));
>>
>>         // Try a SpanTermQuery
>>         SpanTermQuery spanTermQuery =
>>                 new SpanTermQuery(new Term("contents", "Jimbo John"));
>>
>>         // Try a parsed query
>>         Query parsedQuery =
>>                 new QueryParser("contents", analyzer).parse("\"Jimbo John\"");
>>
>>         Hits hits = searcher.search(parsedQuery);
>>         System.out.println("We found " + hits.length() + " hits.");
>>
>>         // Highlight the results
>>         CachingTokenFilter tokenStream = new CachingTokenFilter(
>>                 analyzer.tokenStream("contents", new StringReader(text)));
>>
>>         SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
>>
>>         SpanScorer sc = new SpanScorer(parsedQuery, "contents",
>>                 tokenStream, "contents");
>>
>>         Highlighter highlighter = new Highlighter(formatter, sc);
>>         highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
>>         tokenStream.reset();
>>
>>         String rv = highlighter.getBestFragments(tokenStream, text, 1, "...");
>>         System.out.println(rv);
>>     }
>>
>>     public static void main(String[] args) {
>>         System.out.println("Starting...");
>>         try {
>>             PhraseTest pt = new PhraseTest();
>>         } catch (Exception ex) {
>>             ex.printStackTrace();
>>         }
>>     }
>> }
>>
>>
>>
>> The output I'm getting is instead of highlighting <B>Jimbo John</B> it
>> does
>> <B>Jimbo</B> then <B>John</B>.  Can I get around this some how?  I tried
>> several different query types (they are declared in the code, but only the
>> parsed version is being used).
>>
>> Thanks
>> -max
>>
>>
>>
> Sorry, not much you can do at the moment. The change is non trivial for
> sure (its probably easier to write some regex that merges them). This
> limitation was accepted because with most markup, it will display the same
> anyway. An option to merge would be great, and while I don't remember the
> details, the last time I looked, it just ain't easy to do based on the
> implementation. The highlighter highlights by running through and scoring
> tokens, not phrases, and the Span highlighter asks if a given token is in a
> given span to see if it should get a score over 0. Token by token handed off
> to the SpanScorer to be scored. I looked into adding the option at one point
> (back when I was putting the SpanScorer together) and didn't find it worth
> the effort after getting blocked a couple times.
>
>
Thanks anyway, Mark.  Yeah, what I gathered from the results is that I will
only get hits and highlights for phrases when the whole phrase is found, but
the highlights will be separated.  I just combine them now, but was hoping
for a more elegant solution.  At least I know that what I'm highlighting
isn't random parts of the text but the actual phrase, so all is not lost.
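For anyone who lands here wanting the combine step: one illustrative post-processing sketch (mine, not from the thread; the function is hypothetical) that merges adjacent <B>...</B> highlights separated only by whitespace:

```python
import re

def merge_highlights(html, tag="B"):
    # Collapse "</B> <B>" (a closing highlight tag followed, across
    # whitespace only, by an opening one) into one highlighted run.
    pattern = re.compile(r"</%s>(\s+)<%s>" % (tag, tag))
    prev = None
    while prev != html:  # repeat until no adjacent pairs remain
        prev = html
        html = pattern.sub(r"\1", html)
    return html

print(merge_highlights("<B>Jimbo</B> <B>John</B> is his name"))
# -> <B>Jimbo John</B> is his name
```

Since the SpanScorer only highlights terms inside real phrase matches, merging whitespace-adjacent tags like this reconstructs the whole-phrase highlight in the common case.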

-max

Re: Phrase Highlighting

Posted by Mark Miller <ma...@gmail.com>.
Yeah, the highlighter framework as it stands is certainly limiting. When I
first did the Span highlighter without trying to fit it into the old
Highlighter (an early, incomplete prototype anyway), I made the highlights
merge right off the bat, because it was very easy: I could just use the span
positions I got back in any manner I wanted to work with the tokens and
create the text. To make things work a token at a time, though (you give me
a token, I score it), I did things differently: I collect all the valid
spans for each token ahead of time, and if a token falls in a valid span, I
highlight it. I think that just makes it harder to get things right with
overlap and whatnot. It's also difficult for the Scorer to talk to the
Formatter to do the markup right without weird hacks where they talk to each
other hard-coded style. It's certainly not impossible, but it ended up being
much harder to get right within the current framework. Of course, I wasn't
considering changing the framework at the time (I wasn't even a contrib
committer then), so perhaps there is something that could be done to ease
things there (e.g. a way for the Scorer to communicate with the Formatter).
I don't have a complete memory of all the issues, though.

I also don't want to discourage anyone from trying to get something going.
It's not impossible, but it was just darn hard to get right with the
current API. I've always just recommended post-processing.
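To make that token-by-token flow concrete, here is a toy model in plain Python (mine, not Lucene code): spans are precomputed (start, end) token-position ranges, and each token is then scored on its own, which is exactly why a matched phrase comes out as two separate highlight calls:

```python
def highlight_tokens(tokens, spans):
    # Each token is scored independently: it gets markup if it falls
    # inside any precomputed span, so "Jimbo" and "John" are wrapped
    # separately even though they belong to the same matched phrase.
    out = []
    for pos, tok in enumerate(tokens):
        in_span = any(start <= pos <= end for start, end in spans)
        out.append("<B>%s</B>" % tok if in_span else tok)
    return " ".join(out)

print(highlight_tokens(["Jimbo", "John", "is", "his", "name"], [(0, 1)]))
# -> <B>Jimbo</B> <B>John</B> is his name
```

Merging would require the loop to remember that the previous token was in the same span, which is state the one-token-at-a-time Scorer/Formatter handoff doesn't carry.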

- Mark

Michael McCandless wrote:
> Mark, is this because the highlighter package doesn't include enough
> information as to why the fragmenter picked a given fragment?
>
> Because... the SpanScorer is in fact doing all the work to properly
> locate the full span for the phrase (I think?), so it's ashame that
> because there's no way for it to "communicate" this information to the
> formatter.  The strong decoupling of fragmenting from highlighting is
> hurting us here...
>
> Mike
>
> On Wed, Jun 3, 2009 at 8:34 PM, Mark Miller <ma...@gmail.com> wrote:
> [earlier discussion and code example snipped]


-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Phrase Highlighting

Posted by Michael McCandless <lu...@mikemccandless.com>.
Mark, is this because the highlighter package doesn't include enough
information as to why the fragmenter picked a given fragment?

Because... the SpanScorer is in fact doing all the work to properly
locate the full span for the phrase (I think?), so it's a shame that
there's no way for it to "communicate" this information to the
formatter.  The strong decoupling of fragmenting from highlighting is
hurting us here...

Mike

On Wed, Jun 3, 2009 at 8:34 PM, Mark Miller <ma...@gmail.com> wrote:
> Max Lynch wrote:
>> [earlier discussion and code example snipped]
>
> Sorry, not much you can do at the moment. The change is non trivial for sure
> (its probably easier to write some regex that merges them). This limitation
> was accepted because with most markup, it will display the same anyway. An
> option to merge would be great, and while I don't remember the details, the
> last time I looked, it just ain't easy to do based on the implementation.
> The highlighter highlights by running through and scoring tokens, not
> phrases, and the Span highlighter asks if a given token is in a given span
> to see if it should get a score over 0. Token by token handed off to the
> SpanScorer to be scored. I looked into adding the option at one point (back
> when I was putting the SpanScorer together) and didn't find it worth the
> effort after getting blocked a couple times.



Re: Phrase Highlighting

Posted by Mark Miller <ma...@gmail.com>.
Max Lynch wrote:
> [earlier discussion and code example snipped]
>
> The output I'm getting is instead of highlighting <B>Jimbo John</B> it does
> <B>Jimbo</B> then <B>John</B>.  Can I get around this somehow?  I tried
> several different query types (they are declared in the code, but only the
> parsed version is being used).
>
> Thanks
> -max
>
Sorry, not much you can do at the moment. The change is non-trivial for
sure (it's probably easier to write some regex that merges them). This
limitation was accepted because with most markup, it will display the
same anyway. An option to merge would be great, and while I don't
remember the details, the last time I looked, it just ain't easy to do
based on the implementation. The highlighter highlights by running
through and scoring tokens, not phrases, and the Span highlighter asks
if a given token is in a given span to see if it should get a score over
0. Tokens are handed off one by one to the SpanScorer to be scored. I
looked into adding the option at one point (back when I was putting the
SpanScorer together) and didn't find it worth the effort after getting
blocked a couple of times.


-- 
- Mark

http://www.lucidimagination.com






Re: Phrase Highlighting

Posted by Max Lynch <ih...@gmail.com>.
> Well what happens is if I use a SpanScorer instead, and allocate it like

> > such:
> >
> >            analyzer = StandardAnalyzer([])
> >            tokenStream = analyzer.tokenStream("contents",
> > lucene.StringReader(text))
> >            ctokenStream = lucene.CachingTokenFilter(tokenStream)
> >            highlighter = lucene.Highlighter(formatter,
> > lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
> >            ctokenStream.reset()
> >
> >            result = highlighter.getBestFragments(ctokenStream, text,
> >                    2, "...")
> >
> >  My highlighter is still breaking up words inside of a span.  For
> example,
> > if I search for \"John Smith\", instead of the highlighter being called
> for
> > the whole "John Smith", it gets called for "John" and then "Smith".
>
> I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
> which is the default used by Highlighter) to ensure that each fragment
> contains a full match for the query.  EG something like this (copied
> from LIA 2nd edition):
>
>    TermQuery query = new TermQuery(new Term("field", "fox"));
>
>    TokenStream tokenStream =
>        new SimpleAnalyzer().tokenStream("field",
>            new StringReader(text));
>
>    SpanScorer scorer = new SpanScorer(query, "field",
>                                       new CachingTokenFilter(tokenStream));
>    Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>    Highlighter highlighter = new Highlighter(scorer);
>    highlighter.setTextFragmenter(fragmenter);



Okay, I hacked something up in Java that illustrates my issue.


import org.apache.lucene.search.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.spans.SpanTermQuery;
import java.io.Reader;
import java.io.StringReader;

public class PhraseTest {
    private IndexSearcher searcher;
    private RAMDirectory directory;

    public PhraseTest() throws Exception {
        directory = new RAMDirectory();

        Analyzer analyzer = new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new WhitespaceTokenizer(reader);
            }

            public int getPositionIncrementGap(String fieldName) {
                return 100;
            }
        };

        IndexWriter writer = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);

        Document doc = new Document();
        String text = "Jimbo John is his name";
        doc.add(new Field("contents", text, Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();

        searcher = new IndexSearcher(directory);

        // Try a phrase query
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term("contents", "Jimbo"));
        phraseQuery.add(new Term("contents", "John"));

        // Try a SpanTermQuery
        SpanTermQuery spanTermQuery =
                new SpanTermQuery(new Term("contents", "Jimbo John"));

        // Try a parsed query
        Query parsedQuery = new QueryParser("contents", analyzer)
                .parse("\"Jimbo John\"");

        Hits hits = searcher.search(parsedQuery);
        System.out.println("We found " + hits.length() + " hits.");

        // Highlight the results
        CachingTokenFilter tokenStream = new CachingTokenFilter(
                analyzer.tokenStream("contents", new StringReader(text)));

        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();

        SpanScorer sc = new SpanScorer(parsedQuery, "contents", tokenStream,
                "contents");

        Highlighter highlighter = new Highlighter(formatter, sc);
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
        tokenStream.reset();

        String rv = highlighter.getBestFragments(tokenStream, text, 1, "...");
        System.out.println(rv);

    }
    public static void main(String[] args) {
        System.out.println("Starting...");
        try {
            PhraseTest pt = new PhraseTest();
        } catch(Exception ex) {
            ex.printStackTrace();
        }
    }
}



The output I'm getting highlights <B>Jimbo</B> and then <B>John</B> instead
of <B>Jimbo John</B>.  Can I get around this somehow?  I tried several
different query types (they are declared in the code, but only the parsed
version is being used).

Thanks
-max

Re: Phrase Highlighting

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, May 21, 2009 at 3:09 PM, Max Lynch <ih...@gmail.com> wrote:
> Sorry, the following code is in python, but I can hack a Java thing together
> if necessary.

I'm a big Python fan :)

> HighlighterSpanScorer is the SpanScorer from the highlight
> package just renamed to avoid conflict with the other SpanScorer object.
>
> Well what happens is if I use a SpanScorer instead, and allocate it like
> such:
>
>            analyzer = StandardAnalyzer([])
>            tokenStream = analyzer.tokenStream("contents",
> lucene.StringReader(text))
>            ctokenStream = lucene.CachingTokenFilter(tokenStream)
>            highlighter = lucene.Highlighter(formatter,
> lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
>            ctokenStream.reset()
>
>            result = highlighter.getBestFragments(ctokenStream, text,
>                    2, "...")
>
>  My highlighter is still breaking up words inside of a span.  For example,
> if I search for \"John Smith\", instead of the highlighter being called for
> the whole "John Smith", it gets called for "John" and then "Smith".

I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
which is the default used by Highlighter) to ensure that each fragment
contains a full match for the query.  EG something like this (copied
from LIA 2nd edition):

    TermQuery query = new TermQuery(new Term("field", "fox"));

    TokenStream tokenStream =
        new SimpleAnalyzer().tokenStream("field",
            new StringReader(text));

    SpanScorer scorer = new SpanScorer(query, "field",
                                       new CachingTokenFilter(tokenStream));
    Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
    Highlighter highlighter = new Highlighter(scorer);
    highlighter.setTextFragmenter(fragmenter);


>> > In the mean time, If I am interested in finding out exactly how many
>> times a
>> > term was found in a document, what is the best way to go about this?  The
>> > way I am doing it right now is using a highlighter and just incrementing
>> > counters when a word is found that I'm interested in.  I just came across
>> > FieldSortedTermVectorMapper that could do something similar.  Is
>> > FieldSortedTermVectorMapper something I could use for this?  Is there a
>> > better option?
>>
>>
>> Is it really just single terms you need to measure?  (eg, not "how
>> many times did phrase XYZ occur in the doc").  If so, then getting the
>> term vectors and locating your term in there, should work.  This is
>> probably OK if you just do it for each of the hits on the page (like
>> 10 hits), but will be way too slow if you try to do it for say all
>> docs that matched the query.
>>
>
> I see how the term vector might be used.  I can't really tell if there is a
> way for me to do a Span check on the words as easily as the highlighter
> does.

TermVectors won't let you do a span check -- they just return the
terms & their frequencies (and optionally positions & offsets, if you
indexed them).
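
To make the shape of that data concrete, here is a tiny plain-Java stand-in
(not the Lucene term-vector API; a hypothetical helper that whitespace-tokenizes
the stored text, much as a term vector has already done at index time, and
tallies per-term frequencies):

```java
import java.util.HashMap;
import java.util.Map;

public class TermCounts {
    // Build a term -> frequency map from stored text, lowercasing and
    // splitting on whitespace (a crude stand-in for an analyzer).
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.length() == 0) continue;
            Integer n = freqs.get(token);
            freqs.put(token, n == null ? 1 : n + 1);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> f =
                count("Jimbo John is his name and John smiles");
        System.out.println(f.get("john")); // prints 2
    }
}
```

Per-term counts like these are all you can recover without positions, which
is why a span (phrase) check needs the highlighter or position data instead.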

Mike
