Posted to user@lucenenet.apache.org by Pál Barnabás <pb...@gmail.com> on 2009/03/09 16:14:07 UTC

Highlighter with Field.Store.NO

Hi,
I'm trying to highlight the keyword in the search result.
This is my code:
------------------------------------------------------------------
// Namespaces as found in Lucene.Net 2.x; the highlighter lives in the
// contrib assembly, and its namespace may differ in other versions.
using System.Diagnostics;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Highlight;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Start from an empty index directory.
string indexdir = @"D:\temp\index_testing";
if (System.IO.Directory.Exists(indexdir))
    System.IO.Directory.Delete(indexdir, true);

IndexWriter writer = new IndexWriter(indexdir, new StandardAnalyzer(), true);

// demo text
string scontent = "First, we parse the user-entered query string indicating that we want to match ...";

// Index 100 identical documents, each with a stored ID and a stored CONTENT field.
for (int i = 0; i < 100; i++)
{
    Document doc = new Document();

    doc.Add(new Field("ID", i.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field("CONTENT", scontent, Field.Store.YES, Field.Index.TOKENIZED));

    writer.AddDocument(doc);
}

writer.Close();

IndexReader reader = IndexReader.Open(indexdir);
Searcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();

MultiFieldQueryParser parser = new MultiFieldQueryParser(new string[] { "CONTENT" }, analyzer);

Query query = parser.Parse("indicating");
query = query.Rewrite(reader);
Trace.WriteLine("Searching for: " + query.ToString());

Hits hits = searcher.Search(query);

// Wrap each matching term in <b class='term'>...</b>.
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b class='term'>", "</b>");
QueryScorer scorer = new QueryScorer(query);

Highlighter highlighter = new Highlighter(formatter, scorer);
highlighter.SetTextFragmenter(new SimpleFragmenter(2000));

for (int i = 0; i < hits.Length(); i++)
{
    Document resdoc = hits.Doc(i);

    string s = resdoc.Get("CONTENT");
    // s is null if Field.Store is NO, and new StringReader(null) then throws
    TokenStream tsTitle = analyzer.TokenStream("CONTENT", new System.IO.StringReader(s));
    string hl = highlighter.GetBestFragment(tsTitle, s);
}
------------------------------------------------------------------

The problem is that when the content is not stored in the index
(Field.Store.NO), the result document does not contain the value. Is
it possible to use the Highlighter class in this case? Or what is the
best way to highlight the search result? Is it possible to get all
tokens for hits.Doc(i)?

Re: Highlighter with Field.Store.NO

Posted by Shashi Kant <sk...@sloan.mit.edu>.
Interesting problem. As others have pointed out, highlighting is an expensive
operation, so use it sparingly.

A hack approach might be to create a faux highlight at index time: for
example, extract the title and sections of the body and attachment (there are
approaches for choosing them, such as summarization techniques) and store this
pseudo-highlight. Obviously this would *not* be a search-specific
highlight.
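
A minimal sketch of that index-time idea, assuming the crudest possible
"summary": the first part of the extracted text, cut at a word boundary.
The Teaser class and its Make method are hypothetical names, not part of
Lucene.Net, and a real summarizer would choose its text more carefully.

```csharp
using System;

// Hypothetical helper (not a Lucene.Net class): builds a short,
// query-independent teaser to store alongside the unstored content.
public static class Teaser
{
    // Returns the trimmed text unchanged if it fits in maxChars; otherwise
    // cuts it at the last space at or before maxChars and appends " ...".
    public static string Make(string text, int maxChars)
    {
        if (string.IsNullOrEmpty(text) || maxChars <= 0)
            return string.Empty;

        string trimmed = text.Trim();
        if (trimmed.Length <= maxChars)
            return trimmed;

        // Search backwards from maxChars for a word boundary.
        int cut = trimmed.LastIndexOf(' ', maxChars);
        if (cut <= 0)
            cut = maxChars;   // no usable space found: hard cut

        return trimmed.Substring(0, cut).TrimEnd() + " ...";
    }
}
```

The teaser could then go into a small stored field next to the unstored
CONTENT field, for example
doc.Add(new Field("SUMMARY", Teaser.Make(scontent, 200), Field.Store.YES, Field.Index.NO));
where the SUMMARY field name is also just an assumption.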



On Tue, Mar 10, 2009 at 7:08 AM, Moray McConnachie
<mmcconna@oxford-analytica.com> wrote:

> Assuming the percentage of documents hit by search in any particular time
> period is very low, as I would expect in a mail system, then it will be more
> effective for such a large database to keep the Lucene index size down by not
> storing the complete contents - so you need Field.Store.NO, as you already
> established.
>
> Highlighter has no magic way to retrieve the content, so when you use
> Highlighter you will need to pass it the full content for each search
> result, as described by Ben below. I think perhaps when Ben says cache he
> just means load the content from your main content store.

RE: Highlighter with Field.Store.NO

Posted by Moray McConnachie <mm...@oxford-analytica.com>.
Assuming the percentage of documents hit by search in any particular time period is very low, as I would expect in a mail system, then it will be more effective for such a large database to keep the Lucene index size down by not storing the complete contents - so you need Field.Store.NO, as you already established.

Highlighter has no magic way to retrieve the content, so when you use Highlighter you will need to pass it the full content for each search result, as described by Ben below. I think perhaps when Ben says cache he just means load the content from your main content store.

So in your code, instead of 

> string s = resdoc.Get("CONTENT");

you need 

  string s = scontent;

Obviously a trivial example!

In the real world it might be

  string s = GetEmailBody(EmailID);

Or whatever.

The use of a cache in another sense - i.e. a place to temporarily store data as it is retrieved from your main store in case the same content is needed again soon - might be advisable depending on how expensive it is to retrieve the full contents from your store, and how good a job of caching your content retrieval system does. Most databases and filesystems will do a good job of caching, so if the content is stored in a simple way and there is not significant latency between search and content store, you do not need a local cache.
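
If a local cache does turn out to be worthwhile, a minimal sketch could look
like the following. ContentCache is a hypothetical class, the loader delegate
stands in for whatever reads the text from your main store (GetEmailBody
above), and the oldest-first eviction is only an assumption chosen to keep the
example short. It is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a Lucene.Net class): keeps the most recently
// loaded texts in memory so repeated hits do not reload them.
public sealed class ContentCache
{
    private readonly Func<string, string> _load;   // e.g. id => GetEmailBody(id)
    private readonly int _capacity;
    private readonly Dictionary<string, string> _items = new Dictionary<string, string>();
    private readonly Queue<string> _order = new Queue<string>();

    public ContentCache(Func<string, string> load, int capacity)
    {
        if (load == null) throw new ArgumentNullException("load");
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        _load = load;
        _capacity = capacity;
    }

    // Returns the cached text for id, loading it from the store on a miss.
    public string Get(string id)
    {
        string text;
        if (_items.TryGetValue(id, out text))
            return text;

        text = _load(id);

        // Evict the oldest entry once the cache is full (first in, first out).
        if (_items.Count >= _capacity)
            _items.Remove(_order.Dequeue());

        _items[id] = text;
        _order.Enqueue(id);
        return text;
    }
}
```

In the hit loop of the original code, which stores the ID field, the lookup
would then become string s = cache.Get(resdoc.Get("ID")); before s is handed to
analyzer.TokenStream and highlighter.GetBestFragment as before.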

Yours,
Moray

------------------------------------- 
Moray McConnachie
Head of IS        +44 1865 261 600
Oxford Analytica  http://www.oxan.com

-----Original Message-----
From: Pál Barnabás [mailto:pbarni@gmail.com] 
Sent: 10 March 2009 10:27
To: lucene-net-user@incubator.apache.org
Subject: Re: Highlighter with Field.Store.NO

Thanks for the quick answer.
This solution is not possible for me. I want to index millions of e-mails with attachments (doc, pdf, etc). The mails and the files are already stored, so saving the text content in a separate cache is not acceptable.
I tried to save the content with the Field.Store.COMPRESS option, but the performance was very poor (3x the indexing time).


Re: Highlighter with Field.Store.NO

Posted by Pál Barnabás <pb...@gmail.com>.
Thanks for the quick answer.
This solution is not possible for me. I want to index millions of e-mails
with attachments (doc, pdf, etc). The mails and the files are already
stored, so saving the text content in a separate cache is not acceptable.
I tried to save the content with the Field.Store.COMPRESS option, but the
performance was very poor (3x the indexing time).

2009/3/9 Ben Martz <be...@gmail.com>

> I use the Highlighter class in a shipping product in which I do not store
> values in the index. Instead I independently load the contents from my own
> cache and pass that to Highlighter.GetBestFragments(). The only disadvantage
> is that depending on the size of your contents and the speed of your
> contents cache this can make Highlighting a very expensive operation so pay
> very careful attention to how and when you load your contents data.

Re: Highlighter with Field.Store.NO

Posted by Ben Martz <be...@gmail.com>.
I use the Highlighter class in a shipping product in which I do not store
values in the index. Instead, I independently load the contents from my own
cache and pass that to Highlighter.GetBestFragments(). The only disadvantage
is that, depending on the size of your contents and the speed of your
contents cache, this can make highlighting a very expensive operation, so pay
very careful attention to how and when you load your contents data.

-- 
13:37 - Someone stole the precinct toilet. The cops have nothing to go on.
14:37 - Officers dispatched to a daycare where a three-year-old was
resisting a rest.
21:11 - Hole found in nudist camp wall. Officers are looking into it.