Posted to java-user@lucene.apache.org by Amin Mohammed-Coleman <am...@gmail.com> on 2009/03/07 10:38:23 UTC

Lucene Highlighting and Dynamic Summaries

Hi
I am currently indexing uploaded documents (PDF, MS Word, etc.).  These
documents can be searched, and the search returns summaries of the matching
documents to the user.  Currently the summaries are extracted when indexing
the file (the summary is the first 10 lines of the document, stored in the
index as a field).  This is not ideal (the summary is static), and I was
wondering if it would be possible to create a dynamic summary when a hit is
found and highlight the matched terms.  The content of the document is not
stored in the index.

So basically what I'm looking to do is:

1) PDF indexed
2) PDF body contains the word "search"
3) Do a search and return the hit
4) Construct a summary with the term "search" included.

I'm not sure how to go about doing this (I presume it is possible).  I would
be grateful for any advice.


Cheers
Amin
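
The four steps in the question can be sketched without Lucene at all. What
follows is a minimal plain-Java illustration of a dynamic (keyword-in-context)
summary. It is not Lucene's Highlighter; the class name DynamicSummary, the
method summarize and the context parameter are hypothetical names chosen for
this example, and it assumes the document text is available at search time.

```java
/** Illustration only: a keyword-in-context summary in plain Java (not Lucene's Highlighter). */
class DynamicSummary {

    /** Finds the first case-insensitive occurrence of term in text, or -1. */
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns a snippet around the first hit, with the hit wrapped in <b> tags,
     * or "" when the term does not occur. The context argument (hypothetical)
     * is the number of characters kept on each side of the hit.
     */
    static String summarize(String text, String term, int context) {
        if (term.isEmpty()) {
            return "";
        }
        int hit = indexOfIgnoreCase(text, term);
        if (hit < 0) {
            return "";
        }
        int hitEnd = hit + term.length();
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hitEnd + context);

        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("...");          // text was cut on the left
        }
        sb.append(text, start, hit);
        sb.append("<b>").append(text, hit, hitEnd).append("</b>");
        sb.append(text, hitEnd, end);
        if (end < text.length()) {
            sb.append("...");          // text was cut on the right
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String body = "Lucene lets you search large document collections quickly.";
        System.out.println(summarize(body, "search", 12));
    }
}
```

The sketch only handles a single term and a single fragment; picking the best
few fragments for a whole query is the part a real highlighter adds.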

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

JIRA raised:

https://issues.apache.org/jira/browse/LUCENE-1559

Thanks

On Thu, Mar 12, 2009 at 11:29 AM, Amin Mohammed-Coleman <am...@gmail.com>wrote:

> Hi
>
> Did both attachments not come through?
>
> Cheers
> Amin
>
>
> On Thu, Mar 12, 2009 at 9:52 AM, mark harwood <ma...@yahoo.co.uk>wrote:
>
>> The attachment didn't make it through here. Can you add it as an
>> attachment to a new JIRA issue?
>>
>> Thanks,
>> Mark
>>
>>
>>
>>
>>
>> ________________________________
>> From: Amin Mohammed-Coleman <am...@gmail.com>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, 12 March, 2009 7:47:20
>> Subject: Re: Lucene Highlighting and Dynamic Summaries
>>
>> Hi
>>
>> Please find attached a test case plus a document.  Just to mention, this
>> sometimes occurs for other files too.
>>
>>
>> Cheers
>> Amin
>>
>>
>> On Wed, Mar 11, 2009 at 6:11 PM, markharw00d <ma...@yahoo.co.uk>
>> wrote:
>>
>> If you can supply a Junit test that recreates the problem I think we can
>> start to make progress on this.
>>
>>
>>
>> Amin Mohammed-Coleman wrote:
>>
>> Hi
>>
>> Apologies for resending this mail. Just wondering if anyone has
>> experienced the below. I'm not sure if this could happen due to the nature
>> of the document. It does seem strange that one term search returns a summary
>> while another does not, even though the same document is being returned.
>>
>> I'm asking so I can code around this if it is normal.
>>
>>
>> Apologies again for resending this mail
>>
>> Cheers
>>
>> Amin
>>
>> Sent from my iPhone
>>
>> On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <am...@gmail.com> wrote:
>>
>>
>> Hi
>>
>> I am seeing some strange behaviour with the highlighter and I'm wondering
>> if anyone else is experiencing this.  In certain instances I don't get a
>> summary being generated.  I perform the search and the search returns the
>> correct document.  I can see that the lucene document contains the text in
>> the field.  However after doing:
>>
>>   SimpleHTMLFormatter simpleHTMLFormatter =
>>       new SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
>>   // required for highlighting
>>   Query query2 = multiSearcher.rewrite(query);
>>   Highlighter highlighter =
>>       new Highlighter(simpleHTMLFormatter, new QueryScorer(query2));
>>   ...
>>
>>   String text = doc.get(FieldNameEnum.BODY.getDescription());
>>   TokenStream tokenStream = analyzer.tokenStream(
>>       FieldNameEnum.BODY.getDescription(), new StringReader(text));
>>   String result = highlighter.getBestFragments(tokenStream, text, 3, "...");
>>
>>
>> the string result is empty.  This is very strange: if I try a different
>> term that exists in the document then I get a summary.  For example, I have
>> a Word document that contains the terms "document" and "aspectj".  If I
>> search for "document" I get the correct document but no highlighted summary.
>> However, if I search using "aspectj" I get the same document with a
>> highlighted summary.
>>
>> Just to mention, I do rewrite the original query before performing the
>> highlighting.
>>
>> I'm not sure what I'm missing here.  Any help would be appreciated.
>>
>> Cheers
>> Amin
>>
>> On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <am...@gmail.com>
>> wrote:
>> Hi
>>
>> Got it working!  Thanks again for your help!
>>
>>
>> Amin
>>
>>
>> On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <am...@gmail.com>
>> wrote:
>> Thanks!  The final piece that I needed to do for the project!
>>
>> Cheers
>>
>> Amin
>>
>> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>> > Cool.  I will use compression and store in the index. Is there anything
>> > special I need to do for decompressing the text? I presume I can just do
>> > doc.get("content")?
>> > Thanks for your advice all!
>>
>> No, just use Field.Store.COMPRESS when adding to the index and Document.get()
>> when fetching. The decompression is done automatically.
>>
>> You may wonder: why not enable compression for all fields? The reason is
>> that it adds overhead for very small and short fields, so you should only
>> use it for large contents (it's the same as compressing very small files
>> with ZIP/GZIP: these files mostly end up larger than without compression).
>>
>> Uwe
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
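
Uwe's remark in the quoted exchange, that compressing very small values tends
to make them larger, can be checked with a short plain-Java sketch. It uses
java.util.zip.Deflater as a stand-in; the thread does not say how
Field.Store.COMPRESS is implemented, so treat the choice of Deflater as an
assumption. The class name CompressionOverhead and the method compressedSize
are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Illustration only: compares raw and DEFLATE-compressed sizes of field values. */
class CompressionOverhead {

    /** Returns how many bytes DEFLATE produces for the UTF-8 bytes of s. */
    static int compressedSize(String s) {
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();                 // no more input will follow

        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);  // count output, discard the bytes
        }
        deflater.end();                    // release native resources
        return total;
    }

    public static void main(String[] args) {
        String shortValue = "Amin";
        StringBuilder longValue = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            longValue.append("lucene highlighter ");
        }
        System.out.println("short: " + shortValue.length() + " -> " + compressedSize(shortValue));
        System.out.println("long:  " + longValue.length() + " -> " + compressedSize(longValue.toString()));
    }
}
```

The short value grows because the compressed stream carries a fixed header and
checksum, while the long, repetitive value shrinks considerably. That matches
the advice to compress only large contents.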

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

Did both attachments not come through?

Cheers
Amin

On Thu, Mar 12, 2009 at 9:52 AM, mark harwood <ma...@yahoo.co.uk>wrote:

> The attachment didn't make it through here. Can you add it as an attachment
> to a new JIRA issue?
>
> Thanks,
> Mark

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Ok.  I tried to apply the patch(es) and completely messed it up (user
error).  Is there a full example of the highlighter available that I can
apply and test?

Cheers
Amin



Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Absolutely!  I have received considerable help from the community and there
is so much more I want to ask!

Cheers!

Amin

On Fri, Mar 13, 2009 at 10:41 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Well, it's not yet committed.
>
> You can use it now by pulling the patch attached to the issue & testing it
> yourself.  If you do so, please report back!  This is how Lucene improves.
>
> I'm hoping we can include it in 2.9...
>
> Mike

Re: Lucene Highlighting and Dynamic Summaries

Posted by Michael McCandless <lu...@mikemccandless.com>.

Well, it's not yet committed.

You can use it now by pulling the patch attached to the issue &  
testing it yourself.  If you do so, please report back!  This is how  
Lucene improves.

I'm hoping we can include it in 2.9...

Mike

On Mar 13, 2009, at 6:35 AM, Amin Mohammed-Coleman wrote:

> Sweet!  When will this highlighter be available?  Can I use this now?
>
> Cheers!

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Sweet!  When will this highlighter be available?  Can I use this now?

Cheers!


On Fri, Mar 13, 2009 at 10:10 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Amin Mohammed-Coleman wrote:
>
>> I think that would be good.
>
> I'll open an issue.
>
>> Probably a silly thing to ask but I guess there is a performance
>> implication by setting it to max value.
>
> Right.  And it's tough choosing a default in situations like this --
> performance vs losing stuff.
>
> However, there's a new highlighter:
>
>    https://issues.apache.org/jira/browse/LUCENE-1522
>
> which looks like it may have promising performance and no default "loses
> highlighted terms" limit, I think.
>
> Mike

Re: Lucene Highlighting and Dynamic Summaries

Posted by Michael McCandless <lu...@mikemccandless.com>.

Amin Mohammed-Coleman wrote:

> I think that would be good.

I'll open an issue.

> Probably a silly thing to ask but I guess there is a performance  
> implication by setting it to max value.

Right.  And it's tough choosing a default in situations like this --  
performance vs losing stuff.

However, there's a new highlighter:

     https://issues.apache.org/jira/browse/LUCENE-1522

which looks like it may have promising performance and no default  
"loses highlighted terms" limit, I think.

Mike


Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

I think that would be good. Probably a silly thing to ask, but I guess
there is a performance implication in setting it to the max value.

Is there a general setting that other developers use?

Cheers

Amin



On 12 Mar 2009, at 22:03, Michael McCandless  
<lu...@mikemccandless.com> wrote:

>
> IndexWriter has such behavior too, and because it was such a common trap
> (developers could not understand why their content was being truncated), we
> made that setting explicit, up front so you were aware of it.
>
> I think this in general is a reasonable approach for settings that "lose"
> stuff (content, highlighted terms, etc.).
>
> Maybe we should do the same for highlighter?
>
> Mike

Re: Lucene Highlighting and Dynamic Summaries

Posted by Michael McCandless <lu...@mikemccandless.com>.

IndexWriter has such behavior too, and because it was such a common trap
(developers could not understand why their content was being truncated), we
made that setting explicit, up front, so you were aware of it.

I think this in general is a reasonable approach for settings that "lose"
stuff (content, highlighted terms, etc.).

Maybe we should do the same for highlighter?

Mike

Amin Mohammed-Coleman wrote:

> I did the following:
>
> highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
>
>
> which works.

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

I did the following:

highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);


which works.
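
One plausible reading of why this helps: the Highlighter only analyses the
start of the text by default (50*1024 characters in the contrib Highlighter,
as far as I know; treat that figure as an assumption), so a term whose first
occurrence lies beyond that limit never produces a fragment. The sketch below
does not use Lucene at all. It is a plain-Java imitation of such a cut-off,
with made-up names (AnalysisLimitSketch, summarize, sampleText), showing how
"aspectj" near the top gets a summary while "document" further down does not.

```java
import java.util.Locale;

public class AnalysisLimitSketch {

    /**
     * Returns a short fragment around the first hit with the term wrapped in
     * <b> tags, or "" if no hit lies inside the analysed window.
     */
    static String summarize(String text, String term, int maxCharsToAnalyze) {
        // Only the first maxCharsToAnalyze characters are ever inspected.
        int window = Math.min(text.length(), maxCharsToAnalyze);
        String analysed = text.substring(0, window).toLowerCase(Locale.ROOT);
        int hit = analysed.indexOf(term.toLowerCase(Locale.ROOT));
        if (hit < 0) {
            return ""; // the term may still exist later in the text
        }
        int end = hit + term.length();
        int from = Math.max(0, hit - 20);
        int to = Math.min(text.length(), end + 20);
        return text.substring(from, hit)
                + "<b>" + text.substring(hit, end) + "</b>"
                + text.substring(end, to);
    }

    /** Hypothetical document: "aspectj" early, "document" only after ~60,000 chars. */
    static String sampleText() {
        StringBuilder sb = new StringBuilder("Notes on aspectj weaving. ");
        while (sb.length() < 60_000) {
            sb.append("filler text ");
        }
        sb.append("This document ends here.");
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = sampleText();
        int assumedDefault = 50 * 1024; // assumed default limit, see note above
        System.out.println("aspectj,  limited:   [" + summarize(text, "aspectj", assumedDefault) + "]");
        System.out.println("document, limited:   [" + summarize(text, "document", assumedDefault) + "]");
        System.out.println("document, unlimited: [" + summarize(text, "document", Integer.MAX_VALUE) + "]");
    }
}
```

Raising the limit to Integer.MAX_VALUE means the whole text is tokenised for
every hit, which may be slow for very large documents; a large but finite
limit is a possible compromise.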

On Thu, Mar 12, 2009 at 6:41 PM, Amin Mohammed-Coleman <am...@gmail.com>wrote:

> JIRA updated.  Includes new testcase which shows highlighter not working as
> expected.
>
>
> On Thu, Mar 12, 2009 at 5:56 PM, Amin Mohammed-Coleman <am...@gmail.com>wrote:
>
>> Hi
>>
>> I have found that it is not an issue with POI. I extracted the text using POI
>> in a different way and the term is extracted properly.  When I store the text
>> and retrieve it, the term exists. However, running the text through the
>> highlighter doesn't work.
>>
>> I will post a test case with a plain text file on JIRA. Currently on a cramped
>> train!
>>
>> Cheers
>>
>>
>>
>> On 11 Mar 2009, at 18:11, markharw00d <ma...@yahoo.co.uk> wrote:
>>
>>  If you can supply a Junit test that recreates the problem I think we can
>>> start to make progress on this.
>>>
>>>
>>>
>>> Amin Mohammed-Coleman wrote:
>>>
>>>> Hi
>>>>
>>>> Apologies for re sending this mail. Just wondering if anyone has
>>>> experienced the below. I'm not sure if this could happen due nature of
>>>> document. It does seem strange one term search returns summary while another
>>>> does not even though same document is being returned.
>>>>
>>>> I'm asking this so I can code around this if is normal.
>>>>
>>>>
>>>> Apologies again for re sending this mail
>>>>
>>>> Cheers
>>>>
>>>> Amin
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <am...@gmail.com>
>>>> wrote:
>>>>
>>>>  Hi
>>>>>
>>>>> I am seeing some strange behaviour with the highlighter and I'm
>>>>> wondering if anyone else is experiencing this.  In certain instances I don't
>>>>> get a summary being generated.  I perform the search and the search returns
>>>>> the correct document.  I can see that the lucene document contains the text
>>>>> in the field.  However after doing:
>>>>>
>>>>>   SimpleHTMLFormatter simpleHTMLFormatter = new
>>>>> SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
>>>>>           //required for highlighting
>>>>>           Query query2 = multiSearcher.rewrite(query);
>>>>>           Highlighter highlighter = new
>>>>> Highlighter(simpleHTMLFormatter, new QueryScorer(query2));
>>>>> ...
>>>>>
>>>>> String text= doc.get(FieldNameEnum.BODY.getDescription());
>>>>>               TokenStream tokenStream =
>>>>> analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new
>>>>> StringReader(text));
>>>>>               String result = highlighter.getBestFragments(tokenStream,
>>>>> text, 3, "...");
>>>>>
>>>>>
>>>>> the string result is empty.  This is very strange, if i try a different
>>>>> term that exists in the document then I get a summary.  For example I have a
>>>>> word document that contains the term "document" and "aspectj".  If I search
>>>>> for "document" I get the correct document but no highlighted summary.
>>>>>  However if I search using "aspectj" I get the same doucment with
>>>>> highlighted summary.
>>>>>
>>>>> Just to mentioned I do rewrite the original query before performing the
>>>>> highlighting.
>>>>>
>>>>> I'm not sure what i'm missing here.  Any help would be appreciated.
>>>>>
>>>>> Cheers
>>>>> Amin
>>>>>
>>>>> On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <
>>>>> aminmc@gmail.com> wrote:
>>>>> Hi
>>>>>
>>>>> Got it working!  Thanks again for your help!
>>>>>
>>>>>
>>>>> Amin
>>>>>
>>>>>
>>>>> On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <
>>>>> aminmc@gmail.com> wrote:
>>>>> Thanks!  The final piece that I needed to do for the project!
>>>>>
>>>>> Cheers
>>>>>
>>>>> Amin
>>>>>
>>>>> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de>
>>>>> wrote:
>>>>> > cool.  i will use compression and store in index. is there anything
>>>>> > special
>>>>> > i need to for decompressing the text? i presume i can just do
>>>>> > doc.get("content")?
>>>>> > thanks for your advice all!
>>>>>
>>>>> No just use Field.Store.COMPRESS when adding to index and
>>>>> Document.get()
>>>>> when fetching. The decompression is automatically done.
>>>>>
>>>>> You may think, why not enable compression for all fields? The case is,
>>>>> that
>>>>> this is an overhead for very small and short fields. So you should only
>>>>> use
>>>>> it for large contents (it's the same like compressing very small files
>>>>> as
>>>>> ZIP/GZIP: These files mostly get larger than without compression).
>>>>>
>>>>> Uwe
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

JIRA updated.  It includes a new test case which shows the highlighter not
working as expected.

On Thu, Mar 12, 2009 at 5:56 PM, Amin Mohammed-Coleman <am...@gmail.com>wrote:

> Hi
>
> I have found that it is not an issue with POI. I extracted the text using POI
> in a different way and the term is extracted properly.  When I store the text
> and retrieve it, the term exists. However, running the text through the
> highlighter doesn't work.
>
> I will post a test case with a plain text file on JIRA. Currently on a cramped
> train!
>
> Cheers
>
>
>
> On 11 Mar 2009, at 18:11, markharw00d <ma...@yahoo.co.uk> wrote:
>
>  If you can supply a Junit test that recreates the problem I think we can
>> start to make progress on this.
>>
>>
>>
>> Amin Mohammed-Coleman wrote:
>>
>>> Hi
>>>
>>> Apologies for re sending this mail. Just wondering if anyone has
>>> experienced the below. I'm not sure if this could happen due nature of
>>> document. It does seem strange one term search returns summary while another
>>> does not even though same document is being returned.
>>>
>>> I'm asking this so I can code around this if is normal.
>>>
>>>
>>> Apologies again for re sending this mail
>>>
>>> Cheers
>>>
>>> Amin
>>>
>>> Sent from my iPhone
>>>
>>> On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <am...@gmail.com> wrote:
>>>
>>>  Hi
>>>>
>>>> I am seeing some strange behaviour with the highlighter and I'm
>>>> wondering if anyone else is experiencing this.  In certain instances I don't
>>>> get a summary being generated.  I perform the search and the search returns
>>>> the correct document.  I can see that the lucene document contains the text
>>>> in the field.  However after doing:
>>>>
>>>>   SimpleHTMLFormatter simpleHTMLFormatter = new
>>>> SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
>>>>           //required for highlighting
>>>>           Query query2 = multiSearcher.rewrite(query);
>>>>           Highlighter highlighter = new Highlighter(simpleHTMLFormatter,
>>>> new QueryScorer(query2));
>>>> ...
>>>>
>>>> String text= doc.get(FieldNameEnum.BODY.getDescription());
>>>>               TokenStream tokenStream =
>>>> analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new
>>>> StringReader(text));
>>>>               String result = highlighter.getBestFragments(tokenStream,
>>>> text, 3, "...");
>>>>
>>>>
>>>> the string result is empty.  This is very strange, if i try a different
>>>> term that exists in the document then I get a summary.  For example I have a
>>>> word document that contains the term "document" and "aspectj".  If I search
>>>> for "document" I get the correct document but no highlighted summary.
>>>>  However if I search using "aspectj" I get the same doucment with
>>>> highlighted summary.
>>>>
>>>> Just to mentioned I do rewrite the original query before performing the
>>>> highlighting.
>>>>
>>>> I'm not sure what i'm missing here.  Any help would be appreciated.
>>>>
>>>> Cheers
>>>> Amin
>>>>
>>>> On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <am...@gmail.com>
>>>> wrote:
>>>> Hi
>>>>
>>>> Got it working!  Thanks again for your help!
>>>>
>>>>
>>>> Amin
>>>>
>>>>
>>>> On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <
>>>> aminmc@gmail.com> wrote:
>>>> Thanks!  The final piece that I needed to do for the project!
>>>>
>>>> Cheers
>>>>
>>>> Amin
>>>>
>>>> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>>>> > cool.  i will use compression and store in index. is there anything
>>>> > special
>>>> > i need to for decompressing the text? i presume i can just do
>>>> > doc.get("content")?
>>>> > thanks for your advice all!
>>>>
>>>> No just use Field.Store.COMPRESS when adding to index and Document.get()
>>>> when fetching. The decompression is automatically done.
>>>>
>>>> You may think, why not enable compression for all fields? The case is,
>>>> that
>>>> this is an overhead for very small and short fields. So you should only
>>>> use
>>>> it for large contents (it's the same like compressing very small files
>>>> as
>>>> ZIP/GZIP: These files mostly get larger than without compression).
>>>>
>>>> Uwe
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

Re: Lucene Highlighting and Dynamic Summaries

Posted by mark harwood <ma...@yahoo.co.uk>.

The attachment didn't make it through here. Can you add it as an attachment to a new JIRA issue?

Thanks,
Mark

________________________________
From: Amin Mohammed-Coleman <am...@gmail.com>
To: java-user@lucene.apache.org
Sent: Thursday, 12 March, 2009 7:47:20
Subject: Re: Lucene Highlighting and Dynamic Summaries

Hi

Please find attached a test case plus a document.  Just to mention, this sometimes occurs for other files too.

Cheers
Amin

On Wed, Mar 11, 2009 at 6:11 PM, markharw00d <ma...@yahoo.co.uk> wrote:

If you can supply a JUnit test that recreates the problem I think we can start to make progress on this.

Amin Mohammed-Coleman wrote:

Hi

Apologies for re-sending this mail. Just wondering if anyone has experienced the below. I'm not sure if this could happen due to the nature of the document. It does seem strange that one term search returns a summary while another does not, even though the same document is being returned.

I'm asking so that I can code around this if it is normal.

Apologies again for re-sending this mail.

Cheers

Amin

Sent from my iPhone

On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <am...@gmail.com> wrote:

Hi

I am seeing some strange behaviour with the highlighter and I'm wondering if anyone else is experiencing this.  In certain instances no summary is generated.  I perform the search and the search returns the correct document.  I can see that the Lucene document contains the text in the field.  However, after doing:

    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
    // required for highlighting
    Query query2 = multiSearcher.rewrite(query);
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query2));
    ...

    String text = doc.get(FieldNameEnum.BODY.getDescription());
    TokenStream tokenStream = analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new StringReader(text));
    String result = highlighter.getBestFragments(tokenStream, text, 3, "...");

the string result is empty.  This is very strange: if I try a different term that exists in the document then I get a summary.  For example, I have a Word document that contains the terms "document" and "aspectj".  If I search for "document" I get the correct document but no highlighted summary.  However, if I search using "aspectj" I get the same document with a highlighted summary.

Just to mention, I do rewrite the original query before performing the highlighting.

I'm not sure what I'm missing here.  Any help would be appreciated.

Cheers
Amin

On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <am...@gmail.com> wrote:
Hi

Got it working!  Thanks again for your help!

Amin

On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <am...@gmail.com> wrote:
Thanks!  The final piece that I needed to do for the project!

Cheers

Amin

On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> cool.  i will use compression and store in index. is there anything
> special
> i need to for decompressing the text? i presume i can just do
> doc.get("content")?
> thanks for your advice all!

No, just use Field.Store.COMPRESS when adding to the index and Document.get()
when fetching. The decompression is done automatically.

You may ask, why not enable compression for all fields? The reason is that
compression is an overhead for very small and short fields. So you should only
use it for large contents (it is the same as compressing very small files with
ZIP/GZIP: these files mostly end up larger than without compression).

Uwe
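
The point about small fields can be checked with plain Java. The sketch below
does not touch Lucene; it assumes a Deflater-style (zlib) compressor, which is
an assumption about what sits behind Field.Store.COMPRESS, and the class and
method names (CompressionOverheadSketch, deflatedSize) are made up for
illustration. It compares the compressed size of a tiny value with that of a
large, repetitive body of text.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressionOverheadSketch {

    /** Returns the number of bytes produced by zlib-compressing the UTF-8 form of s. */
    static int deflatedSize(String s) {
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        // Drain the compressor; only the byte count is of interest here.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Hypothetical large field value: one sentence repeated 500 times. */
    static String largeText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("The quick brown fox jumps over the lazy dog. ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String small = "id-42";
        String large = largeText();
        System.out.println("small: " + small.length() + " bytes raw, "
                + deflatedSize(small) + " bytes compressed");
        System.out.println("large: " + large.length() + " bytes raw, "
                + deflatedSize(large) + " bytes compressed");
    }
}
```

The tiny value comes out larger than it went in because the zlib header and
checksum alone exceed its length, while the large text shrinks considerably.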

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi
Please find attached a test case plus a document.  Just to mention, this
sometimes occurs for other files too.


Cheers
Amin

On Wed, Mar 11, 2009 at 6:11 PM, markharw00d <ma...@yahoo.co.uk>wrote:

> If you can supply a Junit test that recreates the problem I think we can
> start to make progress on this.
>
>
>
> Amin Mohammed-Coleman wrote:
>
>> Hi
>>
>> Apologies for re sending this mail. Just wondering if anyone has
>> experienced the below. I'm not sure if this could happen due nature of
>> document. It does seem strange one term search returns summary while another
>> does not even though same document is being returned.
>>
>> I'm asking this so I can code around this if is normal.
>>
>>
>> Apologies again for re sending this mail
>>
>> Cheers
>>
>> Amin
>>
>> Sent from my iPhone
>>
>> On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <am...@gmail.com> wrote:
>>
>>  Hi
>>>
>>> I am seeing some strange behaviour with the highlighter and I'm wondering
>>> if anyone else is experiencing this.  In certain instances I don't get a
>>> summary being generated.  I perform the search and the search returns the
>>> correct document.  I can see that the lucene document contains the text in
>>> the field.  However after doing:
>>>
>>>    SimpleHTMLFormatter simpleHTMLFormatter = new
>>> SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
>>>            //required for highlighting
>>>            Query query2 = multiSearcher.rewrite(query);
>>>            Highlighter highlighter = new Highlighter(simpleHTMLFormatter,
>>> new QueryScorer(query2));
>>> ...
>>>
>>> String text= doc.get(FieldNameEnum.BODY.getDescription());
>>>                TokenStream tokenStream =
>>> analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new
>>> StringReader(text));
>>>                String result = highlighter.getBestFragments(tokenStream,
>>> text, 3, "...");
>>>
>>>
>>> the string result is empty.  This is very strange, if i try a different
>>> term that exists in the document then I get a summary.  For example I have a
>>> word document that contains the term "document" and "aspectj".  If I search
>>> for "document" I get the correct document but no highlighted summary.
>>>  However if I search using "aspectj" I get the same doucment with
>>> highlighted summary.
>>>
>>> Just to mentioned I do rewrite the original query before performing the
>>> highlighting.
>>>
>>> I'm not sure what i'm missing here.  Any help would be appreciated.
>>>
>>> Cheers
>>> Amin
>>>
>>> On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <am...@gmail.com>
>>> wrote:
>>> Hi
>>>
>>> Got it working!  Thanks again for your help!
>>>
>>>
>>> Amin
>>>
>>>
>>> On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <am...@gmail.com>
>>> wrote:
>>> Thanks!  The final piece that I needed to do for the project!
>>>
>>> Cheers
>>>
>>> Amin
>>>
>>> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>>> > cool.  i will use compression and store in index. is there anything
>>> > special
>>> > i need to for decompressing the text? i presume i can just do
>>> > doc.get("content")?
>>> > thanks for your advice all!
>>>
>>> No just use Field.Store.COMPRESS when adding to index and Document.get()
>>> when fetching. The decompression is automatically done.
>>>
>>> You may think, why not enable compression for all fields? The case is,
>>> that
>>> this is an overhead for very small and short fields. So you should only
>>> use
>>> it for large contents (it's the same like compressing very small files as
>>> ZIP/GZIP: These files mostly get larger than without compression).
>>>
>>> Uwe
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

I have found that it is not an issue with POI. I extracted the text using
POI in a different way and the term is extracted properly.  When I store
the text and retrieve it, the term exists. However, running the text
through the highlighter doesn't work.

I will post a test case with a plain text file on JIRA. Currently on a
cramped train!

Cheers


On 11 Mar 2009, at 18:11, markharw00d <ma...@yahoo.co.uk> wrote:

> If you can supply a Junit test that recreates the problem I think we  
> can start to make progress on this.
>
>
>
> Amin Mohammed-Coleman wrote:
>> Hi
>>
>> Apologies for re sending this mail. Just wondering if anyone has  
>> experienced the below. I'm not sure if this could happen due nature  
>> of document. It does seem strange one term search returns summary  
>> while another does not even though same document is being returned.
>>
>> I'm asking this so I can code around this if is normal.
>>
>>
>> Apologies again for re sending this mail
>>
>> Cheers
>>
>> Amin
>>
>> Sent from my iPhone
>>
>> On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <am...@gmail.com>  
>> wrote:
>>
>>> Hi
>>>
>>> I am seeing some strange behaviour with the highlighter and I'm  
>>> wondering if anyone else is experiencing this.  In certain  
>>> instances I don't get a summary being generated.  I perform the  
>>> search and the search returns the correct document.  I can see  
>>> that the lucene document contains the text in the field.  However  
>>> after doing:
>>>
>>>    SimpleHTMLFormatter simpleHTMLFormatter = new  
>>> SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
>>>            //required for highlighting
>>>            Query query2 = multiSearcher.rewrite(query);
>>>            Highlighter highlighter = new  
>>> Highlighter(simpleHTMLFormatter, new QueryScorer(query2));
>>> ...
>>>
>>> String text= doc.get(FieldNameEnum.BODY.getDescription());
>>>                TokenStream tokenStream =  
>>> analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new  
>>> StringReader(text));
>>>                String result =  
>>> highlighter.getBestFragments(tokenStream, text, 3, "...");
>>>
>>>
>>> the string result is empty.  This is very strange, if i try a  
>>> different term that exists in the document then I get a summary.   
>>> For example I have a word document that contains the term  
>>> "document" and "aspectj".  If I search for "document" I get the  
>>> correct document but no highlighted summary.  However if I search  
>>> using "aspectj" I get the same doucment with highlighted summary.
>>>
>>> Just to mentioned I do rewrite the original query before  
>>> performing the highlighting.
>>>
>>> I'm not sure what i'm missing here.  Any help would be appreciated.
>>>
>>> Cheers
>>> Amin
>>>
>>> On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
>>> > wrote:
>>> Hi
>>>
>>> Got it working!  Thanks again for your help!
>>>
>>>
>>> Amin
>>>
>>>
>>> On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
>>> > wrote:
>>> Thanks!  The final piece that I needed to do for the project!
>>>
>>> Cheers
>>>
>>> Amin
>>>
>>> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de>  
>>> wrote:
>>> > cool.  i will use compression and store in index. is there  
>>> anything
>>> > special
>>> > i need to for decompressing the text? i presume i can just do
>>> > doc.get("content")?
>>> > thanks for your advice all!
>>>
>>> No just use Field.Store.COMPRESS when adding to index and  
>>> Document.get()
>>> when fetching. The decompression is automatically done.
>>>
>>> You may think, why not enable compression for all fields? The case  
>>> is, that
>>> this is an overhead for very small and short fields. So you should  
>>> only use
>>> it for large contents (it's the same like compressing very small  
>>> files as
>>> ZIP/GZIP: These files mostly get larger than without compression).
>>>
>>> Uwe
>>>
>>>
>>> --- 
>>> ------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


Re: Lucene Highlighting and Dynamic Summaries

Posted by markharw00d <ma...@yahoo.co.uk>.

If you can supply a JUnit test that recreates the problem I think we can
start to make progress on this.



Amin Mohammed-Coleman wrote:
> Hi
>
> Apologies for re sending this mail. Just wondering if anyone has 
> experienced the below. I'm not sure if this could happen due nature of 
> document. It does seem strange one term search returns summary while 
> another does not even though same document is being returned.
>
> I'm asking this so I can code around this if is normal.
>
>
> Apologies again for re sending this mail
>
> Cheers
>
> Amin
>
> Sent from my iPhone
>
> On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <am...@gmail.com> wrote:
>
>> Hi
>>
>> I am seeing some strange behaviour with the highlighter and I'm 
>> wondering if anyone else is experiencing this.  In certain instances 
>> I don't get a summary being generated.  I perform the search and the 
>> search returns the correct document.  I can see that the lucene 
>> document contains the text in the field.  However after doing:
>>
>>     SimpleHTMLFormatter simpleHTMLFormatter = new 
>> SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
>>             //required for highlighting
>>             Query query2 = multiSearcher.rewrite(query);
>>             Highlighter highlighter = new 
>> Highlighter(simpleHTMLFormatter, new QueryScorer(query2));
>> ...
>>
>> String text= doc.get(FieldNameEnum.BODY.getDescription());
>>                 TokenStream tokenStream = 
>> analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new 
>> StringReader(text));
>>                 String result = 
>> highlighter.getBestFragments(tokenStream, text, 3, "...");
>>
>>
>> the string result is empty.  This is very strange, if i try a 
>> different term that exists in the document then I get a summary.  For 
>> example I have a word document that contains the term "document" and 
>> "aspectj".  If I search for "document" I get the correct document but 
>> no highlighted summary.  However if I search using "aspectj" I get 
>> the same doucment with highlighted summary.
>>
>> Just to mentioned I do rewrite the original query before performing 
>> the highlighting.
>>
>> I'm not sure what i'm missing here.  Any help would be appreciated.
>>
>> Cheers
>> Amin
>>
>> On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman 
>> <am...@gmail.com> wrote:
>> Hi
>>
>> Got it working!  Thanks again for your help!
>>
>>
>> Amin
>>
>>
>> On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman 
>> <am...@gmail.com> wrote:
>> Thanks!  The final piece that I needed to do for the project!
>>
>> Cheers
>>
>> Amin
>>
>> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>> > cool.  i will use compression and store in index. is there anything
>> > special
>> > i need to for decompressing the text? i presume i can just do
>> > doc.get("content")?
>> > thanks for your advice all!
>>
>> No just use Field.Store.COMPRESS when adding to index and Document.get()
>> when fetching. The decompression is automatically done.
>>
>> You may think, why not enable compression for all fields? The case 
>> is, that
>> this is an overhead for very small and short fields. So you should 
>> only use
>> it for large contents (it's the same like compressing very small 
>> files as
>> ZIP/GZIP: These files mostly get larger than without compression).
>>
>> Uwe
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>




Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

Apologies for re-sending this mail. Just wondering if anyone has
experienced the below. I'm not sure if this could happen due to the
nature of the document. It does seem strange that one term search
returns a summary while another does not, even though the same document
is being returned.

I'm asking so that I can code around this if it is normal.


Apologies again for re-sending this mail.

Cheers

Amin

Sent from my iPhone

On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <am...@gmail.com> wrote:

> Hi
>
> I am seeing some strange behaviour with the highlighter and I'm  
> wondering if anyone else is experiencing this.  In certain instances  
> I don't get a summary being generated.  I perform the search and the  
> search returns the correct document.  I can see that the lucene  
> document contains the text in the field.  However after doing:
>
> 	SimpleHTMLFormatter simpleHTMLFormatter = new  
> SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
> 			//required for highlighting
> 			Query query2 = multiSearcher.rewrite(query);
> 			Highlighter highlighter = new Highlighter(simpleHTMLFormatter,  
> new QueryScorer(query2));
> ...
>
> String text= doc.get(FieldNameEnum.BODY.getDescription());
>                 TokenStream tokenStream =  
> analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new  
> StringReader(text));
>                 String result =  
> highlighter.getBestFragments(tokenStream, text, 3, "...");
>
>
> the string result is empty.  This is very strange, if i try a  
> different term that exists in the document then I get a summary.   
> For example I have a word document that contains the term "document"  
> and "aspectj".  If I search for "document" I get the correct  
> document but no highlighted summary.  However if I search using  
> "aspectj" I get the same doucment with highlighted summary.
>
> Just to mentioned I do rewrite the original query before performing  
> the highlighting.
>
> I'm not sure what i'm missing here.  Any help would be appreciated.
>
> Cheers
> Amin
>
> On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
> > wrote:
> Hi
>
> Got it working!  Thanks again for your help!
>
>
> Amin
>
>
> On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
> > wrote:
> Thanks!  The final piece that I needed to do for the project!
>
> Cheers
>
> Amin
>
> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de>  
> wrote:
> > cool.  i will use compression and store in index. is there anything
> > special
> > i need to for decompressing the text? i presume i can just do
> > doc.get("content")?
> > thanks for your advice all!
>
> No just use Field.Store.COMPRESS when adding to index and  
> Document.get()
> when fetching. The decompression is automatically done.
>
> You may think, why not enable compression for all fields? The case  
> is, that
> this is an overhead for very small and short fields. So you should  
> only use
> it for large contents (it's the same like compressing very small  
> files as
> ZIP/GZIP: These files mostly get larger than without compression).
>
> Uwe
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi
I am seeing some strange behaviour with the highlighter and I'm wondering if
anyone else is experiencing this.  In certain instances I don't get a
summary being generated.  I perform the search and the search returns the
correct document.  I can see that the Lucene document contains the text in
the field.  However after doing:

SimpleHTMLFormatter simpleHTMLFormatter =
    new SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");

// required for highlighting
Query query2 = multiSearcher.rewrite(query);

Highlighter highlighter =
    new Highlighter(simpleHTMLFormatter, new QueryScorer(query2));

...

String text = doc.get(FieldNameEnum.BODY.getDescription());

TokenStream tokenStream = analyzer.tokenStream(
    FieldNameEnum.BODY.getDescription(), new StringReader(text));

String result = highlighter.getBestFragments(tokenStream, text, 3, "...");

the string result is empty.  This is very strange: if I try a different term
that exists in the document then I get a summary.  For example, I have a Word
document that contains the terms "document" and "aspectj".  If I search for
"document" I get the correct document but no highlighted summary.  However,
if I search using "aspectj" I get the same document with a highlighted
summary.

Just to mention, I do rewrite the original query before performing the
highlighting.

I'm not sure what I'm missing here.  Any help would be appreciated.

Cheers

Amin

On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <am...@gmail.com> wrote:

> Hi
> Got it working!  Thanks again for your help!
>
>
> Amin
>
>
> On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <am...@gmail.com> wrote:
>
>> Thanks!  The final piece that I needed to do for the project!
>> Cheers
>>
>> Amin
>>
>> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>>
>>> > cool.  i will use compression and store in index. is there anything
>>> > special
>>> > i need to for decompressing the text? i presume i can just do
>>> > doc.get("content")?
>>> > thanks for your advice all!
>>>
>>> No just use Field.Store.COMPRESS when adding to index and Document.get()
>>> when fetching. The decompression is automatically done.
>>>
>>> You may think, why not enable compression for all fields? The case is,
>>> that
>>> this is an overhead for very small and short fields. So you should only
>>> use
>>> it for large contents (it's the same like compressing very small files as
>>> ZIP/GZIP: These files mostly get larger than without compression).
>>>
>>> Uwe
>>>
>>>
>>>
>>>
>>
>

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi
Got it working!  Thanks again for your help!


Amin

On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <am...@gmail.com> wrote:

> Thanks!  The final piece that I needed to do for the project!
> Cheers
>
> Amin
>
> On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>
>> > cool.  i will use compression and store in index. is there anything
>> > special
>> > i need to for decompressing the text? i presume i can just do
>> > doc.get("content")?
>> > thanks for your advice all!
>>
>> No just use Field.Store.COMPRESS when adding to index and Document.get()
>> when fetching. The decompression is automatically done.
>>
>> You may think, why not enable compression for all fields? The case is,
>> that
>> this is an overhead for very small and short fields. So you should only
>> use
>> it for large contents (it's the same like compressing very small files as
>> ZIP/GZIP: These files mostly get larger than without compression).
>>
>> Uwe
>>
>>
>>
>>
>

Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Thanks!  The final piece that I needed to do for the project!
Cheers

Amin

On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> > cool.  i will use compression and store in index. is there anything
> > special
> > i need to for decompressing the text? i presume i can just do
> > doc.get("content")?
> > thanks for your advice all!
>
> No just use Field.Store.COMPRESS when adding to index and Document.get()
> when fetching. The decompression is automatically done.
>
> You may think, why not enable compression for all fields? The case is, that
> this is an overhead for very small and short fields. So you should only use
> it for large contents (it's the same like compressing very small files as
> ZIP/GZIP: These files mostly get larger than without compression).
>
> Uwe
>
>
>
>

RE: Lucene Highlighting and Dynamic Summaries

Posted by Uwe Schindler <uw...@thetaphi.de>.

> cool.  i will use compression and store in index. is there anything
> special
> i need to for decompressing the text? i presume i can just do
> doc.get("content")?
> thanks for your advice all!

No, just use Field.Store.COMPRESS when adding to the index and Document.get()
when fetching. The decompression is done automatically.

You may wonder why not enable compression for all fields. The reason is that
it adds overhead for very small, short fields, so you should only use it for
large contents (it is the same as compressing very small files with ZIP/GZIP:
such files mostly end up larger than without compression).

Uwe
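
The ZIP/GZIP comparison above can be checked with a minimal sketch that uses
only the JDK. It assumes GZIP as a stand-in for whatever Field.Store.COMPRESS
does internally, so Lucene's actual sizes may differ. The class name, the
gzipSize helper and the sample strings are hypothetical and only there for
illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionOverheadDemo {

    // Returns the number of bytes produced by GZIP-compressing the text.
    // GZIP is an assumption here, used as a stand-in for stored-field compression.
    public static int gzipSize(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the trailer has been written.
        return buffer.size();
    }

    public static void main(String[] args) {
        // A short field, such as a file name (hypothetical sample value).
        String shortField = "report.pdf";

        // A large, repetitive body, roughly like extracted document text.
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            body.append("Lucene highlighting builds a summary around the matched terms. ");
        }
        String largeField = body.toString();

        System.out.println("short: raw=" + shortField.length()
                + " gzip=" + gzipSize(shortField));
        System.out.println("large: raw=" + largeField.length()
                + " gzip=" + gzipSize(largeField));
    }
}
```

The short value comes out larger than its raw size because the container
header and trailer outweigh any savings, while the large body shrinks
considerably. That is the reason to reserve compression for big fields.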



Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Cool.  I will use compression and store the content in the index.  Is there
anything special I need to do to decompress the text?  I presume I can just do
doc.get("content")?
Thanks for your advice, all!

On Sat, Mar 7, 2009 at 11:50 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

> You could store the text contents compressed; I think extracting text from
> PDF files is much more time-intensive than decompressing a stored field.
> And
> text-only contents often compress very good. In my opinion, if the
> (uncompressed) contents of the docs are not very large (so I mean several
> megabytes each), I would prefer storing it in index.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
> > Sent: Saturday, March 07, 2009 12:46 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene Highlighting and Dynamic Summaries
> >
> > It depends :)
> >
> > It's a trade-off.  If storing is not prohibitive, I recommend that as
> > it makes life easier for highlighting.
> >
> >       Erik
> >
> > On Mar 7, 2009, at 6:37 AM, Amin Mohammed-Coleman wrote:
> >
> > > hi
> > > that's what i was thinking about.  i would need to get the file and
> > > extract
> > > the text again and then pass through the highlighter.  The other
> > > option is
> > > storing the content in the index the downside being index is going
> > > to be
> > > large.  Which would be the recommended approach?
> > >
> > > Cheers
> > >
> > > Amin
> > >
> > > On Sat, Mar 7, 2009 at 10:50 AM, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> > >
> > >> With the caveat that if you're not storing the text you want
> > >> highlighted,
> > >> you'll have to retrieve it somehow and send it into the Highlighter
> > >> yourself.
> > >>
> > >>       Erik
> > >>
> > >>
> > >> On Mar 7, 2009, at 5:40 AM, Michael McCandless wrote:
> > >>
> > >>
> > >>> You should look at contrib/highlighter, which does exactly this.
> > >>>
> > >>> Mike
> > >>>
> > >>> Amin Mohammed-Coleman wrote:
> > >>>
> > >>> Hi
> > >>>> I am currently indexing documents (pdf, ms word, etc) that are
> > >>>> uploaded,
> > >>>> these documents can be searched and what the search returns to
> > >>>> the user
> > >>>> are
> > >>>> summaries of the documents.  Currently the summaries are
> > >>>> extracted when
> > >>>> indexing the file (summary constructed by taking the first 10
> > >>>> lines of
> > >>>> the
> > >>>> document and stored in the index as field).  This is not ideal
> > >>>> (static
> > >>>> summary), and I was wondering if it would be possible to create a
> > >>>> dynamic
> > >>>> summary when a hit is found and highlight the terms found.  The
> > >>>> content
> > >>>> of
> > >>>> the document is not stored in the index.
> > >>>>
> > >>>> So basically what I'm looking to do is:
> > >>>>
> > >>>> 1) PDF indexed
> > >>>> 2) PDF body contains the word "search"
> > >>>> 3) Do a search and return the hit
> > >>>> 4) Construct a summary with the term "search" included.
> > >>>>
> > >>>> I'm not sure how to go about doing this (I presume it is
> > >>>> possible).  I
> > >>>> would
> > >>>> be grateful for any advice.
> > >>>>
> > >>>>
> > >>>> Cheers
> > >>>> Amin
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >>
> >
> >
>
>
>
>
>

RE: Lucene Highlighting and Dynamic Summaries

Posted by Uwe Schindler <uw...@thetaphi.de>.

You could store the text contents compressed; I think extracting text from
PDF files is much more time-intensive than decompressing a stored field. Also,
text-only contents often compress very well. In my opinion, unless the
(uncompressed) contents of the docs are very large (by which I mean several
megabytes each), I would prefer storing them in the index.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
> Sent: Saturday, March 07, 2009 12:46 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene Highlighting and Dynamic Summaries
> 
> It depends :)
> 
> It's a trade-off.  If storing is not prohibitive, I recommend that as
> it makes life easier for highlighting.
> 
> 	Erik
> 
> On Mar 7, 2009, at 6:37 AM, Amin Mohammed-Coleman wrote:
> 
> > hi
> > that's what i was thinking about.  i would need to get the file and
> > extract
> > the text again and then pass through the highlighter.  The other
> > option is
> > storing the content in the index the downside being index is going
> > to be
> > large.  Which would be the recommended approach?
> >
> > Cheers
> >
> > Amin
> >
> > On Sat, Mar 7, 2009 at 10:50 AM, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> >
> >> With the caveat that if you're not storing the text you want
> >> highlighted,
> >> you'll have to retrieve it somehow and send it into the Highlighter
> >> yourself.
> >>
> >>       Erik
> >>
> >>
> >> On Mar 7, 2009, at 5:40 AM, Michael McCandless wrote:
> >>
> >>
> >>> You should look at contrib/highlighter, which does exactly this.
> >>>
> >>> Mike
> >>>
> >>> Amin Mohammed-Coleman wrote:
> >>>
> >>> Hi
> >>>> I am currently indexing documents (pdf, ms word, etc) that are
> >>>> uploaded,
> >>>> these documents can be searched and what the search returns to
> >>>> the user
> >>>> are
> >>>> summaries of the documents.  Currently the summaries are
> >>>> extracted when
> >>>> indexing the file (summary constructed by taking the first 10
> >>>> lines of
> >>>> the
> >>>> document and stored in the index as field).  This is not ideal
> >>>> (static
> >>>> summary), and I was wondering if it would be possible to create a
> >>>> dynamic
> >>>> summary when a hit is found and highlight the terms found.  The
> >>>> content
> >>>> of
> >>>> the document is not stored in the index.
> >>>>
> >>>> So basically what I'm looking to do is:
> >>>>
> >>>> 1) PDF indexed
> >>>> 2) PDF body contains the word "search"
> >>>> 3) Do a search and return the hit
> >>>> 4) Construct a summary with the term "search" included.
> >>>>
> >>>> I'm not sure how to go about doing this (I presume it is
> >>>> possible).  I
> >>>> would
> >>>> be grateful for any advice.
> >>>>
> >>>>
> >>>> Cheers
> >>>> Amin
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> 
> 




Re: Lucene Highlighting and Dynamic Summaries

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

It depends :)

It's a trade-off.  If storing is not prohibitive, I recommend that as  
it makes life easier for highlighting.

	Erik

On Mar 7, 2009, at 6:37 AM, Amin Mohammed-Coleman wrote:

> hi
> that's what i was thinking about.  i would need to get the file and  
> extract
> the text again and then pass through the highlighter.  The other  
> option is
> storing the content in the index the downside being index is going  
> to be
> large.  Which would be the recommended approach?
>
> Cheers
>
> Amin
>
> On Sat, Mar 7, 2009 at 10:50 AM, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>
>> With the caveat that if you're not storing the text you want  
>> highlighted,
>> you'll have to retrieve it somehow and send it into the Highlighter
>> yourself.
>>
>>       Erik
>>
>>
>> On Mar 7, 2009, at 5:40 AM, Michael McCandless wrote:
>>
>>
>>> You should look at contrib/highlighter, which does exactly this.
>>>
>>> Mike
>>>
>>> Amin Mohammed-Coleman wrote:
>>>
>>> Hi
>>>> I am currently indexing documents (pdf, ms word, etc) that are  
>>>> uploaded,
>>>> these documents can be searched and what the search returns to  
>>>> the user
>>>> are
>>>> summaries of the documents.  Currently the summaries are  
>>>> extracted when
>>>> indexing the file (summary constructed by taking the first 10  
>>>> lines of
>>>> the
>>>> document and stored in the index as field).  This is not ideal  
>>>> (static
>>>> summary), and I was wondering if it would be possible to create a  
>>>> dynamic
>>>> summary when a hit is found and highlight the terms found.  The  
>>>> content
>>>> of
>>>> the document is not stored in the index.
>>>>
>>>> So basically what I'm looking to do is:
>>>>
>>>> 1) PDF indexed
>>>> 2) PDF body contains the word "search"
>>>> 3) Do a search and return the hit
>>>> 4) Construct a summary with the term "search" included.
>>>>
>>>> I'm not sure how to go about doing this (I presume it is  
>>>> possible).  I
>>>> would
>>>> be grateful for any advice.
>>>>
>>>>
>>>> Cheers
>>>> Amin
>>>>
>>>
>>>
>>>
>>
>>
>>
>>



Re: Lucene Highlighting and Dynamic Summaries

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi,
That's what I was thinking about.  I would need to get the file, extract
the text again and then pass it through the highlighter.  The other option is
storing the content in the index, the downside being that the index is going
to be large.  Which would be the recommended approach?

Cheers

Amin

On Sat, Mar 7, 2009 at 10:50 AM, Erik Hatcher <er...@ehatchersolutions.com> wrote:

> With the caveat that if you're not storing the text you want highlighted,
> you'll have to retrieve it somehow and send it into the Highlighter
> yourself.
>
>        Erik
>
>
> On Mar 7, 2009, at 5:40 AM, Michael McCandless wrote:
>
>
>> You should look at contrib/highlighter, which does exactly this.
>>
>> Mike
>>
>> Amin Mohammed-Coleman wrote:
>>
>>  Hi
>>> I am currently indexing documents (pdf, ms word, etc) that are uploaded,
>>> these documents can be searched and what the search returns to the user
>>> are
>>> summaries of the documents.  Currently the summaries are extracted when
>>> indexing the file (summary constructed by taking the first 10 lines of
>>> the
>>> document and stored in the index as field).  This is not ideal (static
>>> summary), and I was wondering if it would be possible to create a dynamic
>>> summary when a hit is found and highlight the terms found.  The content
>>> of
>>> the document is not stored in the index.
>>>
>>> So basically what I'm looking to do is:
>>>
>>> 1) PDF indexed
>>> 2) PDF body contains the word "search"
>>> 3) Do a search and return the hit
>>> 4) Construct a summary with the term "search" included.
>>>
>>> I'm not sure how to go about doing this (I presume it is possible).  I
>>> would
>>> be grateful for any advice.
>>>
>>>
>>> Cheers
>>> Amin
>>>
>>
>>
>>
>
>
>
>

Re: Lucene Highlighting and Dynamic Summaries

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

With the caveat that if you're not storing the text you want  
highlighted, you'll have to retrieve it somehow and send it into the  
Highlighter yourself.

	Erik

On Mar 7, 2009, at 5:40 AM, Michael McCandless wrote:

>
> You should look at contrib/highlighter, which does exactly this.
>
> Mike
>
> Amin Mohammed-Coleman wrote:
>
>> Hi
>> I am currently indexing documents (pdf, ms word, etc) that are  
>> uploaded,
>> these documents can be searched and what the search returns to the  
>> user are
>> summaries of the documents.  Currently the summaries are extracted  
>> when
>> indexing the file (summary constructed by taking the first 10 lines  
>> of the
>> document and stored in the index as field).  This is not ideal  
>> (static
>> summary), and I was wondering if it would be possible to create a  
>> dynamic
>> summary when a hit is found and highlight the terms found.  The  
>> content of
>> the document is not stored in the index.
>>
>> So basically what I'm looking to do is:
>>
>> 1) PDF indexed
>> 2) PDF body contains the word "search"
>> 3) Do a search and return the hit
>> 4) Construct a summary with the term "search" included.
>>
>> I'm not sure how to go about doing this (I presume it is  
>> possible).  I would
>> be grateful for any advice.
>>
>>
>> Cheers
>> Amin
>
>



Re: Lucene Highlighting and Dynamic Summaries

Posted by Michael McCandless <lu...@mikemccandless.com>.

You should look at contrib/highlighter, which does exactly this.

Mike
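
To make the idea of a dynamic summary concrete, here is a toy sketch in plain
Java. It is not the contrib/highlighter implementation and does not use any
Lucene API: it only finds the first case-insensitive occurrence of a term,
cuts a window of text around it and wraps the match in <b> tags. The class
name, method name and the context parameter are all hypothetical:

```java
public class ToySummary {

    // Builds a one-fragment summary around the first case-insensitive
    // occurrence of term, keeping up to `context` characters on each side.
    // Returns an empty string when the term is empty or does not occur.
    public static String summarize(String text, String term, int context) {
        if (term.isEmpty()) {
            return "";
        }
        // Scan for the first position where the term matches, ignoring case.
        int hit = -1;
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return "";
        }
        int matchEnd = hit + term.length();
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), matchEnd + context);

        StringBuilder summary = new StringBuilder();
        if (start > 0) {
            summary.append("...");          // text was cut off on the left
        }
        summary.append(text, start, hit);
        summary.append("<b>").append(text, hit, matchEnd).append("</b>");
        summary.append(text, matchEnd, end);
        if (end < text.length()) {
            summary.append("...");          // text was cut off on the right
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        String body = "The PDF body contains the word search in it.";
        System.out.println(summarize(body, "search", 10));
    }
}
```

The real Highlighter works on the analyzer's token stream and scores several
fragments against the query, so it copes with stemming, multiple terms and
phrase queries, which this sketch does not attempt.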

Amin Mohammed-Coleman wrote:

> Hi
> I am currently indexing documents (pdf, ms word, etc) that are  
> uploaded,
> these documents can be searched and what the search returns to the  
> user are
> summaries of the documents.  Currently the summaries are extracted  
> when
> indexing the file (summary constructed by taking the first 10 lines  
> of the
> document and stored in the index as field).  This is not ideal (static
> summary), and I was wondering if it would be possible to create a  
> dynamic
> summary when a hit is found and highlight the terms found.  The  
> content of
> the document is not stored in the index.
>
> So basically what I'm looking to do is:
>
> 1) PDF indexed
> 2) PDF body contains the word "search"
> 3) Do a search and return the hit
> 4) Construct a summary with the term "search" included.
>
> I'm not sure how to go about doing this (I presume it is possible).   
> I would
> be grateful for any advice.
>
>
> Cheers
> Amin

