Posted to java-user@lucene.apache.org by Jon Stewart <jo...@lightboxtechnologies.com> on 2013/10/14 23:11:57 UTC

PostingsHighlighter/PassageFormatter has zero matches for some results

Hello,

I've observed that when using PostingsHighlighter in Lucene 4.4, some
of the responsive documents in TopDocs have zero matches in the
associated array of Passage objects. That is, some calls to
PassageFormatter.format() receive an array in which none of the
Passage objects has any matches. I've seen this on a simple one-word
query, where the word clearly exists in the Document's text for the
field (and the Document is included in the TopDocs result set).

Any ideas?

Thanks,

Jon
-- 
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Oct 15, 2013 at 10:57 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Tue, Oct 15, 2013 at 10:11 AM, Robert Muir <rc...@gmail.com> wrote:
>> On Tue, Oct 15, 2013 at 9:59 AM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>> Well, unfortunately, this is a trap that users do hit.
>>>
>>> By requiring the user to think about the limit on creating
>>> PostingsHighlighter, he/she would think about it and realize they are
>>> in fact setting a limit.
>>>
>>> Silent limits are dangerous because you don't offhand know what's
>>> wrong / why you see nothing getting highlighted.
>>>
>>>
>>
>> I already made my argument: for 99% of use cases the defaults are
>> fine. In most cases highlighting is trying to summarize the document
>> and something that deep just doesnt contribute much (see the default
>> scoring model!). There is an optional ctor for the others doing expert
>> things to specify the length.
>>
>> I don't think we should make APIs unusable because you think XYZ is a trap.
>
> How would this make the APIs unusable?
>
> I don't think requiring the user to set the truncation (a single int
> parameter) up front is "unusable"?
>
> Instead, it's making it clear that this class silently discards tokens
> from the document, which I think is dangerous for any class to
> silently do.  The user needs to think about what to pass, and realize
> what they pass means truncation is happening.

It's a summarizer: its whole purpose is to truncate the document :)



Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 15, 2013 at 10:11 AM, Robert Muir <rc...@gmail.com> wrote:
> On Tue, Oct 15, 2013 at 9:59 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> Well, unfortunately, this is a trap that users do hit.
>>
>> By requiring the user to think about the limit on creating
>> PostingsHighlighter, he/she would think about it and realize they are
>> in fact setting a limit.
>>
>> Silent limits are dangerous because you don't offhand know what's
>> wrong / why you see nothing getting highlighted.
>>
>>
>
> I already made my argument: for 99% of use cases the defaults are
> fine. In most cases highlighting is trying to summarize the document
> and something that deep just doesnt contribute much (see the default
> scoring model!). There is an optional ctor for the others doing expert
> things to specify the length.
>
> I don't think we should make APIs unusable because you think XYZ is a trap.

How would this make the APIs unusable?

I don't think requiring the user to set the truncation (a single int
parameter) up front is "unusable"?

Instead, it's making it clear that this class silently discards tokens
from the document, which I think is dangerous for any class to
silently do.  The user needs to think about what to pass, and realize
what they pass means truncation is happening.

> Why not make DEFAULT_MAX_THREAD_STATES a required parameter to indexwriter?

I think that's quite different: that param is for optimizing how many
threads can run concurrently in IndexWriter, and there are lots of
other parameters you could tune if you want to try to speed things up.
It's not about discarding tokens, which is a change in functionality
and very different.

Long ago, IndexWriter used to do something very similar: it would
silently discard all tokens after the first 10,000 by default.  But
that was horribly trappy, and so we made it a required ctor parameter.
Now, finally, we've removed it entirely and you can use
LimitTokenCountFilter if you want to truncate before indexing.

> Hell lets make it so users have to supply all parameters to
> everything, so everything is like
> IndexWriter(int,int,int,int,int,int,int,int,int,int,int,int) and so
> on. Then you will be satisfied there are no traps, but it will be
> totally unusable.

I agree that would be unusable, but that's not what I'm proposing;
it's not so black and white.

I do agree with you that we need to keep our APIs very minimal, and
that every added parameter is an added cost.  But we need to balance
that with settings that do nasty things, like truncate tokens; I think
it's fair in such cases to consider making them an explicit choice in
the API.

Mike McCandless

http://blog.mikemccandless.com



Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Jon Stewart <jo...@lightboxtechnologies.com>.
I'm very grateful for the assistance. It'd be great to know the value
of DEFAULT_MAX_LENGTH in the documentation. I know the majority of
applications care more about precision than recall... but I know of a
lot of people using Lucene for high recall applications, too. Working
in high recall domains doesn't necessarily make us Lucene experts.

Many/most of the maximums/defaults used in Lucene can be changed and
have accessors available, which naturally highlights and documents
them to the user. PostingsHighlighter doesn't have such accessors, and
the treatment of DEFAULT_MAX_LENGTH in the javadocs is brief. I don't
know whether I just flat out missed it or assumed that
DEFAULT_MAX_LENGTH would be big enough, but, FWIW, the docs where
getNumMatches() was 0 on all Passages didn't strike me as being
particularly large.
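
For anyone else hitting this, a check along these lines could flag the
case. It is only a sketch in plain Java: the class and method are
hypothetical (not part of Lucene), the per-passage counts are assumed to
come from Passage.getNumMatches(), and the caller is assumed to know the
untruncated length of the field.

```java
/** Hypothetical helper for illustration; not part of Lucene. */
class EmptyHighlightCheck {
    /**
     * Returns true when no passage has a match and the field is longer
     * than the highlighter's limit, i.e. the matches may lie beyond it.
     */
    static boolean possiblyTruncated(int[] numMatchesPerPassage,
                                     int fieldLength, int maxLength) {
        for (int n : numMatchesPerPassage) {
            if (n > 0) {
                return false; // at least one passage was actually highlighted
            }
        }
        return fieldLength > maxLength;
    }

    public static void main(String[] args) {
        int limit = 10000; // the DEFAULT_MAX_LENGTH value reported in this thread
        System.out.println(possiblyTruncated(new int[] {0, 0}, 12007, limit)); // true
        System.out.println(possiblyTruncated(new int[] {0, 2}, 12007, limit)); // false
        System.out.println(possiblyTruncated(new int[] {0, 0}, 500, limit));   // false
    }
}
```

A true result is only a hint that the limit is worth raising; it does
not prove that the matches lie past the limit.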


Jon

On Tue, Oct 15, 2013 at 10:11 AM, Robert Muir <rc...@gmail.com> wrote:
> On Tue, Oct 15, 2013 at 9:59 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> Well, unfortunately, this is a trap that users do hit.
>>
>> By requiring the user to think about the limit on creating
>> PostingsHighlighter, he/she would think about it and realize they are
>> in fact setting a limit.
>>
>> Silent limits are dangerous because you don't offhand know what's
>> wrong / why you see nothing getting highlighted.
>>
>>
>
> I already made my argument: for 99% of use cases the defaults are
> fine. In most cases highlighting is trying to summarize the document
> and something that deep just doesnt contribute much (see the default
> scoring model!). There is an optional ctor for the others doing expert
> things to specify the length.
>
> I don't think we should make APIs unusable because you think XYZ is a trap.
>
> Why not make DEFAULT_MAX_THREAD_STATES a required parameter to indexwriter?
>
> Hell lets make it so users have to supply all parameters to
> everything, so everything is like
> IndexWriter(int,int,int,int,int,int,int,int,int,int,int,int) and so
> on. Then you will be satisfied there are no traps, but it will be
> totally unusable.



-- 
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA



Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Oct 15, 2013 at 9:59 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Well, unfortunately, this is a trap that users do hit.
>
> By requiring the user to think about the limit on creating
> PostingsHighlighter, he/she would think about it and realize they are
> in fact setting a limit.
>
> Silent limits are dangerous because you don't offhand know what's
> wrong / why you see nothing getting highlighted.
>
>

I already made my argument: for 99% of use cases the defaults are
fine. In most cases highlighting is trying to summarize the document
and something that deep just doesn't contribute much (see the default
scoring model!). There is an optional ctor for the others doing expert
things to specify the length.

I don't think we should make APIs unusable because you think XYZ is a trap.

Why not make DEFAULT_MAX_THREAD_STATES a required parameter to IndexWriter?

Hell, let's make it so users have to supply all parameters to
everything, so everything is like
IndexWriter(int,int,int,int,int,int,int,int,int,int,int,int) and so
on. Then you will be satisfied there are no traps, but it will be
totally unusable.



Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Michael McCandless <lu...@mikemccandless.com>.
Well, unfortunately, this is a trap that users do hit.

By requiring the user to think about the limit on creating
PostingsHighlighter, he/she would think about it and realize they are
in fact setting a limit.

Silent limits are dangerous because you don't offhand know what's
wrong / why you see nothing getting highlighted.



Mike McCandless

http://blog.mikemccandless.com


On Tue, Oct 15, 2013 at 9:42 AM, Robert Muir <rc...@gmail.com> wrote:
> I strongly disagree: there is no trap, its a reasonable default for
> good summarization, and the behavior is no different than the other
> highlighters here.
>
> Typically people *do* care about performance and its important to have
> a clean simple API too.
>
> In my opinion increasing this limit is very esoteric: usually
> sentences that deep do not summarize the document well.


Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Robert Muir <rc...@gmail.com>.
I strongly disagree: there is no trap, it's a reasonable default for
good summarization, and the behavior is no different than the other
highlighters here.

Typically people *do* care about performance and it's important to have
a clean simple API too.

In my opinion increasing this limit is very esoteric: usually
sentences that deep do not summarize the document well.



On Tue, Oct 15, 2013 at 9:38 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Maybe we should make the max length a required argument to
> PostingsHighlighter ctor?
>
> Because it's trappy now, since you don't realize offhand that it's
> silently enforcing a limit ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com


Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Michael McCandless <lu...@mikemccandless.com>.
Maybe we should make the max length a required argument to
PostingsHighlighter ctor?

Because it's trappy now, since you don't realize offhand that it's
silently enforcing a limit ...

Mike McCandless

http://blog.mikemccandless.com


On Tue, Oct 15, 2013 at 9:31 AM, Robert Muir <rc...@gmail.com> wrote:
> Thanks Jon. Ill add some stuff to the javadocs here to try to make it
> more obvious.


Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Robert Muir <rc...@gmail.com>.
Thanks Jon. I'll add some stuff to the javadocs here to try to make it
more obvious.

On Tue, Oct 15, 2013 at 5:54 AM, Jon Stewart
<jo...@lightboxtechnologies.com> wrote:
> Awesome, that did it! I didn't realize that DEFAULT_MAX_LENGTH was
> only 10,000. I've now upped it to 16MB (I'm not doing the usual thing
> and performance is not a particular concern).
>
> Thanks,
>
> Jon


Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Jon Stewart <jo...@lightboxtechnologies.com>.
Awesome, that did it! I didn't realize that DEFAULT_MAX_LENGTH was
only 10,000. I've now upped it to 16MB (I'm not doing the usual thing
and performance is not a particular concern).

Thanks,

Jon


On Mon, Oct 14, 2013 at 9:58 PM, Robert Muir <rc...@gmail.com> wrote:
> are your documents large?
>
> try PostingsHighlighter(int) ctor with a larger value than DEFAULT_MAX_LENGTH.
>
> sounds like the passages you see with matches are very deep into the
> document and its just hitting the default limit and returning the
> default summarization (getEmptyHighlight())
>
> otherwise, please open a JIRA issue :)



-- 
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA



Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Robert Muir <rc...@gmail.com>.
Are your documents large?

Try the PostingsHighlighter(int) ctor with a larger value than DEFAULT_MAX_LENGTH.

It sounds like the matches are very deep into the document, past the
default length limit, so the highlighter is returning the default
summarization (getEmptyHighlight()) instead.

Otherwise, please open a JIRA issue :)
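
To make the length-limit explanation concrete, here is a rough sketch in
plain Java. It is a toy model, not Lucene code: the class MaxLengthSketch,
its methods, and the constant value 10,000 are all assumptions made for
illustration. It only imitates a highlighter that looks at the first
maxLength characters of a field, which is enough to show how a document can
match the query and still yield passages with zero matches.

```java
// Toy model only: plain Java, no Lucene classes. It imitates a highlighter
// that inspects at most maxLength characters of a field's content.
final class MaxLengthSketch {

    // Assumption: stands in for PostingsHighlighter.DEFAULT_MAX_LENGTH.
    static final int ASSUMED_DEFAULT_MAX_LENGTH = 10_000;

    /** Counts occurrences of term inside the first maxLength chars of content. */
    static int matchesWithinLimit(String content, String term, int maxLength) {
        if (term.isEmpty() || maxLength <= 0) {
            return 0;
        }
        String window = content.substring(0, Math.min(content.length(), maxLength));
        int count = 0;
        int from = window.indexOf(term);
        while (from >= 0) {
            count++;
            from = window.indexOf(term, from + term.length());
        }
        return count;
    }

    /** Builds a hypothetical document: fillerChars of padding, then the term. */
    static String deepMatchDoc(int fillerChars, String term) {
        StringBuilder sb = new StringBuilder(fillerChars + term.length() + 1);
        for (int i = 0; i < fillerChars; i++) {
            sb.append('x');
        }
        return sb.append(' ').append(term).toString();
    }

    public static void main(String[] args) {
        // The only occurrence of the term sits 20,000 characters into the text.
        String doc = deepMatchDoc(20_000, "needle");
        // Limited window: the term is never seen. Prints 0.
        System.out.println(matchesWithinLimit(doc, "needle", ASSUMED_DEFAULT_MAX_LENGTH));
        // Window covering the whole document: the term is found. Prints 1.
        System.out.println(matchesWithinLimit(doc, "needle", doc.length()));
    }
}
```

With the real class, the corresponding change would be to construct the
highlighter with the PostingsHighlighter(int) ctor, passing a value at
least as large as the longest field value you expect to highlight.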




Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Jon Stewart <jo...@lightboxtechnologies.com>.
I upgraded to 4.5. Same results, unfortunately. Most docs in the
result set will have a Passage where numMatches() > 0, but some do
not. In these cases, the Passage array's length is greater than zero.


Jon





-- 
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA



Re: PostingsHighlighter/PassageFormatter has zero matches for some results

Posted by Robert Muir <rc...@gmail.com>.
Did you try the latest release? Some bugs have been fixed...

