You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Mike Sokolov (JIRA)" <ji...@apache.org> on 2011/06/23 19:41:48 UTC

[jira] [Created] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Provide limit on phrase analysis in FastVectorHighlighter
---------------------------------------------------------

Key: LUCENE-3234
URL: https://issues.apache.org/jira/browse/LUCENE-3234
Project: Lucene - Java
Issue Type: Improvement
Reporter: Mike Sokolov

With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept
less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze.

The patch includes an artifical test case that shows > 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.

With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3234:
-----------------------------------

    Attachment: LUCENE-3234.patch

Updated patch attached. I added CHANGES.txt entries for Lucene and Solr, used Integer.MAX_VALUE for the default and added @param for phraseLimit in the new constructor javadoc. Will commit soon.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054326#comment-13054326 ] 

Robert Muir commented on LUCENE-3234:
-------------------------------------

Oh I see, I think i'm nervous about testRepeatedTerms too.
Maybe we can comment it out and just mention its more of a benchmark?

The problem could be that the test is timing-based... in general a machine could suddenly get busy at any time,
especially since we run many tests in parallel, so I'm worried it could intermittently fail.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Digy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054935#comment-13054935 ] 

Digy commented on LUCENE-3234:
------------------------------

I am not sure how much it is related to this issue but there was
a similar issue in Lucene.Net.
https://issues.apache.org/jira/browse/LUCENENET-350



> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054162#comment-13054162 ] 

Mike Sokolov commented on LUCENE-3234:
--------------------------------------

I did go back and look at the original case that made me worried; in that case the "bad" document is 650K, and the matched term occurs 23000 times in it.  The search still finishes in 24 sec or so on my desktop, which isn't too bad I guess, considering.

After looking at that and measuring the change in the test case in the patch as the number of terms increase, I don't think there actually is an n^2 - just linear, but the growth is still enough that the patch has value. The test case in the patch is closely targeted at the method which takes all the time when you have large numbers of matching terms in a single document.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054014#comment-13054014 ] 

Robert Muir commented on LUCENE-3234:
-------------------------------------

yeah, you are right.. but seeing as how positions are ints too, I think it might be easier to do Integer.MAX_VALUE versus the -1 parameter.


> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114484#comment-13114484 ] 

Koji Sekiguchi commented on LUCENE-3234:
----------------------------------------

Ok, thank you, David. I opened SOLR-2794.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114233#comment-13114233 ] 

Koji Sekiguchi commented on LUCENE-3234:
----------------------------------------

It could be the suggested 5000. I don't have a persistence on it. The current default is just for back-compat.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------

    Attachment: LUCENE-3234.patch

Sure - the test is fragile.  It was just meant to illustrate the use case; not really a good unit test for regression.  The last patch has it commented.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054010#comment-13054010 ] 

Mike Sokolov commented on LUCENE-3234:
--------------------------------------

Yes, although a smaller number might be fine.  Maybe Koji will comment: I don't completely understand the scaling here, but it seemed to me that I had a case with around 2000 occurrences of a term that lead to a 15-20 sec evaluation time on my desktop.  The max value will be an int, sire, although I think the number is going to scale like positions, not offsets FWIW.


> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114472#comment-13114472 ] 

David Smiley commented on LUCENE-3234:
--------------------------------------

Koji; you didn't react to my comment that an app would be incredibly unlikely to actually depend on the unlimited behavior; thus there is no back-compat.  Again, I think it should be 5000.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055301#comment-13055301 ] 

Mike Sokolov commented on LUCENE-3234:
--------------------------------------

Thank you, Koji - it's nice to have my first patch committed!

um - one little comment; since you made the default be MAX_VALUE, there is a javadoc comment that should be updated which says it is 5000.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3234:
-----------------------------------

    Affects Version/s: 3.3
                       2.9.4
                       3.0.3
                       3.1
                       3.2
        Fix Version/s: 4.0
                       3.4

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054133#comment-13054133 ] 

Robert Muir commented on LUCENE-3234:
-------------------------------------

oh thats ok, i just meant a little tiny benchmark, hitting the nasty case that we might think might be n^2.
If the little test case does that... then that will work, just wasn't sure if it did.

either way just something to look at in the profiler, etc.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054114#comment-13054114 ] 

Robert Muir commented on LUCENE-3234:
-------------------------------------

You can change it if you don't mind. However, I think I agree it would be good to figure out if there is an n^2 here. This might have some affect on what the default value should be... ideally there is some way we could fix the n^2.

Is there a way to turn your test case into a benchmark, or do you have a separate benchmark (the example you mentioned where it blows up really bad). This could help in looking at what's going on.


> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Resolved] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved LUCENE-3234.
------------------------------------

    Resolution: Fixed

trunk: Committed revision 1139995.
3x: Committed revision 1139997.

Thanks, Mike!

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Assigned] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned LUCENE-3234:
--------------------------------------

    Assignee: Koji Sekiguchi

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054125#comment-13054125 ] 

Mike Sokolov commented on LUCENE-3234:
--------------------------------------

I don't think I can share the test documents I have - they belong to someone else.  I can look at trying to make something bad happen with the wikipedia data, but I'm curious why a benchmark is preferable to a test case? 

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054093#comment-13054093 ] 

Mike Sokolov commented on LUCENE-3234:
--------------------------------------

Yes, that makes sense to me - default to 5000, say, and set explicitly to either MAX_VALUE or -1 to get the unlimited behavior (I prefer to allow -1 since otherwise you should probably treat it as an error).  Do you want me to change the patch, or should I just leave that to the committer?

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3234:
-----------------------------------

    Attachment: LUCENE-3234.patch

Oops, wrong patch. This one is correct.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055309#comment-13055309 ] 

Koji Sekiguchi commented on LUCENE-3234:
----------------------------------------

Thank you again for checking the commit, Mike! The javadoc has been fixed.

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------

    Attachment: LUCENE-3234.patch

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054193#comment-13054193 ] 

Koji Sekiguchi commented on LUCENE-3234:
----------------------------------------

Mike, thank you for your continuous interest to FVH! Can you add the parameter for Solr, with an appropriate default value if you would like. I don't know assertTrue test in testManyRepeatedTerms() is ok, for JENKINS?

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114157#comment-13114157 ] 

David Smiley commented on LUCENE-3234:
--------------------------------------

Why is the default unlimited; shouldn't it be the suggested 5000?  I doubt it could be for backwards compatibility since I can't see how an app might depend on the unlimited behavior. I think Solr should have good defaults that protect against pathological cases. Other defaults in Lucene/Solr have such defaults, in general (e.g. hl.maxAnalyzedChars).

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054007#comment-13054007 ] 

Robert Muir commented on LUCENE-3234:
-------------------------------------

I like this tradeoff Mike, thanks!

should we consider setting some kind of absurd default like 10,000 to really prevent some pathological cases with huge documents?
We could document in CHANGES.txt that if you want the old behavior, set it to -1 or Integer.MAX_VALUE (I think we can use this here? offsets are ints?)

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------

    Attachment:     (was: LUCENE-3234.patch)

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Issue Comment Edited] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054213#comment-13054213 ] 

Mike Sokolov edited comment on LUCENE-3234 at 6/24/11 2:06 AM:
---------------------------------------------------------------

Added solr parameter hl.phraseLimit (default=5000)

Koji - I'm not sure what the issue w/assertTrue is?  It looked to me as if the test case ultimately inherits from org.junit.Assert, which defines the method?   Is there a different version of junit on Jenkins without that method?

      was (Author: sokolov):
    Added solr parameter hl.phraseLimit (default=5000)
  
> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------

    Attachment: LUCENE-3234.patch

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------

    Attachment: LUCENE-3234.patch

Added solr parameter hl.phraseLimit (default=5000)

> Provide limit on phrase analysis in FastVectorHighlighter
> ---------------------------------------------------------
>
>                 Key: LUCENE-3234
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3234
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
>            Reporter: Mike Sokolov
>            Assignee: Koji Sekiguchi
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3234.patch, LUCENE-3234.patch
>
>
> With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document.  If one is willing to accept
> less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible.  This is analogous to the Highlighter limit on the number of characters to analyze.
> The patch includes an artifical test case that shows > 1000x speedup.  In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document.  Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.
> With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org