Posted to dev@lucene.apache.org by "Mark Harwood (JIRA)" <ji...@apache.org> on 2006/02/08 00:13:14 UTC

[jira] Created: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Analyzer for preventing overload of search service by queries with common terms in large indexes
------------------------------------------------------------------------------------------------

         Key: LUCENE-494
         URL: http://issues.apache.org/jira/browse/LUCENE-494
     Project: Lucene - Java
        Type: New Feature
  Components: Analysis  
    Reporter: Mark Harwood
    Priority: Minor


An analyzer, used primarily at query time, that wraps another analyzer and adds a layer of protection
by preventing very common words from being passed into queries. For very large indexes the cost
of reading TermDocs for a very common word can be high. This analyzer was created after experience with
a 38 million doc index in which one term occurred in around 50% of docs, causing TermQueries for
this term to take 2 seconds.

Use the various "addStopWords" methods in this class to automate the identification and addition of 
stop words found in an already existing index.
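
The idea above can be sketched roughly as follows. This is a minimal illustration, not the attached QueryAutoStopWordAnalyzer code: it assumes the per-term document frequencies have already been collected into a plain map instead of being read through Lucene's IndexReader, and the names AutoStopWordSketch, findStopWords, filterQueryTokens and maxPercentDocs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the attached QueryAutoStopWordAnalyzer.
class AutoStopWordSketch {

    // Collect every term that occurs in more than maxPercentDocs of the documents.
    // docFreqs maps each term to the number of documents containing it (assumed input).
    static Set<String> findStopWords(Map<String, Integer> docFreqs, int numDocs, double maxPercentDocs) {
        double maxDocFreq = numDocs * maxPercentDocs;
        Set<String> stopWords = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    // Drop stop words from the tokens produced by the wrapped analyzer, preserving order.
    static List<String> filterQueryTokens(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("the", 60);
        docFreqs.put("index", 30);
        docFreqs.put("lucene", 5);

        // With 100 documents and a 40% threshold, only "the" is treated as a stop word.
        Set<String> stopWords = findStopWords(docFreqs, 100, 0.4);
        System.out.println(filterQueryTokens(Arrays.asList("the", "lucene", "index"), stopWords));
    }
}
```

The threshold is expressed as a fraction of the index so that the same setting can be reused as the index grows; an absolute document count would work equally well for a fixed-size index.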

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558854#action_12558854 ] 

Mark Harwood commented on LUCENE-494:
-------------------------------------

I personally don't use this but others may. It was easier to solve my particular problem by adding stop words to my XSL query templates (I added support to the XMLQueryParser for the "FuzzyLikeThisQuery" tag to take stop words). This was more about ease of configuration in my particular app.

I know Nutch has something similar implemented elsewhere - maybe in the query parser.

I also had the notion that wrapping IndexReader to auto-cache TermDocs for super-popular terms using a BitSet would be a good way to avoid the IO overhead. This BitSet wouldn't help resolve positional queries, e.g. phrase/span queries, which need a TermPositions implementation, but it would work for straight TermQueries.
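
A rough sketch of that caching notion, under stated assumptions: it does not wrap a real IndexReader, the posting list for a term is assumed to arrive as a plain int[] of document ids, and PopularTermBitSetCache, maybeCache, lookup and minFraction are hypothetical names chosen for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a real version would sit behind a wrapped IndexReader.
class PopularTermBitSetCache {
    private final Map<String, BitSet> cache = new HashMap<>();
    private final int numDocs;
    private final double minFraction;

    PopularTermBitSetCache(int numDocs, double minFraction) {
        this.numDocs = numDocs;
        this.minFraction = minFraction;
    }

    // Cache the term's document ids as a BitSet only when the term is popular enough.
    // Returns true if the term was cached.
    boolean maybeCache(String term, int[] docIds) {
        if (docIds.length < numDocs * minFraction) {
            return false; // rare term: reading its postings directly is cheap
        }
        BitSet bits = new BitSet(numDocs);
        for (int docId : docIds) {
            bits.set(docId);
        }
        cache.put(term, bits);
        return true;
    }

    // Returns the cached BitSet for a term, or null if the term was never cached.
    BitSet lookup(String term) {
        return cache.get(term);
    }

    public static void main(String[] args) {
        PopularTermBitSetCache cache = new PopularTermBitSetCache(10, 0.5);
        cache.maybeCache("the", new int[] {0, 1, 2, 4, 6, 8});
        cache.maybeCache("rare", new int[] {3});
        System.out.println(cache.lookup("the").get(4));
        System.out.println(cache.lookup("rare"));
    }
}
```

A BitSet stores one bit per document, so for a 38 million doc index each cached term would cost roughly 4.75 MB. It records only which documents match, which is why positional queries could not be answered from it.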



> Analyzer for preventing overload of search service by queries with common terms in large indexes
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Mark Harwood
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: QueryAutoStopWordAnalyzer.java, QueryAutoStopWordAnalyzerTest.java
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved LUCENE-494.
------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.4

Committed, thanks Mark!

> Analyzer for preventing overload of search service by queries with common terms in large indexes
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Mark Harwood
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: QueryAutoStopWordAnalyzer.java, QueryAutoStopWordAnalyzerTest.java
>
>


[jira] Assigned: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-494:
--------------------------------------

    Assignee: Grant Ingersoll

> Analyzer for preventing overload of search service by queries with common terms in large indexes
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Mark Harwood
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: QueryAutoStopWordAnalyzer.java, QueryAutoStopWordAnalyzerTest.java
>
>


[jira] Updated: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-494:
-----------------------------------

    Affects Version/s: 2.4

I think it makes sense to add this in after the 2.3 release.

> Analyzer for preventing overload of search service by queries with common terms in large indexes
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Mark Harwood
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: QueryAutoStopWordAnalyzer.java, QueryAutoStopWordAnalyzerTest.java
>
>


[jira] Updated: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-494?page=all ]

Mark Harwood updated LUCENE-494:
--------------------------------

    Attachment: QueryAutoStopWordAnalyzerTest.java

> Analyzer for preventing overload of search service by queries with common terms in large indexes
> ------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-494
>          URL: http://issues.apache.org/jira/browse/LUCENE-494
>      Project: Lucene - Java
>         Type: New Feature
>   Components: Analysis
>     Reporter: Mark Harwood
>     Priority: Minor
>  Attachments: QueryAutoStopWordAnalyzer.java, QueryAutoStopWordAnalyzerTest.java
>


[jira] Commented: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557785#action_12557785 ] 

Grant Ingersoll commented on LUCENE-494:
----------------------------------------

This seems generally useful and could go in contrib/analysis I suppose.  Any thoughts on it, Mark, in hindsight?  Do you still use it from time to time or do you now think there are better ways of doing it?

> Analyzer for preventing overload of search service by queries with common terms in large indexes
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Mark Harwood
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: QueryAutoStopWordAnalyzer.java, QueryAutoStopWordAnalyzerTest.java
>
>


[jira] Updated: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-494?page=all ]

Mark Harwood updated LUCENE-494:
--------------------------------

    Attachment: QueryAutoStopWordAnalyzer.java

> Analyzer for preventing overload of search service by queries with common terms in large indexes
> ------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-494
>          URL: http://issues.apache.org/jira/browse/LUCENE-494
>      Project: Lucene - Java
>         Type: New Feature
>   Components: Analysis
>     Reporter: Mark Harwood
>     Priority: Minor
>  Attachments: QueryAutoStopWordAnalyzer.java
>
