Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2011/06/06 20:08:59 UTC

[jira] [Resolved] (SOLR-1844) CommonGramsQueryFilterFactory should read words in a comma-delimited format

     [ https://issues.apache.org/jira/browse/SOLR-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved SOLR-1844.
-------------------------------

    Resolution: Won't Fix
      Assignee: Steven Rowe

Thanks David.

> CommonGramsQueryFilterFactory should read words in a comma-delimited format
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-1844
>                 URL: https://issues.apache.org/jira/browse/SOLR-1844
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: David Smiley
>            Assignee: Steven Rowe
>            Priority: Minor
>
> CommonGramsQueryFilterFactory expects that the file(s) given to the "words" argument is a carriage-return delimited list of words.  It doesn't support comments either.  This file format should be more flexible to support comma delimited values.  I came across this because I was trying to use the sample file provided by HathiTrust:
> http://www.hathitrust.org/node/180    (named in a file new400common.txt)
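
Since the issue was resolved as Won't Fix, one workaround is to convert a comma-delimited list into the one-word-per-line layout before passing it to the "words" argument. The following is a minimal sketch, not part of Solr: the function name normalize_word_list is hypothetical, and treating lines that start with "#" as comments is an assumption about the input file, not something the issue states.

```python
def normalize_word_list(text, comment_prefix="#"):
    """Return the words in `text`, one per line, in first-seen order.

    Accepts words separated by commas and/or line breaks. Lines that start
    with `comment_prefix` are skipped (an assumed comment convention).
    """
    seen = set()
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(comment_prefix):
            continue  # skip blank lines and assumed comment lines
        for token in line.split(","):
            word = token.strip()
            if word and word not in seen:
                seen.add(word)
                words.append(word)
    return ("\n".join(words) + "\n") if words else ""


if __name__ == "__main__":
    sample = "# sample list\nthe, of,and\n\nof,to\n"
    print(normalize_word_list(sample), end="")
```

The result can be written to a new file and that file named in the "words" argument in place of the original comma-delimited one.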

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

RE: [jira] [Resolved] (SOLR-1844) CommonGramsQueryFilterFactory should read words in a comma-delimited format

Posted by "Burton-West, Tom" <tb...@umich.edu>.

Hi David,

Just curious about your use of the HathiTrust list.  I usually explain to people that it's customized to our index, and that they are probably better off making their own list based on the lists of stop words appropriate for the languages in their index (sources are listed in the blog post http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance).  If you already have an index built and are re-indexing with CommonGrams, you can also use the -t flag with HighFreqTerms.java in lucene contrib to determine the words that have the largest position lists and are therefore candidates to be added to your CommonGrams word list.  We recently ran HighFreqTerms.java against our indexes and discovered that it would be better to remove some of the less frequent foreign language stopwords and instead use some very frequent words from the index.

Tom Burton-West
www.hathitrust.org/blogs
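
The selection idea described above (prefer words that are very frequent in the index over stop words that are rare in it) can be sketched roughly as follows. This is an illustration under stated assumptions, not the HathiTrust procedure: the function name pick_common_words, the limit parameter, and the term_freqs mapping are hypothetical, and the counts are assumed to have been collected separately, for example from HighFreqTerms output (parsing that output is not shown).

```python
def pick_common_words(stopwords, term_freqs, limit):
    """Pick up to `limit` words for a CommonGrams list, most frequent first.

    stopwords  -- candidate stop words for the languages in the index
    term_freqs -- hypothetical mapping of term -> total frequency in the index
    limit      -- maximum size of the resulting word list
    """
    # Candidates are the stop words plus every term with a known frequency.
    candidates = set(stopwords) | set(term_freqs)
    # Stop words absent from the index count as frequency 0 and sort last;
    # ties are broken alphabetically so the result is deterministic.
    ranked = sorted(candidates, key=lambda w: (-term_freqs.get(w, 0), w))
    return ranked[:limit]


if __name__ == "__main__":
    stopwords = ["the", "of", "zum"]
    term_freqs = {"the": 900, "of": 700, "library": 500, "zum": 3, "rare": 1}
    print(pick_common_words(stopwords, term_freqs, 3))
```

With the made-up counts in the example, the rarely occurring stop word "zum" is dropped in favour of the frequent index term "library".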

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org