Posted to dev@lucene.apache.org by "Basem Narmok (JIRA)" <ji...@apache.org> on 2009/10/09 00:52:31 UTC

[jira] Created: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Arabic Analyzer: Stopwords list needs enhancement
-------------------------------------------------

                 Key: LUCENE-1966
                 URL: https://issues.apache.org/jira/browse/LUCENE-1966
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/analyzers
    Affects Versions: 2.9.1
            Reporter: Basem Narmok
            Priority: Trivial
             Fix For: 2.9


The provided Arabic stopwords list needs some enhancements (e.g. it contains a lot of words that are not stopwords, and it needs some cleanup). A patch will be provided with this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Basem Narmok (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Basem Narmok updated LUCENE-1966:
---------------------------------

    Attachment: LUCENE-1966.patch

Robert, you are correct. To solve the problem we have two options:
1- remove words like علي and وفي
2- use an unnormalized stopwords list, applied before the normalization filter.

I think the second option is best, so this patch only modifies the list (unnormalized). Please try it.
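
To illustrate the second option, here is a minimal sketch in plain Java. It does not use Lucene's TokenStream API: the class FilterOrderSketch, its normalize helper (which only folds dotless yeh, U+0649, into yeh, U+064A) and the one-word stopword set are all assumptions made for illustration. It compares removing stopwords from the raw tokens before normalizing with normalizing first and then consulting the same unnormalized list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: not Lucene code, only an illustration of filter order.
final class FilterOrderSketch {
    static final char DOTLESS_YEH = '\u0649'; // alef maksura
    static final char YEH = '\u064A';

    static final String ALA = "\u0639\u0644\u0649"; // the preposition, ends in dotless yeh
    static final String ALI = "\u0639\u0644\u064A"; // the name, ends in yeh

    // Unnormalized stopword list: contains only the preposition.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(ALA));

    // Simplified normalization (assumption): fold dotless yeh into yeh, nothing else.
    static String normalize(String token) {
        return token.replace(DOTLESS_YEH, YEH);
    }

    // Option 2: stop filter on the raw tokens first, then normalization.
    static List<String> stopThenNormalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // The problematic order: normalize first, then look up in the unnormalized list.
    static List<String> normalizeThenStop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!STOPWORDS.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(ALA, ALI);
        System.out.println(stopThenNormalize(tokens).size()); // 1: only the name remains
        System.out.println(normalizeThenStop(tokens).size()); // 2: the stopword slipped through
    }
}
```

With the stop filter first, the preposition is matched in its original spelling and dropped, while the name is kept and only then normalized.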



[jira] Updated: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1966:
--------------------------------

    Affects Version/s:     (was: 2.9.1)
                       2.9
        Fix Version/s:     (was: 2.9)
                       3.0



[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763774#action_12763774 ] 

Robert Muir commented on LUCENE-1966:
-------------------------------------

Basem, thanks for the patch, and the comments.

One thing I noticed: if I apply the patch, على (the stopword) will not be filtered as a stopword. This is because it will be normalized to علي (the name).

So, if we are going to normalize before stopfilter, I think we need to make sure the stopwords do not contain yeh without dots, or else these will not work. This is one example of why I was scared to apply normalization before stopwords, because by doing so, we cause على and علي to conflate.

Let me know what you think about this.
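
The conflation described above can be reproduced with a small stand-in for the normalization step. This is a sketch in plain Java, assuming the normalization folds alef variants to bare alef, dotless yeh (U+0649) to yeh (U+064A) and teh marbuta to heh, and strips tatweel and the diacritics from fathatan to sukun. The class and method names are hypothetical, and this is not the source of the actual normalization filter.

```java
// Hypothetical stand-in for the normalization step; not the Lucene implementation.
final class NormalizationSketch {
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            switch (c) {
                case '\u0622': // alef with madda above
                case '\u0623': // alef with hamza above
                case '\u0625': // alef with hamza below
                    sb.append('\u0627'); // bare alef
                    break;
                case '\u0649': // dotless yeh (alef maksura)
                    sb.append('\u064A'); // yeh
                    break;
                case '\u0629': // teh marbuta
                    sb.append('\u0647'); // heh
                    break;
                case '\u0640': // tatweel: dropped
                    break;
                default:
                    // U+064B through U+0652 are fathatan through sukun: dropped
                    if (c < '\u064B' || c > '\u0652') {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ala = "\u0639\u0644\u0649"; // preposition, ends in dotless yeh
        String ali = "\u0639\u0644\u064A"; // name, ends in yeh
        // Both spellings collapse to the same token after normalization.
        System.out.println(normalize(ala).equals(normalize(ali))); // true
    }
}
```

Once both spellings map to the same token, a stop filter placed after this step cannot tell the preposition from the name.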




[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764501#action_12764501 ] 

Robert Muir commented on LUCENE-1966:
-------------------------------------

Before I commit this, I want to solicit any comments or concerns about backwards compatibility, assuming the following notice:

{noformat}
Changes in runtime behavior

 * LUCENE-1966: Modified and cleaned the default Arabic stopwords list used
   by ArabicAnalyzer. You'll need to fully re-index any previously created 
   indexes.  (Basem Narmok via Robert Muir)
{noformat}

I know contrib has no backwards-compatibility guarantee, but I just want to double-check.
Perhaps in the future someone might help fix the Persian stopwords file as well, so this may happen again :)




[jira] Updated: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Basem Narmok (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Basem Narmok updated LUCENE-1966:
---------------------------------

    Attachment: LUCENE-1966.patch
                arabic-stopwords-comments.txt

Please see arabic-stopwords-comments.txt for my comments on the list, including what I changed and why.

The patch provides an updated Arabic stopwords file, and modifies ArabicAnalyzer to filter stopwords after normalization, because the provided list contains normalized Arabic stopwords.

Best,



[jira] Assigned: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-1966:
-----------------------------------

    Assignee: Robert Muir



[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764519#action_12764519 ] 

Robert Muir commented on LUCENE-1966:
-------------------------------------

Basem, yes I think the improvements are good.

My question is really: is it OK to commit this for 3.0 or should we wait for 3.1?




[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764449#action_12764449 ] 

Robert Muir commented on LUCENE-1966:
-------------------------------------

Basem, thanks. I like the new list.

I have one very minor question: in the list we have أيضا / ايضا twice.

I wanted to check with you, is this by accident or did you have some other spellings in mind?

If it is by accident, let me know, I can just remove the duplicates before committing.



[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Basem Narmok (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764456#action_12764456 ] 

Basem Narmok commented on LUCENE-1966:
--------------------------------------

Hi Robert,

Regarding ايضا / أيضا ...

No, not by accident. I included both forms (normalized and unnormalized), because Arabic users tend to use both spellings on the internet. Another example is words like أي / اي



[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764465#action_12764465 ] 

Robert Muir commented on LUCENE-1966:
-------------------------------------

Basem, I can simply remove lines 123 & 124 if this is the case, but I did not want to do this without checking first.

The reason is, I wonder if perhaps you intended for these two to be أيضاً and ايضاً (with fathatan)



[jira] Issue Comment Edited: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764462#action_12764462 ] 

Robert Muir edited comment on LUCENE-1966 at 10/11/09 8:10 AM:
---------------------------------------------------------------

Basem, I meant: there are two entries for أيضا , and two entries for ايضا (total of four)

edit: here are the relevant line numbers from the new stopwords.txt:

Lines 72 and 73:
{noformat}
ايضا
أيضا
{noformat}

Lines 123 and 124:
{noformat}
ايضا
أيضا
{noformat}

      was (Author: rcmuir):
    Basem, I meant: there are two entries for أيضا , and two entries for ايضا (total of four)
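
A check along the following lines could catch repeated entries like these before a stopwords file is committed. This is a minimal sketch in plain Java. The class DuplicateStopwords, the method findDuplicates, and the treatment of blank lines and '#' lines as non-entries are assumptions for illustration, not part of the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports entries that occur more than once in a stopword list.
final class DuplicateStopwords {
    // Returns a map from each repeated word to the 1-based line numbers where it occurs.
    static Map<String, List<Integer>> findDuplicates(List<String> lines) {
        Map<String, List<Integer>> seen = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String word = lines.get(i).trim();
            // Assumption: blank lines and '#' comment lines are not entries.
            if (word.isEmpty() || word.startsWith("#")) {
                continue;
            }
            seen.computeIfAbsent(word, k -> new ArrayList<>()).add(i + 1);
        }
        // Keep only the words that were seen on more than one line.
        seen.values().removeIf(lineNumbers -> lineNumbers.size() < 2);
        return seen;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("\u0627\u064A\u0636\u0627"); // line 1
        lines.add("\u0623\u064A\u0636\u0627"); // line 2
        lines.add("\u0641\u064A");             // line 3, unique
        lines.add("\u0627\u064A\u0636\u0627"); // line 4, repeats line 1
        lines.add("\u0623\u064A\u0636\u0627"); // line 5, repeats line 2
        System.out.println(findDuplicates(lines).values()); // [[1, 4], [2, 5]]
    }
}
```

The comparison is on the exact spelling, so the two hamza variants are reported as separate entries rather than as duplicates of each other.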

  


[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764495#action_12764495 ] 

Robert Muir commented on LUCENE-1966:
-------------------------------------

Basem, ok! Thanks a lot for your help here. I will commit soon.



[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764462#action_12764462 ] 

Robert Muir commented on LUCENE-1966:
-------------------------------------

Basem, I meant: there are two entries for أيضا , and two entries for ايضا (total of four)




[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Basem Narmok (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764493#action_12764493 ] 

Basem Narmok commented on LUCENE-1966:
--------------------------------------

Oh, my mistake, sorry. Yes, please remove the last two on lines 123 & 124.

No, they are just duplicates of the ones on lines 72 & 73.





[jira] Resolved: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-1966.
---------------------------------

    Resolution: Fixed

Committed revision 825110.

Thanks Basem!



[jira] Commented: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

Posted by "Basem Narmok (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764515#action_12764515 ] 

Basem Narmok commented on LUCENE-1966:
--------------------------------------

Seems good.

BTW with FAST ESP we never used stopwords, as hits from stopwords get low relevancy (keywords with high number of hits = low value, low importance, so less relevant), so such hits will never get into the top results. Also, using stopwords will affect phrase search, most of the search engines avoid removing them. But, at the end it depends on the client's application, and what she really wants, as enterprise search could have very specific and different needs than Internet search.

Anyway, I am still testing the Arabic Analyzer, and I will provide you with more comments soon. But the stopwords are good for now :)
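
The remark about frequent terms getting low relevancy can be made concrete with inverse document frequency. The sketch below is plain Java and assumes the classic formula idf = 1 + ln(numDocs / (docFreq + 1)) from TF-IDF scoring. It says nothing about how FAST ESP actually ranks hits, and the class name and document counts are invented for illustration.

```java
// Hypothetical illustration of why very frequent terms contribute little to a score.
final class IdfSketch {
    // Classic inverse document frequency: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L;              // invented corpus size
        double common = idf(900_000L, numDocs); // stopword-like term, found in most documents
        double rare = idf(100L, numDocs);       // term found in only a few documents
        System.out.println(common < rare);      // true: the rare term weighs more
    }
}
```

Under this assumption a term that appears in nearly every document adds little to any one document's score, which is why leaving stopwords in the index need not push such hits into the top results.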
