Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2018/08/22 14:11:00 UTC

[jira] [Commented] (LUCENE-8462) New Arabic snowball stemmer

    [ https://issues.apache.org/jira/browse/LUCENE-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588901#comment-16588901 ] 

Robert Muir commented on LUCENE-8462:
-------------------------------------

This change looks great, but I'm wondering if we can produce an alternative vocabulary list for test purposes?
1. This list is *huge*. I think it may be autogenerated (morph generation or something). It causes our test data to jump from 2MB to 28MB. In general we just want a simple list to catch us if we introduce some bug. All the other languages combined are 2MB compressed; the Arabic one is 26MB compressed.
2. Unlike all the other snowball test data, this Arabic vocabulary list is explicitly labeled as GPL. I'm personally not comfortable committing GPL stuff (even test data .txt files) without asking for more guidance first. But maybe we can avoid this problem, since I don't think we want such an enormous list.
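
To illustrate the size point only, here is a minimal sketch of thinning a vocabulary list by keeping every Nth (word, expected stem) pair. The function name `sample_pairs` and the `step` parameter are hypothetical, made up for this example, and nothing here is part of the patch or of Lucene's test tooling. Sampling from the existing list would not address the GPL question in point 2, since a subset is still derived from the same data.

```python
def sample_pairs(words, stems, step):
    """Keep every `step`-th (word, expected stem) pair, preserving order.

    `words` and `stems` are parallel lists: stems[i] is the expected
    stemmer output for words[i]. All names here are hypothetical.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if len(words) != len(stems):
        raise ValueError("word and stem lists must have the same length")
    # Slicing both lists with the same stride keeps the pairs aligned.
    return list(zip(words[::step], stems[::step]))
```

A fixed stride keeps the result deterministic, so the reduced list would be reproducible; a larger `step` gives a smaller list.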

> New Arabic snowball stemmer
> ---------------------------
>
>                 Key: LUCENE-8462
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8462
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ryadh Dahimene
>            Priority: Trivial
>              Labels: Arabic, snowball, stemmer
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Added a new Arabic snowball stemmer based on [https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl]
> Also added an Arabic test dataset in `TestSnowballVocabData.zip`, taken from the snowball-data available here [https://github.com/snowballstem/snowball-data/tree/master/arabic]
> Link to the corresponding Github PR:
> [https://github.com/apache/lucene-solr/pull/439]
>  
>  


