Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/11/09 04:01:59 UTC

Re: [lucy-dev] Bundling Snowball

On Fri, Oct 29, 2010 at 09:54:45PM -0400, Peter Karman wrote:
> > The build will slow down some because of the extra compilation for the
> > stemming files.  I'm tempted to add a "semiclean" build target (or something
> > like that) which would leave the charmony.h config file and compiled
> > dependencies like Snowball intact.
> 
> +1

The "semiclean" build target has been added.  I opened
<https://issues.apache.org/jira/browse/LUCY-125> for bundling the Snowball
stemming library.  A separate issue will follow for bundling the stoplists.

Marvin Humphrey


Re: [lucy-dev] Bundling Snowball

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Nov 08, 2010 at 07:01:59PM -0800, Marvin Humphrey wrote:
> The "semiclean" build target has been added.  I opened
> <https://issues.apache.org/jira/browse/LUCY-125> for bundling the Snowball
> stemming library.  A separate issue will follow for bundling the stoplists.

I've opened <https://issues.apache.org/jira/browse/LUCY-129> for bundling the
Snowball stoplists.

Once that's committed, our only non-Perl-core CPAN dependencies will be
JSON::XS and Parse::RecDescent.

Marvin Humphrey


Re: [lucy-dev] Bundling Snowball

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 10, 2010 at 12:10:51PM -0500, Robert Muir wrote:
> One more note that I forgot to mention: in snowball's svn (but i think
> not in the libstemmer pkg) there is actually vocabulary test data:
> input files containing a sample vocabulary for each language, expected
> output, and combined files called 'diffs' that show what the stemmer
> changes.
> 
> these provide pretty good coverage for tests to ensure your
> integration is working... when they make a change to the algorithms
> these are updated too (though it seems not always in the same commit):

I reopened <https://issues.apache.org/jira/browse/LUCY-125> to add tests based
on the Snowball vocabulary materials.

Those "diff" files are quite large.  Instead of including them all, I just
extracted a sampling of 10 words per language.  That's enough to verify that
our Stemmer Analyzer is at least working for each language, and in my view
it's not necessary to run the full battery of Snowball vocab tests within the
Lucy test suite.
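The sampling step described above can be sketched as follows. This is an illustrative Python sketch, not Lucy's actual tooling (which lives in update_snowstem.pl); it assumes the two-column "word / stem" layout of the diffs files in the Snowball repository's data/ directory, and takes the first N pairs rather than a random sample so the result is reproducible.

```python
# Hypothetical sketch: extract a small per-language sample from a
# Snowball "diffs" file.  Assumes each meaningful line holds exactly
# two whitespace-separated columns: the input word and its stem.

def sample_diffs(diffs_text, n=10):
    """Return the first n (word, stem) pairs parsed from a diffs file."""
    pairs = []
    for line in diffs_text.splitlines():
        parts = line.split()
        if len(parts) == 2:          # skip blank or malformed lines
            pairs.append((parts[0], parts[1]))
        if len(pairs) == n:
            break
    return pairs

# A tiny made-up fixture in the assumed format:
fixture = """\
aalen       aal
aas         aas
abarbeiten  abarbeit
"""
print(sample_diffs(fixture, n=2))
# -> [('aalen', 'aal'), ('aas', 'aas')]
```

Ten such pairs per language are enough to catch a wholesale integration failure (wrong language wired up, broken charset handling) without importing the full multi-megabyte vocabulary files.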

Marvin Humphrey

Re: [lucy-dev] Bundling Snowball

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 10, 2010 at 12:10:51PM -0500, Robert Muir wrote:
> One more note that I forgot to mention: in snowball's svn (but i think not
> in the libstemmer pkg) there is actually vocabulary test data: input files
> containing a sample vocabulary for each language, expected output, and
> combined files called 'diffs' that show what the stemmer changes.
> 
> these provide pretty good coverage for tests to ensure your
> integration is working... when they make a change to the algorithms
> these are updated too (though it seems not always in the same commit):
> 
> example: http://svn.tartarus.org/snowball/trunk/data/german/diffs.txt?r1=527&r2=526&pathrev=527

I used this sample data to prepare tests for the Lingua::Stem::Snowball CPAN
distribution.  Now that we are bundling the Snowball C libraries, we are no
longer benefitting by proxy from that test suite, and we should roll our own
tests.

Yesterday, I adapted the update_snowstem.pl script in
<https://issues.apache.org/jira/browse/LUCY-125> to work off of an svn
checkout of Snowball; I committed the patches and closed the issue this
morning.

Now I'll go add test data generation to update_snowstem.pl's capabilities and
add new test files for each language to validate that our stemmers work
properly.

Thanks for bringing it up!

Marvin Humphrey


Re: [lucy-dev] Bundling Snowball

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Nov 9, 2010 at 3:53 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Tue, Nov 09, 2010 at 04:51:33AM -0500, Robert Muir wrote:
>> Some quick notes, from lucene-java:
>

One more note that I forgot to mention: in snowball's svn (but i think
not in the libstemmer pkg) there is actually vocabulary test data:
input files containing a sample vocabulary for each language, expected
output, and combined files called 'diffs' that show what the stemmer
changes.

these provide pretty good coverage for tests to ensure your
integration is working... when they make a change to the algorithms
these are updated too (though it seems not always in the same commit):

example: http://svn.tartarus.org/snowball/trunk/data/german/diffs.txt?r1=527&r2=526&pathrev=527

Re: [lucy-dev] Bundling Snowball

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Nov 09, 2010 at 04:51:33AM -0500, Robert Muir wrote:
> Some quick notes, from lucene-java:

Thanks, Robert!  The Lucene analysis components have really tightened up since
you got involved, and I'm pleased that Lucy will get to benefit from your
hard-won knowledge as well.

> * are you going to do svn checkouts for bundling snowball? 

Good plan.  Debian does something similar:

    http://thread.gmane.org/gmane.comp.search.snowball/1191

I'd been working off libstemmer_c.tgz, but to document my actions and make
them repeatable, I've written a script which transforms the content of a
source dir (right now the expanded libstemmer_c dir) into the form that we
need.

That script should probably be changed to operate off of an svn checkout from
the Snowball repository -- or perhaps multiple svn checkouts.  That way we can
1) document exactly what revision of the Snowball code we've imported, and 2)
get the most up-to-date and complete complement of languages.
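The provenance step in point 1 could look something like this. A hedged sketch only: the function name is made up, and a real import script would shell out to `svn info` on the checkout rather than take its output as a string.

```python
# Illustrative sketch: pull the revision number out of `svn info`
# output so an import script can record exactly which Snowball
# revision was bundled.

import re

def parse_svn_revision(svn_info_output):
    """Extract the numeric revision from `svn info` text."""
    match = re.search(r"^Revision:\s*(\d+)", svn_info_output, re.MULTILINE)
    if match is None:
        raise ValueError("no Revision line found in svn info output")
    return int(match.group(1))

sample = ("Path: snowball\n"
          "URL: http://svn.tartarus.org/snowball/trunk\n"
          "Revision: 527\n")
print(parse_svn_revision(sample))  # -> 527
```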

> I don't think they are really releasing anymore, but there are in fact new
> languages, etc in svn.

It's seriously a pain that the Snowball folks don't do numbered releases.  :(

Back in 2007, Richard Boulton discussed adding revision info to the
libstemmer.h interface which would allow you to track the stemmer version, but
it doesn't look like they ever got around to it.

> * every so often snowball makes changes to the rules for the
> languages.. this can be tricky depending on how you handle backwards
> compatibility. In lucene java we have a checkout of revision 502, but
> then with the newer languages added (Armenian, Catalan, Basque)... if
> we fully 'svn updated' to the latest rev it would change things about
> german stemming from our previous release, for example, and be a
> hassle for people who created indexes with those older versions.

The easy answer is to freeze the stemmer version for each language, as you've
done.  Basically, import once, do it right the first time, and don't update
ever again.  Or at least understand that you are potentially breaking back
compat if you ever update a language.
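One way to make that freeze explicit is a per-language pin table that the import script consults instead of blindly updating everything to HEAD. The languages and revision numbers below are placeholders, not the revisions Lucy or Lucene actually use:

```python
# Illustrative pin table: each language is imported at a recorded
# Snowball revision and never silently updated past it.

FROZEN_REVISIONS = {
    "german":  502,   # frozen; updating would alter stems in old indexes
    "english": 502,
    "basque":  527,   # newer language, imported later
}

def pinned_revision(language):
    try:
        return FROZEN_REVISIONS[language]
    except KeyError:
        raise ValueError("no frozen revision recorded for %r" % language)

print(pinned_revision("german"))  # -> 502
```

The point of the table is that bumping a pin becomes a deliberate, reviewable act with known back-compat consequences, rather than a side effect of `svn update`.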

> * when bundling the stoplists: there are some languages, even
> "released" ones (Turkish, Romanian, etc) that don't have
> snowball-included stoplists. if you want, you could use the ones we
> have in lucene to provide stoplists for these languages...
> 
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt
> 
> these are of variable quality: the ones with source information in the
> header mean that I found one clearly marked with BSD or Apache.
> If they have no header, it means i made them myself... it might seem
> absurd to worry about "licensing" for stopwords, but you never know :)

That would be handy, as it would allow us to build a
tokenizer/stopalizer/stemmer stack for each supported Snowball language. 
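The shape of such a stack can be sketched in miniature. Everything here is a stand-in: the regex tokenizer, the three-word stoplist, and the trivial suffix-stripping "stemmer" are toys, where a real analyzer chain would use the bundled stoplist and call the Snowball stemmer for the target language.

```python
# Toy sketch of a tokenizer -> stopalizer -> stemmer chain.

import re

STOPLIST = {"the", "a", "and"}            # placeholder stoplist

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def stopalize(tokens):
    return [t for t in tokens if t not in STOPLIST]

def stem(token):
    # Stand-in for Snowball: strip a single plural "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze(text):
    return [stem(t) for t in stopalize(tokenize(text))]

print(analyze("The stemmers and the stoplists"))
# -> ['stemmer', 'stoplist']
```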

Marvin Humphrey


Re: [lucy-dev] Bundling Snowball

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 8, 2010 at 10:01 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> The "semiclean" build target has been added.  I opened
> <https://issues.apache.org/jira/browse/LUCY-125> for bundling the Snowball
> stemming library.  A separate issue will follow for bundling the stoplists.
>

Some quick notes, from lucene-java:
* are you going to do svn checkouts for bundling snowball? I don't
think they are really releasing anymore, but there are in fact new
languages, etc in svn.
* every so often snowball makes changes to the rules for the
languages.. this can be tricky depending on how you handle backwards
compatibility. In lucene java we have a checkout of revision 502, but
then with the newer languages added (Armenian, Catalan, Basque)... if
we fully 'svn updated' to the latest rev it would change things about
german stemming from our previous release, for example, and be a
hassle for people who created indexes with those older versions.
* when bundling the stoplists: there are some languages, even
"released" ones (Turkish, Romanian, etc) that don't have
snowball-included stoplists. if you want, you could use the ones we
have in lucene to provide stoplists for these languages...

http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt

these are of variable quality: the ones with source information in the
header mean that I found one clearly marked with BSD or Apache.
If they have no header, it means i made them myself... it might seem
absurd to worry about "licensing" for stopwords, but you never know :)