Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/12/05 22:38:14 UTC

[lucy-dev] Re: [lucy-commits] svn commit: r1210630 - in /incubator/lucy/branches/LUCY-196-uax-tokenizer: core/Lucy/Analysis/PolyAnalyzer.c core/Lucy/Analysis/PolyAnalyzer.cfh perl/lib/Lucy/Analysis/PolyAnalyzer.pm

Hi, Nick,

Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!

On Mon, Dec 05, 2011 at 09:02:42PM -0000, nwellnhof@apache.org wrote:
>  PolyAnalyzer*
>  PolyAnalyzer_new(const CharBuf *language, VArray *analyzers) {
> @@ -43,7 +43,7 @@ PolyAnalyzer_init(PolyAnalyzer *self, co
>      else if (language) {
>          self->analyzers = VA_new(3);
>          VA_Push(self->analyzers, (Obj*)CaseFolder_new());
> -        VA_Push(self->analyzers, (Obj*)RegexTokenizer_new(NULL));
> +        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
>          VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
>      }

This will cause a backwards compatibility break.  I really want to make your
StandardTokenizer the default, but I think we might want to go about it
differently.

How about we leave PolyAnalyzer alone, but add a new class called
"EasyAnalyzer", with the following default stack:

    1. StandardTokenizer
    2. Normalizer
    3. SnowballStemmer

This integrates both of your recent contributions, plus changes the order to
avoid the Highlighter problems you identified and to be more in line with the
potential refactoring you talked about.
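
To make this concrete, here's a rough sketch of what the init function for such
a class might look like, modeled on the PolyAnalyzer code above -- the
EasyAnalyzer name, the struct layout, and the Normalizer constructor arguments
are all placeholders at this point:

    EasyAnalyzer*
    EasyAnalyzer_init(EasyAnalyzer *self, const CharBuf *language) {
        Analyzer_init((Analyzer*)self);
        // Hypothetical default stack: tokenize, then normalize, then stem.
        self->analyzers = VA_new(3);
        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
        VA_Push(self->analyzers, (Obj*)Normalizer_new());  // ctor args TBD
        VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
        return self;
    }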

It would be nice to benchmark this just to see what sort of performance impact
changing the order has before we finalize it.

If this works out, we can then swap out PolyAnalyzer for EasyAnalyzer
throughout the tutorial and other high-level documentation.

Marvin Humphrey


Re: [lucy-dev] Re: [lucy-commits] svn commit: r1210630 - in /incubator/lucy/branches/LUCY-196-uax-tokenizer: core/Lucy/Analysis/PolyAnalyzer.c core/Lucy/Analysis/PolyAnalyzer.cfh perl/lib/Lucy/Analysis/PolyAnalyzer.pm

Posted by Nick Wellnhofer <we...@aevum.de>.
On 05/12/2011 22:38, Marvin Humphrey wrote:
> Hi, Nick,
>
> Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!
>
> On Mon, Dec 05, 2011 at 09:02:42PM -0000, nwellnhof@apache.org wrote:
>>   PolyAnalyzer*
>>   PolyAnalyzer_new(const CharBuf *language, VArray *analyzers) {
>> @@ -43,7 +43,7 @@ PolyAnalyzer_init(PolyAnalyzer *self, co
>>       else if (language) {
>>           self->analyzers = VA_new(3);
>>           VA_Push(self->analyzers, (Obj*)CaseFolder_new());
>> -        VA_Push(self->analyzers, (Obj*)RegexTokenizer_new(NULL));
>> +        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
>>           VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
>>       }
>
> This will cause a backwards compatibility break.  I really want to make your
> StandardTokenizer the default, but I think we might want to go about it
> differently.

I made that change mainly to see if the test suite breaks (and it 
didn't). I plan to revert it before committing StandardTokenizer to trunk.

> How about we leave PolyAnalyzer alone, but add a new class called
> "EasyAnalyzer", with the following default stack:
>
>      1. StandardTokenizer
>      2. Normalizer
>      3. SnowballStemmer
>
> This integrates both of your recent contributions, plus changes the order to
> avoid the Highlighter problems you identified and to be more in line with the
> potential refactoring you talked about.
>
> It would be nice to benchmark this just to see what sort of performance impact
> changing the order has before we finalize it.
>
> If this works out, we can then swap out PolyAnalyzer for EasyAnalyzer
> throughout the tutorial and other high-level documentation.

Sounds like a good idea.

Nick

Re: [lucy-dev] StandardTokenizer has landed

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Dec 6, 2011 at 8:45 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> What I still want to do is to incorporate the word break test cases from the
> Unicode website:
>
> http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt
>

we use a script that generates a unit test from this file... maybe you
can reuse some of the code for your purposes:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/test/org/apache/lucene/analysis/core/generateJavaUnicodeWordBreakTest.pl

-- 
lucidimagination.com

Re: [lucy-dev] StandardTokenizer has landed

Posted by Nick Wellnhofer <we...@aevum.de>.
On 08/12/11 02:59, Marvin Humphrey wrote:
> I think this is OK.  For efficiency reasons, we do not want to require UTF-8
> validity at the granularity of the Token during analysis.  How about we
> establish these rules for the analysis phase?
>
>    * Input to the analysis chain must be valid UTF-8.
>    * Analyzers must be prepared to encounter broken UTF-8 but may either throw
>      an exception or produce junk.
>    * Broken UTF-8 emitted by an analysis chain should be detected prior to
>      Indexer commit.

Sounds reasonable.

>> 2. If there's invalid UTF-8 near the end of the input buffer, we might
>> read up to three bytes past the end of the buffer.
>
> I think this is OK, too.  First, this is only a problem for broken analysis
> chains.  Second, the typical outcome will be a token with a small amount of
> random bogus content, and the Indexer will probably throw an exception prior
> to commit anyway rather than leak the content into the index.

But reading past the end of the buffer might cause a segfault. So if we 
want to follow the rules above, we should guard against that.
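
A minimal sketch of the kind of guard I mean -- the function name and shape are
made up, this is not the actual StandardTokenizer code:

    #include <stddef.h>
    #include <stdint.h>

    // Sketch: return the number of bytes in the UTF-8 sequence starting at
    // 'ptr', or 0 if the lead byte is invalid or the sequence would run past
    // 'end'.  The caller can then bail out instead of reading past the buffer.
    static size_t
    S_safe_seq_length(const uint8_t *ptr, const uint8_t *end) {
        uint8_t lead = *ptr;
        size_t  len;
        if      (lead < 0x80) { len = 1; }
        else if (lead < 0xC2) { len = 0; }  // continuation byte or overlong lead
        else if (lead < 0xE0) { len = 2; }
        else if (lead < 0xF0) { len = 3; }
        else if (lead < 0xF5) { len = 4; }
        else                  { len = 0; }  // invalid lead byte
        if (len == 0 || (size_t)(end - ptr) < len) {
            return 0;                       // invalid or truncated -- don't decode
        }
        return len;
    }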

Nick



Re: [lucy-dev] StandardTokenizer has landed

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Dec 07, 2011 at 03:44:13PM +0100, Nick Wellnhofer wrote:
> I can see only two bad things that can happen with invalid UTF-8:
>
> 1. The tokenizer doesn't detect invalid UTF-8, so it will pass it to  
> other analyzers, possibly creating even more invalid UTF-8.

I think this is OK.  For efficiency reasons, we do not want to require UTF-8
validity at the granularity of the Token during analysis.  How about we
establish these rules for the analysis phase?

  * Input to the analysis chain must be valid UTF-8.
  * Analyzers must be prepared to encounter broken UTF-8 but may either throw
    an exception or produce junk.
  * Broken UTF-8 emitted by an analysis chain should be detected prior to
    Indexer commit.

For the record, we currently perform a UTF-8 validity check on individual
terms within PostingPool.c (during the CB_Mimic_Str() invocations, which
perform internal UTF-8 sanity checking).  This is the right phase for the
check, IMO -- it's after the terms have been sorted and uniqued, so we perform
the validity check once per unique term rather than e.g. once per Token if we
were to enforce UTF-8 validity at the end of the analysis chain.
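
For readers who haven't seen one, such a sanity check boils down to roughly the
following -- a generic sketch of a strict UTF-8 validator, not the actual
CB_Mimic_Str() internals:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    // Generic sketch: rejects truncated sequences, overlong encodings,
    // surrogates, and code points above U+10FFFF.
    static bool
    S_utf8_valid(const uint8_t *ptr, size_t size) {
        const uint8_t *end = ptr + size;
        while (ptr < end) {
            uint8_t  lead = *ptr++;
            size_t   cont;
            uint32_t cp;
            if      (lead < 0x80) { continue; }                  // ASCII
            else if (lead < 0xC2) { return false; }              // stray/overlong
            else if (lead < 0xE0) { cont = 1; cp = lead & 0x1F; }
            else if (lead < 0xF0) { cont = 2; cp = lead & 0x0F; }
            else if (lead < 0xF5) { cont = 3; cp = lead & 0x07; }
            else                  { return false; }              // invalid lead
            if ((size_t)(end - ptr) < cont) { return false; }    // truncated
            for (size_t i = 0; i < cont; i++) {
                uint8_t byte = *ptr++;
                if ((byte & 0xC0) != 0x80) { return false; }     // not a continuation
                cp = (cp << 6) | (byte & 0x3F);
            }
            if (cont == 2 && cp < 0x800)      { return false; }  // overlong
            if (cont == 3 && cp < 0x10000)    { return false; }  // overlong
            if (cp >= 0xD800 && cp <= 0xDFFF) { return false; }  // surrogate
            if (cp > 0x10FFFF)                { return false; }  // out of range
        }
        return true;
    }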

> 2. If there's invalid UTF-8 near the end of the input buffer, we might  
> read up to three bytes past the end of the buffer.

I think this is OK, too.  First, this is only a problem for broken analysis
chains.  Second, the typical outcome will be a token with a small amount of
random bogus content, and the Indexer will probably throw an exception prior
to commit anyway rather than leak the content into the index.

Marvin Humphrey


Re: [lucy-dev] StandardTokenizer has landed

Posted by Nick Wellnhofer <we...@aevum.de>.
On 07/12/2011 03:56, Marvin Humphrey wrote:
> I also wanted to double check what happens when invalid UTF-8 shows up.  It
> looks like the masking that's in place would force any bogus header bytes
> positioned as continuation bytes to be evaluated safely, so no problem there.
>
> The one thing that isn't clear to me is that it's impossible to overshoot the
> end of the compressed lookup table arrays.  I see that we're covered as far as
> the plane_index table goes:
>
>          if (plane_index>= WB_PLANE_MAP_SIZE) { return 0; }
>          plane_id  = wb_plane_map[plane_index];
>
> There aren't boundary checks for the other tables,

The other tables don't need boundary checks because they're indexed using
ids taken from another table, and those ids are all safe to use.

I can see only two bad things that can happen with invalid UTF-8:

1. The tokenizer doesn't detect invalid UTF-8, so it will pass it to 
other analyzers, possibly creating even more invalid UTF-8.

2. If there's invalid UTF-8 near the end of the input buffer, we might 
read up to three bytes past the end of the buffer.

> but I see that you defined
> a bunch of size-related constants in the autogenerated WordBreak.tab file
> which haven't yet been used:
>
>    #define WB_PLANES_SHIFT 6
>    #define WB_PLANES_MASK  63
>    #define WB_PLANES_SIZE  1472
>
> Perhaps you were already planning to add stuff like this eventually?
>
>    #if (WB_ASCII_SIZE<  128)
>      #error "ASCII word break table too small"
>    #endif

The tables used to have different shift and mask values, hence the SHIFT
and MASK defines. Now they're fixed at 6 bits and the defines are unused.

Nick

Re: [lucy-dev] StandardTokenizer has landed

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Dec 06, 2011 at 10:43:02PM +0100, Nick Wellnhofer wrote:
>> I'm pretty sure I can tease that lookup code apart if I try, but it would be
>> easier with narration -- and it will certainly be easier for people who
>> haven't done much hard-core UTF-8 and Unicode coding.
>
> I will try to make that code more descriptive.

Your improvements are great!

I now understand what this commit did:

    URL: http://svn.apache.org/viewvc?rev=1210722&view=rev
    Log:
    Optimize word break property lookup

    UTF-8 decoding and property lookup can be done in one pass. If we use
    index tables with a shift of 6 bits, the two passes align very nicely.
    We also optimize for ASCII characters, so this should be really fast.

I can now see where the UTF-8 decoding logic ends and the table lookup begins.
It looks like your table lookup code is indeed branchless, and I can't suggest
any way to optimize it further.
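
For readers who haven't run into the trick before, here's a simplified sketch of
how the 6-bit split lets the stage indices fall straight out of the UTF-8 bytes.
The table names and sizes are hypothetical, not the actual StandardTokenizer
code, and it assumes valid UTF-8 input per the rules discussed elsewhere in this
thread:

    #include <stdint.h>

    // Hypothetical generated arrays with a 6-bit split -- not Lucy's actual
    // tables or names.
    extern const uint8_t ascii_props[128]; // flat table for the ASCII fast path
    extern const uint8_t plane_map[];      // stage 1, indexed by cp >> 12
    extern const uint8_t row_map[];        // stage 2, (plane_id << 6) + row
    extern const uint8_t props[];          // stage 3, (row_id << 6) + column
    #define PLANE_MAP_SIZE 272             // made-up size for the sketch

    static int
    S_wb_lookup_sketch(const uint8_t *ptr) {
        uint8_t  lead = ptr[0];
        uint32_t plane, row, col;

        if (lead < 0x80) {                  // ASCII fast path
            return ascii_props[lead];
        }
        else if (lead < 0xE0) {             // 110rrrrr 10cccccc
            plane = 0;
            row   = lead & 0x1F;
            col   = ptr[1] & 0x3F;
        }
        else if (lead < 0xF0) {             // 1110pppp 10rrrrrr 10cccccc
            plane = lead & 0x0F;
            row   = ptr[1] & 0x3F;
            col   = ptr[2] & 0x3F;
        }
        else {                              // 11110ppp 10pppppp 10rrrrrr 10cccccc
            plane = ((lead & 0x07) << 6) | (ptr[1] & 0x3F);
            row   = ptr[2] & 0x3F;
            col   = ptr[3] & 0x3F;
        }

        if (plane >= PLANE_MAP_SIZE) { return 0; }  // unassigned -> no property
        // Continuation bytes carry 6 payload bits each, so the stage indices
        // come straight out of the bytes -- no separate decoding pass needed.
        return props[(row_map[(plane_map[plane] << 6) + row] << 6) + col];
    }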

I also wanted to double check what happens when invalid UTF-8 shows up.  It
looks like the masking that's in place would force any bogus header bytes
positioned as continuation bytes to be evaluated safely, so no problem there.

The one thing that isn't clear to me is that it's impossible to overshoot the
end of the compressed lookup table arrays.  I see that we're covered as far as
the plane_index table goes:

        if (plane_index >= WB_PLANE_MAP_SIZE) { return 0; }
        plane_id  = wb_plane_map[plane_index];

There aren't boundary checks for the other tables, but I see that you defined
a bunch of size-related constants in the autogenerated WordBreak.tab file
which haven't yet been used:

  #define WB_PLANES_SHIFT 6
  #define WB_PLANES_MASK  63
  #define WB_PLANES_SIZE  1472

Perhaps you were already planning to add stuff like this eventually?

  #if (WB_ASCII_SIZE < 128)
    #error "ASCII word break table too small"
  #endif

> It isn't really made clear but the whole process can again be applied to  
> the first-stage table so you end up with three tables. If you use a  
> 16-bit and an 8-bit split, you can think of Unicode planes and rows.

Perfect, got it!

This comment was especially helpful:

            // four byte sequence
            // 11110ppp 10pppppp 10rrrrrr 10cccccc

> The word break tables are tested for all Unicode characters in  
> gen_word_break_tables.pl but that doesn't cover the C code.
>
> I just implemented and committed the tests against the UCD test suite.  
> There were two bugs in the implementation of the actual word break rules  
> that I fixed. Now the complete test suite passes.

Passes for me, too, and runs clean under Valgrind.  Lookin' good!

Marvin Humphrey


Re: [lucy-dev] StandardTokenizer has landed

Posted by Nick Wellnhofer <we...@aevum.de>.
On 06/12/11 20:08, Marvin Humphrey wrote:
> On Tue, Dec 06, 2011 at 02:45:46PM +0100, Nick Wellnhofer wrote:
>> This and similar schemes are widely used in Unicode processing. It isn't
>> too complicated once you wrap your head around it.
>
> Yes, that's the impression I got.  The data structures didn't seem hard --
> just a couple extra derefs, essentially.  It's all the bit-twiddling that
> makes it hard to grok -- but that's the price we pay for the efficiency of
> bitwise operations.
>
> Can I ask you to please add some comments to S_wb_lookup() in
> StandardTokenizer.c?  Also, longer and more descriptive variable names would
> be helpful -- it's not immediately clear what "i1", "i2", and "i3" represent.
>
> I'm pretty sure I can tease that lookup code apart if I try, but it would be
> easier with narration -- and it will certainly be easier for people who
> haven't done much hard-core UTF-8 and Unicode coding.

I will try to make that code more descriptive.

>> There's also a brief description in section 5.1 of the Unicode Standard.
>
> Looks like this:
>
>      http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
>
>      Optimized Two-Stage Table.
>
>      Wherever any blocks are identical, the pointers just point to the same
>      block. For transcoding tables, this case occurs generally for a block
>      containing only mappings to the default or “unmappable” character. Instead
>      of using NULL pointers and a default value, one “shared” block of default
>      entries is created. This block is pointed to by all first-stage table
>      entries, for which no character value can be mapped.  By avoiding tests
>      and branches, this strategy provides access time that approaches the
>      simple array access, but at a great savings in storage.

It's actually a three-stage table. See the next paragraph:

	Multistage Table Tuning.

	Given a table of arbitrary size and content, it is a relatively
	simple matter to write a small utility that can calculate the
	optimal number of stages and their width for a multistage
	table. Tuning the number of stages and the width of their
	arrays of index pointers can result in various trade-offs of
	table size versus average access time.

It isn't really made clear but the whole process can again be applied to 
the first-stage table so you end up with three tables. If you use a 
16-bit and an 8-bit split, you can think of Unicode planes and rows.
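
In code, a lookup with a 16-bit and an 8-bit split reads roughly like this --
the table names are made up, it's just to illustrate the plane/row/column
structure:

    #include <stdint.h>

    // Made-up tables for illustration only.
    extern const uint8_t plane_table[17];   // stage 1: one entry per plane
    extern const uint8_t row_table[];       // stage 2: (plane_id << 8) + row
    extern const uint8_t prop_table[];      // stage 3: (row_id << 8) + column

    static uint8_t
    S_three_stage_lookup(uint32_t cp) {
        uint32_t plane = cp >> 16;          // Unicode plane, 0..16
        uint32_t row   = (cp >> 8) & 0xFF;  // 256-code-point row in the plane
        uint32_t col   = cp & 0xFF;         // position within the row
        uint8_t  plane_id = plane_table[plane];
        uint8_t  row_id   = row_table[(plane_id << 8) + row];
        // Identical rows share one row_id and identical planes share one
        // plane_id, which is where the space savings come from.
        return prop_table[(row_id << 8) + col];
    }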

>> What I still want to do is to incorporate the word break test cases from
>> the Unicode website:
>>
>> http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt
>
> That would really help.  Bitwise code is hard to debug visually.  That's one
> reason why Lucy::Object::BitVector and Lucy::Util::NumberUtils have unit tests
> which are somewhat more thorough than other Lucy components.

The word break tables are tested for all Unicode characters in 
gen_word_break_tables.pl but that doesn't cover the C code.

I just implemented and committed the tests against the UCD test suite. 
There were two bugs in the implementation of the actual word break rules 
that I fixed. Now the complete test suite passes.

Nick

Re: [lucy-dev] StandardTokenizer has landed

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Dec 06, 2011 at 02:45:46PM +0100, Nick Wellnhofer wrote:
> On 06/12/2011 05:16, Marvin Humphrey wrote:
>> I didn't grok everything that was being done in the compressed table lookup
>> scheme, but your code is as well-documented and easy to follow as anything
>> that does that much bit-twiddling possibly could be, and I feel like I could
>> dive in and work on it if the need arose.
>
> This and similar schemes are widely used in Unicode processing. It isn't  
> too complicated once you wrap your head around it.

Yes, that's the impression I got.  The data structures didn't seem hard --
just a couple extra derefs, essentially.  It's all the bit-twiddling that
makes it hard to grok -- but that's the price we pay for the efficiency of
bitwise operations.

Can I ask you to please add some comments to S_wb_lookup() in
StandardTokenizer.c?  Also, longer and more descriptive variable names would
be helpful -- it's not immediately clear what "i1", "i2", and "i3" represent.

I'm pretty sure I can tease that lookup code apart if I try, but it would be
easier with narration -- and it will certainly be easier for people who
haven't done much hard-core UTF-8 and Unicode coding.

> There's also a brief description in section 5.1 of the Unicode Standard.

Looks like this:

    http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf

    Optimized Two-Stage Table.
    
    Wherever any blocks are identical, the pointers just point to the same
    block. For transcoding tables, this case occurs generally for a block
    containing only mappings to the default or “unmappable” character. Instead
    of using NULL pointers and a default value, one “shared” block of default
    entries is created. This block is pointed to by all first-stage table
    entries, for which no character value can be mapped.  By avoiding tests
    and branches, this strategy provides access time that approaches the
    simple array access, but at a great savings in storage.

> I also made the assumption that the Tokenizer input is valid UTF-8. Is  
> that true?

That is enforced in the default indexing chain by the fact that
Lucy_ViewCB_Assign_Str() performs a UTF-8 validity check in this section of
trunk/perl/xs/Lucy/Index/Inverter.c:

        // Get the field value, forcing text fields to UTF-8.
        switch (Lucy_FType_Primitive_ID(type) & lucy_FType_PRIMITIVE_ID_MASK) {
            case lucy_FType_TEXT: {
                    STRLEN val_len;
                    char *val_ptr = SvPVutf8(value_sv, val_len);
                    lucy_ViewCharBuf *value
                        = (lucy_ViewCharBuf*)inv_entry->value;
                    Lucy_ViewCB_Assign_Str(value, val_ptr, val_len);
                    break;
                }

Token_Set_Text(), Token_new() and others do not enforce UTF-8 validity.  So
the answer to your question is that source material which comes in through
Indexer should be clean unless the chain contains a broken Analyzer, but lower
level APIs do not currently make guarantees.

We could probably stand to be more deliberate about where we check for UTF-8
validity in the indexing/analysis chain.  It's important for efficiency
reasons that modules such as StandardTokenizer be able to assume UTF-8 sanity
-- however I think it's important that feeding such a module invalid UTF-8
result in an exception (uncatchable if need be) rather than a potentially
exploitable memory error.

> What I still want to do is to incorporate the word break test cases from  
> the Unicode website:
>
> http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt

That would really help.  Bitwise code is hard to debug visually.  That's one
reason why Lucy::Object::BitVector and Lucy::Util::NumberUtils have unit tests
which are somewhat more thorough than other Lucy components.

> I like the way the snowball stemmer tests read test data from JSON files  
> using our own parser. So I'd convert the Unicode tests to JSON with a  
> perl script.

+1 

> I saw that there is an issue with JSON files and RAT  because we can't
> include a license header.

I think we should rid ourselves of this annoyance by modifying our JSON parser
to handle comments, probably line comments starting with "#".

If others agree, I'll open an issue.  Handling pound comments won't be that
hard to implement, and it would benefit the project if someone else were to
gain a little familiarity with the JSON parser.
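
In a hand-rolled parser it should mostly amount to extending the
whitespace-skipping routine, roughly like this generic sketch (not our actual
JSON parser code):

    // Generic sketch: treat '#' line comments as whitespace while lexing.
    static const char*
    S_skip_whitespace_and_comments(const char *ptr, const char *end) {
        while (ptr < end) {
            if (*ptr == ' ' || *ptr == '\t' || *ptr == '\r' || *ptr == '\n') {
                ptr++;
            }
            else if (*ptr == '#') {
                // Skip to the end of the line.
                while (ptr < end && *ptr != '\n') { ptr++; }
            }
            else {
                break;
            }
        }
        return ptr;
    }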

> Maybe we should put all  Unicode database related material (also the word
> break tables) in a  single directory like modules/unicode/ucd like the
> snowball stemmer.

Works for me.

FWIW, "modules" was originally intended as a place to put components which
would be optional.  Theoretically, we only need Analyzer itself in core -- all
of the subclasses can live outside, to be compiled as separate shared objects
or not.  But there's other work that needs to happen before we can compile
anything other than a monolithic Lucy shared object.

Marvin Humphrey


Re: [lucy-dev] StandardTokenizer has landed

Posted by Nick Wellnhofer <we...@aevum.de>.
On 06/12/2011 05:16, Marvin Humphrey wrote:
> I didn't grok everything that was being done in the compressed table lookup
> scheme, but your code is as well-documented and easy to follow as anything
> that does that much bit-twiddling possibly could be, and I feel like I could
> dive in and work on it if the need arose.

This and similar schemes are widely used in Unicode processing. It isn't 
too complicated once you wrap your head around it. There's also a brief 
description in section 5.1 of the Unicode Standard.

I also made the assumption that the Tokenizer input is valid UTF-8. Is 
that true?

What I still want to do is to incorporate the word break test cases from 
the Unicode website:

http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt

I like the way the snowball stemmer tests read test data from JSON files 
using our own parser. So I'd convert the Unicode tests to JSON with a 
perl script. I saw that there is an issue with JSON files and RAT 
because we can't include a license header. Maybe we should put all 
Unicode database related material (also the word break tables) in a 
single directory like modules/unicode/ucd like the snowball stemmer.

Nick

[lucy-dev] StandardTokenizer has landed

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Dec 05, 2011 at 01:38:14PM -0800, Marvin Humphrey wrote:
> Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!

Following up on this:

I've now reviewed all the StandardTokenizer commits that landed today.  Very
impressive work!  Aside from the PolyAnalyzer issue, all I could come up with
was some bikeshedding (which I'll just keep to myself).  :)

I didn't grok everything that was being done in the compressed table lookup
scheme, but your code is as well-documented and easy to follow as anything
that does that much bit-twiddling possibly could be, and I feel like I could
dive in and work on it if the need arose.

Marvin Humphrey