Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2011/11/15 21:51:49 UTC

[lucy-dev] Unicode integration

(Moving the "Custom analyzers" thread from lucy-user to lucy-dev)

On 15/11/11 05:22, Marvin Humphrey wrote:
> On Tue, Nov 15, 2011 at 02:41:22AM +0100, Nick Wellnhofer wrote:
>> Would it make sense to have all the Unicode functionality in the Lucy
>> core using a third party Unicode library? Or should we rely on the
>> Unicode support of the host language like we do for case folding?
>
> That hinges on the dependability, portability, licensing terms and
> ease-of-integration for this theoretical third party Unicode library.
> Dependencies are cool so long as we can bundle them, they don't take a million
> years to compile, they don't sabotage all the hard work we've done to make
> Lucy portable, etc.  (For a longer take on dependencies, see
> <http://markmail.org/message/2zsunkfleqocix67>.)

If all dependencies must be bundled, we can rule out something like ICU 
[1] because it's simply too big.

One alternative I could find is utf8proc [2]. It's 20K of C code, 
MIT-licensed and used for Postgres extensions and a Ruby gem. It 
supports Unicode normalization, case folding and stripping of accents.
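
For a taste of the API, accent stripping boils down to something like this
(untested sketch): utf8proc_map() decomposes canonically, UTF8PROC_STRIPMARK
drops the combining marks, and UTF8PROC_COMPOSE recomposes what is left:

    #include <stdio.h>
    #include <stdlib.h>
    #include "utf8proc.h"

    int main(void) {
        uint8_t *stripped;
        /* "Café" comes out as "Cafe". */
        if (utf8proc_map((const uint8_t *)"Caf\xC3\xA9", 0, &stripped,
                         UTF8PROC_NULLTERM | UTF8PROC_STABLE
                         | UTF8PROC_COMPOSE | UTF8PROC_STRIPMARK) < 0) {
            return 1;
        }
        printf("%s\n", stripped);
        free(stripped);
        return 0;
    }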

Then there's the Perl module Unicode::Normalize with very similar 
functionality. But I'm not sure if the Perl License is compatible with 
the Apache License.

One downside of bundling a Unicode library is that they all need some 
rather large tables. utf8proc comes with a 1.2 MB .c file containing the 
tables. The whole library compiles to about 500 KB. Unicode::Normalize 
builds its tables from the Unicode database files that come with Perl 
and compiles to about 300 KB. All this on i386, 32 bit.

On the positive side, we'd have things like case folding, normalization 
and accent stripping directly in core. We'd also get Unicode features 
for new host languages out of the box and it's the only way to make sure 
Unicode is handled consistently across different host languages and 
client platforms. The latter might be a rather academic concern, though.

Nick

[1] http://icu-project.org/
[2] http://www.public-software-group.org/utf8proc

Re: [lucy-dev] Unicode integration

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 16, 2011 at 11:24:22PM +0100, Nick Wellnhofer wrote:
> If we go with utf8proc, I would propose a new analyzer  
> Lucy::Analysis::Normalizer with the following interface:
>
> my $normalizer = Lucy::Analysis::Normalizer->new(
>     normalization_form => $string,
>     case_fold          => $bool,
>     strip_accents      => $bool,
> );

For the benefit of those who are not subscribed to the lucy-issues list[1], I
wanted to pass along that Nick has followed through with a full, portable C
implementation of Lucy::Analysis::Normalizer, with proper documentation,
tests... the whole nine yards.

    https://issues.apache.org/jira/browse/LUCY-191

Things could hardly have gone better or more according to the "Apache Way".
Nick did not let himself be held back by either the redaction of the Analyzer
subclassing API or the dependency constraints he was asked to work within. He
moved the discussion from the user list to the dev list at the appropriate
moment, proposed an interface and the basic shape of an implementation, built
consensus for his proposal, then coded up his contribution with hardly any
help and delivered a solid patch.

And then as an encore, yesterday Nick submitted a patch to solve our current
Highlighter bug.

Bravo, Nick!

Marvin Humphrey

[1] The lucy-issues list gets notifications from our JIRA issue tracker.
    Significant design decisions must always be undertaken on the dev list, so
    conversations in the issue tracker are limited to implementation
    discussions. <http://incubator.apache.org/lucy/mailing_lists.html>


Re: [lucy-dev] Unicode integration

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Nov 17, 2011 at 9:29 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 17/11/2011 14:08, Robert Muir wrote:
>>
>> yeah, the problematic ones can be seen here:
>> http://www.unicode.org/Public/5.0.0/ucd/DerivedNormalizationProps.txt
>>
>> # Derived Property: FC_NFKC_Closure
>> #  Generated from computing: b = NFKC(Fold(a)); c = NFKC(Fold(b));
>> #  Then if (c != b) add the mapping from a to c to the set of
>> #  mappings that constitute the FC_NFKC_Closure list
>>
>> So from what I can tell at a glance: with the utf8proc algorithm, if
>> you specify NFKC and casefolding, it's not yet 'done'.
>
> I just verified that the output utf8proc produces with the options STABLE,
> COMPOSE, COMPAT, and CASEFOLD really matches the FC_NFKC mapping. See the
> test program at https://gist.github.com/1373256
>

But the problem cannot be tested with single codepoints, I think. I'm
pretty sure the issue has to do with contextual normalization/casefolding
(neither of these is 1-1), especially involving things like Greek
diacritics.

A simple test would just generate lots of random Unicode strings,
normalize each with this option, then normalize the result again and
check that the two outputs are the same.
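
Something like this, untested, against utf8proc's utf8proc_map() and
utf8proc_encode_char() (strings the library rejects are simply skipped;
NUL and unpaired surrogates are filtered up front):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "utf8proc.h"

    #define OPTS (UTF8PROC_NULLTERM | UTF8PROC_STABLE | UTF8PROC_COMPOSE \
                  | UTF8PROC_COMPAT | UTF8PROC_CASEFOLD)

    int main(void) {
        uint8_t buf[41];  /* ten codepoints of up to 4 bytes, plus NUL */
        int trial, i;
        srand(12345);
        for (trial = 0; trial < 100000; trial++) {
            uint8_t *p = buf, *once, *twice;
            for (i = 0; i < 10; i++) {
                int32_t cp;
                do {  /* random scalar value: no NUL, no surrogates */
                    cp = (int32_t)((((uint32_t)rand() << 15)
                                    ^ (uint32_t)rand()) % 0x110000u);
                } while (cp == 0 || (cp >= 0xD800 && cp <= 0xDFFF));
                p += utf8proc_encode_char(cp, p);
            }
            *p = '\0';
            /* Normalize once, then normalize the result again. */
            if (utf8proc_map(buf, 0, &once, OPTS) < 0) continue;
            if (utf8proc_map(once, 0, &twice, OPTS) < 0) {
                free(once);
                continue;
            }
            if (strcmp((char *)once, (char *)twice) != 0)
                printf("not idempotent after one pass: %s\n", buf);
            free(once);
            free(twice);
        }
        return 0;
    }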

-- 
lucidimagination.com

Re: [lucy-dev] Unicode integration

Posted by Nick Wellnhofer <we...@aevum.de>.
On 17/11/2011 14:08, Robert Muir wrote:
> yeah, the problematic ones can be seen here:
> http://www.unicode.org/Public/5.0.0/ucd/DerivedNormalizationProps.txt
>
> # Derived Property: FC_NFKC_Closure
> #  Generated from computing: b = NFKC(Fold(a)); c = NFKC(Fold(b));
> #  Then if (c != b) add the mapping from a to c to the set of
> #  mappings that constitute the FC_NFKC_Closure list
>
> So from what I can tell at a glance: with the utf8proc algorithm, if
> you specify NFKC and casefolding, it's not yet 'done'.

I just verified that the output utf8proc produces with the options 
STABLE, COMPOSE, COMPAT, and CASEFOLD really matches the FC_NFKC 
mapping. See the test program at https://gist.github.com/1373256
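
In essence, the check looks like this (a simplified, untested sketch, not
the actual gist code); any codepoint the loop flags would be an
FC_NFKC_Closure case that a single pass fails to reach:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "utf8proc.h"

    /* For each codepoint a, compute b = NFKC(Fold(a)), then
     * c = NFKC(Fold(b)), and report wherever c != b. */
    int main(void) {
        int options = UTF8PROC_NULLTERM | UTF8PROC_STABLE | UTF8PROC_COMPOSE
                    | UTF8PROC_COMPAT | UTF8PROC_CASEFOLD;
        int32_t cp;
        for (cp = 1; cp < 0x110000; cp++) {
            uint8_t buf[5], *b, *c;
            ssize_t len;
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; /* surrogates */
            len = utf8proc_encode_char(cp, buf);
            if (len <= 0) continue;
            buf[len] = '\0';
            if (utf8proc_map(buf, 0, &b, options) < 0) continue;
            if (utf8proc_map(b, 0, &c, options) < 0) { free(b); continue; }
            if (strcmp((char *)b, (char *)c) != 0)
                printf("U+%04X is not stable after one pass\n", (unsigned)cp);
            free(b);
            free(c);
        }
        return 0;
    }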

This is because case folding is done together with the decomposition step.

I also think this would be a nice default for a search engine.

Nick

Re: [lucy-dev] Unicode integration

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Nov 17, 2011 at 7:45 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 17/11/2011 13:37, Robert Muir wrote:
>>
>> The point of the derived property is that there are sneaky
>> interactions between these.
>
> Having a look at the utf8proc code, the function utf8proc_decompose_char
> calls itself recursively when substituting characters. So it looks like it
> does support NFKC_Casefold properly.

yeah, the problematic ones can be seen here:
http://www.unicode.org/Public/5.0.0/ucd/DerivedNormalizationProps.txt

# Derived Property: FC_NFKC_Closure
#  Generated from computing: b = NFKC(Fold(a)); c = NFKC(Fold(b));
#  Then if (c != b) add the mapping from a to c to the set of
#  mappings that constitute the FC_NFKC_Closure list

So from what I can tell at a glance: with the utf8proc algorithm, if
you specify NFKC and casefolding, it's not yet 'done'.

-- 
lucidimagination.com

Re: [lucy-dev] Unicode integration

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Nov 17, 2011 at 7:45 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 17/11/2011 13:37, Robert Muir wrote:
>>
>> The point of the derived property is that there are sneaky
>> interactions between these.
>
> Having a look at the utf8proc code, the function utf8proc_decompose_char
> calls itself recursively when substituting characters. So it looks like it
> does support NFKC_Casefold properly.
>
> Nick
>

I don't think so: it seems to only decompose the 'output' of the case
folding mapping. This is not enough.

If I remember correctly, the problem is that normalization uses context,
so the algorithm must be done as stated in the standard:

    toNFKC_Casefold(X): Map each character C in X to NFKC_Casefold(C) and
    then normalize the resulting string to NFC

That is: do the mappings first, then normalize the whole string.

In ICU this is instead done as an additional normalization form, so it's
single-pass/non-recursive there.
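
With utf8proc, the closest you could get is approximating that recipe in
two utf8proc_map() passes; an untested sketch, with UTF8PROC_IGNORE
standing in for the removal of default ignorables (whether this covers
every corner case is exactly my concern):

    #include <stdlib.h>
    #include "utf8proc.h"

    /* Two-pass approximation of toNFKC_Casefold(): first apply the
     * character-level mappings (compatibility decomposition, case
     * folding, removal of default ignorables), then normalize the
     * whole intermediate string to the composed form.  Returns a
     * malloc'd string the caller frees, or NULL on error. */
    static uint8_t *
    nfkc_casefold(const uint8_t *text) {
        uint8_t *mapped, *result;
        if (utf8proc_map(text, 0, &mapped,
                         UTF8PROC_NULLTERM | UTF8PROC_STABLE
                         | UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT
                         | UTF8PROC_CASEFOLD | UTF8PROC_IGNORE) < 0) {
            return NULL;
        }
        if (utf8proc_map(mapped, 0, &result,
                         UTF8PROC_NULLTERM | UTF8PROC_STABLE
                         | UTF8PROC_COMPOSE) < 0) {
            free(mapped);
            return NULL;
        }
        free(mapped);
        return result;
    }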

-- 
lucidimagination.com

Re: [lucy-dev] Unicode integration

Posted by Nick Wellnhofer <we...@aevum.de>.
On 17/11/2011 13:37, Robert Muir wrote:
> The point of the derived property is that there are sneaky
> interactions between these.

Having a look at the utf8proc code, the function 
utf8proc_decompose_char calls itself recursively when substituting 
characters. So it looks like it does support NFKC_Casefold properly.

Nick

Re: [lucy-dev] Unicode integration

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Nov 17, 2011 at 7:30 AM, Nick Wellnhofer <we...@aevum.de> wrote:
>
> I'm not sure about the last point but NFKC, CaseFolding, and removal of
> Default_Ignorable_Code_Points are supported.
>

The point of the derived property is that there are sneaky
interactions between these.

In ICU, this form is "nfkc_cf"; you get it like any other normalizer, and
the whole transformation is accomplished in a single pass over the text.
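
For reference, with ICU's C API that looks something like this (untested;
unorm2_getInstance() accepts the "nfkc_cf" name as of ICU 4.4):

    #include <stdio.h>
    #include "unicode/unorm2.h"

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;
        /* nfkc_cf is requested by name, just like nfc or nfkc. */
        const UNormalizer2 *n = unorm2_getInstance(NULL, "nfkc_cf",
                                                   UNORM2_COMPOSE, &status);
        UChar src[] = { 0x0130, 0 };   /* Turkish capital dotted I */
        UChar dest[16];
        int32_t len = unorm2_normalize(n, src, -1, dest, 16, &status);
        if (U_FAILURE(status)) return 1;
        printf("normalized and folded to %d UTF-16 code units\n", (int)len);
        return 0;
    }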

-- 
lucidimagination.com

Re: [lucy-dev] Unicode integration

Posted by Nick Wellnhofer <we...@aevum.de>.
On 17/11/2011 01:46, Robert Muir wrote:
> Does your unicode library also support "NFKC_CaseFold" ? It might be a
> nice default:
>
> # Derived Property:   NFKC_Casefold (NFKC_CF)
> #   This property removes certain variations from characters: case,
> #   compatibility, and default-ignorables.
> #   It is used for loose matching and certain types of identifiers.
> #   It is constructed by applying NFKC, CaseFolding, and removal of
> #   Default_Ignorable_Code_Points.
> #   The process of applying these transformations is repeated until a
> #   stable result is produced.

I'm not sure about the last point but NFKC, CaseFolding, and removal of 
Default_Ignorable_Code_Points are supported.

Nick

Re: [lucy-dev] Unicode integration

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Nov 16, 2011 at 5:24 PM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 16/11/11 04:49, Marvin Humphrey wrote:
>>
>> It would be great to support accent stripping in Lucy -- that's something
>> a
>> lot of people need.  Normalization would also be a nice feature to offer
>> (Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
>> replacement?).
>
> Thinking about the implications of Unicode in the analyzer chain, I've come
> to the conclusion that the first step should always be tokenization. In the
> current implementation the CaseFolder comes first in the chain by default.
> But case folding (or lowercasing) can add or remove Unicode codepoints and
> mess with the character offsets for the highlighter. See the attached script
> for a demonstration.
>
>> It would also be great to migrate Lucy::Analysis::CaseFolder code away
>> from
>> its dependency on the Perl C API.
>
> Yes, we could even do proper Unicode case folding, normalization and accent
> stripping in one pass with utf8proc. This should be the next step after
> tokenization. The stopalizer and stemmers should be safe when using NFC or
> NFKC. I think we can leave the choice between these normalization forms to
> the user.
>
> If we go with utf8proc, I would propose a new analyzer
> Lucy::Analysis::Normalizer with the following interface:
>
> my $normalizer = Lucy::Analysis::Normalizer->new(
>    normalization_form => $string,
>    case_fold          => $bool,
>    strip_accents      => $bool,
> );
>
> normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The
> decomposed forms won't play well with other analyzers but could be easily
> added for completeness. I'm not sure whether we should default to NFC or
> NFKC.
>
> case_fold and strip_accents are simple on/off switches. By default case_fold
> is enabled and strip_accents disabled.
>

Does your unicode library also support "NFKC_CaseFold" ? It might be a
nice default:

# Derived Property:   NFKC_Casefold (NFKC_CF)
#   This property removes certain variations from characters: case,
#   compatibility, and default-ignorables.
#   It is used for loose matching and certain types of identifiers.
#   It is constructed by applying NFKC, CaseFolding, and removal of
#   Default_Ignorable_Code_Points.
#   The process of applying these transformations is repeated until a
#   stable result is produced.


-- 
lucidimagination.com

Re: [lucy-dev] Unicode integration

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 11/16/11 11:09 PM:
> On Wed, Nov 16, 2011 at 11:24:22PM +0100, Nick Wellnhofer wrote:

[snip]

>>
>> The default analyzer chain would be tokenize, normalize, stem.
> 
> The gist of your proposal seems sound.  It's great to see that you are
> thinking about all these things, and to see them all laid out here.
> 
> I don't see much to disagree with in your API choices, aside from the questions
> of what the default analyzer order should be and whether case_fold should be a
> boolean. Neither of those quibbles blocks the proposal.
> 

+1 to that.

I've enjoyed following this thread, having wrestled with UTF-8 analysis a lot in
libswish3 [0]. I think robust UTF-8 string handling in core is a win, especially
if it includes a relatively lightweight way of dealing with the Unicode tables
in a portable way.

+1 to utf8proc

Thanks for initiating this thread, Nick.

[0] http://s.apache.org/722

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Unicode integration

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Nov 17, 2011 at 12:09 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:

> I wonder: does either "common" or "simple" Unicode case folding preserve a
> one-to-one relationship between num-code-points-in and num-code-points-out?
> Because I believe that a case folding algorithm with that property would not
> mess up the Highlighting data.
>

Simple case folding does. I don't know what "common" is; it's not a
type of case folding, it's only a class of mapping in the data file that
is common to both simple and full case folding (the only two types).
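
For anyone without the data file at hand, a few representative lines from
CaseFolding.txt; the status field marks each mapping as common (C, shared
by both kinds), simple-only (S), full-only (F), or Turkic (T):

    0041; C; 0061; # LATIN CAPITAL LETTER A
    00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
    0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
    0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

Simple folding applies the C and S mappings and is always one-to-one; full
folding applies C and F, which can expand a single codepoint to several.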


-- 
lucidimagination.com

Re: [lucy-dev] Unicode integration

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Nov 17, 2011 at 10:52 PM, Marvin Humphrey
<ma...@rectangular.com> wrote:
>
> OK, I remain at least academically interested in what sort of performance
> advantages 'simple' case folding affords us, and at what penalty in terms of
> relevancy.
>

I think it depends on how it's implemented; I'm not sure there is really a
performance advantage to the simpler one. In ICU at least, the
recursive part of nfkc_cf is computed up-front, into the data files,
and you get normalization plus case folding at runtime in one pass
(versus utf8proc's multiple passes, where it's not clear all the corner
cases are working).

As far as relevance goes, I think realistically only German users (ß/SS)
or anyone with Ancient Greek would care if you cheated and used the
simple one instead, especially if you are already normalizing anyway.

But that was just my point: if you are normalizing anyway, why not
just choose a normalization form that does the case folding too?

-- 
lucidimagination.com

Re: [lucy-dev] Unicode integration

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Nov 17, 2011 at 02:05:44PM +0100, Nick Wellnhofer wrote:
> On 17/11/2011 06:09, Marvin Humphrey wrote:
>> I wonder: does either "common" or "simple" Unicode case folding preserve a
>> one-to-one relationship between num-code-points-in and num-code-points-out?
>
> Yes, simple case folding does.

OK, I remain at least academically interested in what sort of performance
advantages 'simple' case folding affords us, and at what penalty in terms of
relevancy.

However, as you and Robert are in agreement that NFKC_CF should work well, and
since utf8proc apparently only supports full casefolding anyhow, I'd love to
see an implementation.

I've been working today on pulling utf8proc into our repository:

    https://issues.apache.org/jira/browse/LUCY-189
    https://issues.apache.org/jira/browse/LEGAL-110

With those patches applied, utf8proc is integrated into our build, and you can
pound-include "utf8proc.h" from any C file under either core/ or perl/xs/.

> It only offers full case folding afaics.

Okeedoke -- thanks for looking into the matter.

> Simple case folding would work before tokenization, but I still don't
> like the idea of allowing certain analyzers before tokenization just
> because they don't add or remove codepoints. There might even be some
> long-term gains if we move tokenization completely out of the analysis
> chain. The analyzers could work directly on tokens instead of inversions,
> and we could employ a token cache, for example.

Since we redacted the Analyzer subclassing API, we have a lot of freedom to
make such experiments!

Marvin Humphrey


Re: [lucy-dev] Unicode integration

Posted by Nick Wellnhofer <we...@aevum.de>.
On 17/11/2011 06:09, Marvin Humphrey wrote:
> I wonder: does either "common" or "simple" Unicode case folding preserve a
> one-to-one relationship between num-code-points-in and num-code-points-out?

Yes, simple case folding does.

> Because I believe that a case folding algorithm with that property would not
> mess up the Highlighting data.
>
> But then it looks like utf8proc only offers one CASEFOLD option.  I wonder
> which one it is, or if it's configurable.

It only offers full case folding afaics.

Simple case folding would work before tokenization, but I still don't
like the idea of allowing certain analyzers before tokenization just
because they don't add or remove codepoints. There might even be some
long-term gains if we move tokenization completely out of the analysis
chain. The analyzers could work directly on tokens instead of inversions,
and we could employ a token cache, for example.

Nick

Re: [lucy-dev] Unicode integration

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 16, 2011 at 11:24:22PM +0100, Nick Wellnhofer wrote:
> Thinking about the implications of Unicode in the analyzer chain, I've  
> come to the conclusion that the first step should always be  
> tokenization.

I'm not sure whether I concur just yet.  Case folding after tokenization
likely means a minor performance hit, as it will be slightly more expensive to
casefold the text as individual tokens rather than in bulk.  I'm sure that the
degradation would be acceptable for the sake of correctness, but it would be
nice to explore all possibilities before we decide that it's required.

> In the current implementation the CaseFolder comes first in the chain by
> default.  But case folding (or lowercasing) can add or remove Unicode
> codepoints and mess with the character offsets for the highlighter.

Lucy::Analysis::CaseFolder is a rebadged KinoSearch::Analysis::LCNormalizer.
LCNormalizer simply applied lc() to the text to achieve case-insensitive
search.  When LCNormalizer was renamed to "CaseFolder", it gained the intent
that it would apply "Unicode case folding" -- though not the reality.

Here is some background for anyone following along who may be unfamiliar with
the distinction between Unicode case folding and other case manipulation
techniques:

    http://unicode.org/faq/casemap_charprop.html#2

    Q: What is the difference between case mapping and case folding?

    A: Case mapping or case conversion is a process whereby strings are
    converted to a particular form—uppercase, lowercase, or titlecase—possibly
    for display to the user. Case folding is primarily used for caseless
    comparison of text, such as identifiers in a computer program, rather than
    actual text transformation. Case folding in Unicode is based on the
    lowercase mapping, but includes additional changes to the source text to
    help make it language-insensitive and consistent. As a result, case-folded
    text should be used solely for internal processing and generally should
    not be stored or displayed to the end user.

The fact that CaseFolder is currently powered by the function that underlies
Perl's lc() means that it is buggy and incomplete.  It also means that Lucy
never had to choose between the various flavors of Unicode case folding.

I wonder: does either "common" or "simple" Unicode case folding preserve a
one-to-one relationship between num-code-points-in and num-code-points-out?
Because I believe that a case folding algorithm with that property would not
mess up the Highlighting data.

But then it looks like utf8proc only offers one CASEFOLD option.  I wonder
which one it is, or if it's configurable.

> See the attached script for a demonstration.

Ah, Turkish İ.  Sigh.

Thank you for going to the trouble to provide that excellent code sample.
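
For anyone following along without the attachment, the heart of the İ
problem can be reproduced with utf8proc directly (untested sketch; its
CASEFOLD option performs full case folding, the only kind it offers):

    #include <stdio.h>
    #include <stdlib.h>
    #include "utf8proc.h"

    int main(void) {
        /* U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE: one codepoint
         * in, two codepoints out (U+0069 plus U+0307 COMBINING DOT
         * ABOVE), so character offsets recorded against the original
         * text no longer line up with the folded text. */
        uint8_t *folded;
        ssize_t len = utf8proc_map((const uint8_t *)"\xC4\xB0", 0, &folded,
                                   UTF8PROC_NULLTERM | UTF8PROC_CASEFOLD);
        if (len < 0) return 1;
        printf("1 codepoint in, %ld bytes (2 codepoints) out\n", (long)len);
        free(folded);
        return 0;
    }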

> If we go with utf8proc, I would propose a new analyzer  
> Lucy::Analysis::Normalizer with the following interface:
>
> my $normalizer = Lucy::Analysis::Normalizer->new(
>     normalization_form => $string,
>     case_fold          => $bool,
>     strip_accents      => $bool,
> );

It seems that utf8proc offers composite string transformations.  I agree with
the basic concept of a Lucy Analyzer class which is a wrapper around those
composite capabilities.  

> normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The  
> decomposed forms won't play well with other analyzers but could be  
> easily added for completeness.
>
> I'm not sure whether we should default to  
> NFC or NFKC.
>
> case_fold and strip_accents are simple on/off switches. By default  
> case_fold is enabled and strip_accents disabled.
>
> The default analyzer chain would be tokenize, normalize, stem.

The gist of your proposal seems sound.  It's great to see that you are
thinking about all these things, and to see them all laid out here.

I don't see much to disagree with in your API choices, aside from the questions
of what the default analyzer order should be and whether case_fold should be a
boolean.  Neither of those quibbles blocks the proposal.

> Lucy::Analysis::CaseFolder could then be implemented as a subclass of  
> Lucy::Analysis::Normalizer for compatibility.

Makes sense.

> Further idea: implement a simple and fast tokenizer in core based on the  
> Unicode character class table provided with utf8proc.

Sounds interesting.  Presumably it would use a fixed pattern...

Marvin Humphrey



Re: [lucy-dev] Unicode integration

Posted by Nick Wellnhofer <we...@aevum.de>.
On 16/11/11 04:49, Marvin Humphrey wrote:
> It would be great to support accent stripping in Lucy -- that's something a
> lot of people need.  Normalization would also be a nice feature to offer
> (Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
> replacement?).

Thinking about the implications of Unicode in the analyzer chain, I've 
come to the conclusion that the first step should always be 
tokenization. In the current implementation the CaseFolder comes first 
in the chain by default. But case folding (or lowercasing) can add or 
remove Unicode codepoints and mess with the character offsets for the 
highlighter. See the attached script for a demonstration.

> It would also be great to migrate Lucy::Analysis::CaseFolder code away from
> its dependency on the Perl C API.

Yes, we could even do proper Unicode case folding, normalization and 
accent stripping in one pass with utf8proc. This should be the next step 
after tokenization. The stopalizer and stemmers should be safe when 
using NFC or NFKC. I think we can leave the choice between these 
normalization forms to the user.

If we go with utf8proc, I would propose a new analyzer 
Lucy::Analysis::Normalizer with the following interface:

my $normalizer = Lucy::Analysis::Normalizer->new(
     normalization_form => $string,
     case_fold          => $bool,
     strip_accents      => $bool,
);

normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The 
decomposed forms won't play well with other analyzers but could be 
easily added for completeness. I'm not sure whether we should default to 
NFC or NFKC.

case_fold and strip_accents are simple on/off switches. By default 
case_fold is enabled and strip_accents disabled.

The default analyzer chain would be tokenize, normalize, stem.

Lucy::Analysis::CaseFolder could then be implemented as a subclass of 
Lucy::Analysis::Normalizer for compatibility.

Further idea: implement a simple and fast tokenizer in core based on the 
Unicode character class table provided with utf8proc.
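
A rough, untested sketch of the idea, assuming utf8proc's
utf8proc_iterate() and utf8proc_get_property() plus its
UTF8PROC_CATEGORY_* constants (a real tokenizer would also record start
and end offsets for the highlighter):

    #include <stdio.h>
    #include "utf8proc.h"

    /* Treat any Letter category (Lu, Ll, Lt, Lm, Lo) or decimal
     * digit (Nd) as a word character. */
    static int
    is_word_char(int32_t cp) {
        int category = utf8proc_get_property(cp)->category;
        return (category >= UTF8PROC_CATEGORY_LU
                && category <= UTF8PROC_CATEGORY_LO)
            || category == UTF8PROC_CATEGORY_ND;
    }

    /* Print each maximal run of word characters in a UTF-8 buffer. */
    static void
    tokenize(const uint8_t *text, ssize_t len) {
        ssize_t pos = 0, start = -1;
        while (pos < len) {
            int32_t cp;
            ssize_t n = utf8proc_iterate(text + pos, len - pos, &cp);
            if (n < 0) break; /* invalid UTF-8 */
            if (is_word_char(cp)) {
                if (start < 0) start = pos;
            }
            else if (start >= 0) {
                printf("%.*s\n", (int)(pos - start), text + start);
                start = -1;
            }
            pos += n;
        }
        if (start >= 0)
            printf("%.*s\n", (int)(pos - start), text + start);
    }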

Nick

Re: [lucy-dev] Unicode integration

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Nov 15, 2011 at 09:51:49PM +0100, Nick Wellnhofer wrote:
> One alternative I could find is utf8proc [2]. It's 20K of C code,  
> MIT-licensed and used for Postgres extensions and a Ruby gem. It  
> supports Unicode normalization, case folding and stripping of accents.

utf8proc also supplies some low-level routines which we might be able to use
to replace stuff in Lucy/Util/StringHelper.c.

It compiles plenty fast; it's just two .c files (one of which includes the
other) and one .h file.  The Makefile isn't portable, but I see that they've
taken pains to accommodate MSVC in the C files, so they'll probably compile
everywhere that Lucy does.

I see no problems with bundling utf8proc as a dependency.

Looks like a great find, Nick!

> One downside of bundling a Unicode library is that they all need some  
> rather large tables. utf8proc comes with a 1.2 MB .c file containing the  
> tables. The whole library compiles to about 500 KB. Unicode::Normalize  
> builds its tables from the Unicode database files that come with Perl  
> and compiles to about 300 KB. All this on i386, 32 bit.

With the current state of things, adding 500 KB is unlikely to make a
difference.

On my Mac running Snow Leopard, adding utf8proc pushes the compiled size of
Lucy.bundle from 2.8 MB to 3.3 MB.  The largest compiled objects contributing
to that tally are Lucy.o at 1.2 MB, compiled from Lucy.xs, and autogen/parcel.c
at 1.2 MB, which contains Clownfish OO support such as vtables.  The Snowball
stemmers add around 200 KB, and the Snowball stoplists around 100 KB.

If we put our minds to it, we could slim down Lucy.o and parcel.o -- maybe by
a lot, since nobody's ever bothered to work on optimizing for space.  The same
wouldn't hold true for utf8proc (or the Snowball materials).  But that just
doesn't seem important given all the other reasons utf8proc looks like a good
fit.

> On the positive side, we'd have things like case folding, normalization  
> and accent stripping directly in core.

It would be great to support accent stripping in Lucy -- that's something a
lot of people need.  Normalization would also be a nice feature to offer
(Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
replacement?).

It would also be great to migrate Lucy::Analysis::CaseFolder code away from
its dependency on the Perl C API.

> We'd also get Unicode features  for new host languages out of the box and
> it's the only way to make sure  Unicode is handled consistently across
> different host languages and  client platforms. The latter might be a rather
> academic concern, though.

Personally, I don't see cross-host index compatibility as so important that we
ought to make big sacrifices to achieve it.

Regardless, integrating utf8proc seems worthwhile for lots of other reasons. 

+1 from me!

Marvin Humphrey