Posted to user@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/10/12 19:47:02 UTC

Re: [lucy-user] Different UTF-8 behaviour between perl 5.8.8 (indexes ok) and 5.10.1 (indexing fails)

On Wed, Oct 12, 2011 at 05:57:33PM +0200, goran kent wrote:
> > It sounds like, without seeing a reproducible test case, that Lucy is
> > choking appropriately on malformed UTF-8.
> 
> Absolutely.  What's interesting is that the same Lucy code does not
> choke on the other machines with the older Perl.

Lucy trusts that incoming data it has received from Perl is well-formed.
(Technically, it assumes that string data obtained via the XS routine
SvPVutf8() is well-formed UTF-8, notwithstanding the difference between Perl's
loose internal representation and the Unicode standard for UTF-8.)  We could
add an index-time validity check, but that would slow down indexing.

At search-time, though, Lucy is reading from the file system rather than
receiving data from Perl -- and data from the file system cannot be trusted.
Therefore, Lucy always performs validity checks when reading what is
ostensibly UTF-8 data out of an existing index.
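
To make "validity check" concrete, here is a minimal sketch in C of a strict
UTF-8 validator. It is not Lucy's actual routine: the name utf8_valid_sketch
and its signature are hypothetical, chosen only for illustration. It rejects
the sequences that Perl's looser internal encoding can tolerate but the
Unicode standard forbids: overlong forms, surrogates, and code points above
U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of Lucy's API.
 * Returns true if buf[0..len) is well-formed UTF-8 per the Unicode standard.
 */
bool
utf8_valid_sketch(const char *buf, size_t len) {
    const uint8_t *p   = (const uint8_t*)buf;
    const uint8_t *end = p + len;

    while (p < end) {
        uint8_t  lead = *p++;
        uint32_t code_point;
        uint32_t min;      /* smallest code point legal at this length */
        size_t   trailing; /* number of continuation bytes expected */

        if (lead < 0x80) {
            continue;      /* ASCII */
        }
        else if ((lead & 0xE0) == 0xC0) {
            trailing = 1; min = 0x80;    code_point = lead & 0x1F;
        }
        else if ((lead & 0xF0) == 0xE0) {
            trailing = 2; min = 0x800;   code_point = lead & 0x0F;
        }
        else if ((lead & 0xF8) == 0xF0) {
            trailing = 3; min = 0x10000; code_point = lead & 0x07;
        }
        else {
            return false;  /* stray continuation byte or invalid lead byte */
        }

        /* Truncated sequence at the end of the buffer. */
        if ((size_t)(end - p) < trailing) { return false; }

        while (trailing--) {
            uint8_t c = *p++;
            if ((c & 0xC0) != 0x80) { return false; }
            code_point = (code_point << 6) | (c & 0x3F);
        }

        if (code_point < min)      { return false; } /* overlong form */
        if (code_point > 0x10FFFF) { return false; } /* beyond Unicode */
        if (code_point >= 0xD800 && code_point <= 0xDFFF) {
            return false;                            /* surrogate */
        }
    }

    return true;
}
```

A check of this kind is a single linear pass over the bytes, which is the
per-document cost that an index-time check would add.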

I don't know of a mechanism whereby Lucy's behavior would change between
different versions of Perl.  In any case, having invalid UTF-8 in your Perl
scalars is bad news -- it can do things like crash the regex engine.  It will
also lead to corrupt Lucy indexes that fail the search-time UTF-8 validity
check.

How are you getting raw data into Perl?

> Anyway, I like the idea of rolling my own perl to be absolutely sure
> of coherence across my machines.

+1

Marvin Humphrey