Posted to user@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/10/12 19:47:02 UTC
Re: [lucy-user] Different UTF-8 behaviour between perl 5.8.8 (indexes ok) and 5.10.1 (indexing fails)
On Wed, Oct 12, 2011 at 05:57:33PM +0200, goran kent wrote:
> > It sounds like, without seeing a reproducible test case, that Lucy is
> > choking appropriately on malformed UTF-8.
>
> Absolutely. What's interesting is that the same Lucy code does not
> choke on the other machines with the older Perl.

Lucy trusts that the data it receives from Perl is well-formed.
(Technically, it assumes that string data obtained via the XS routine
SvPVutf8() is well-formed UTF-8, notwithstanding the difference between Perl's
loose internal representation and the Unicode standard for UTF-8.) We could
add an index-time validity check, but that would slow down indexing.

At search-time, though, Lucy is reading from the file system rather than
receiving data from Perl -- and data from the file system cannot be trusted.
Therefore, Lucy always performs validity checks when reading what is
ostensibly UTF-8 data out of an existing index.

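To make "well-formed" concrete, here is a rough sketch of the kind of check
involved. It is written in Python for brevity and is not Lucy's actual code;
the function name is_well_formed_utf8 is made up for illustration. The byte
ranges follow the Unicode standard's table of well-formed UTF-8 byte
sequences, which excludes overlong forms, surrogates, and anything above
U+10FFFF.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 per the Unicode standard.

    Illustrative sketch only; the name and structure are assumptions,
    not taken from Lucy.
    """
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # ASCII, one byte
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range
        # for the FIRST continuation byte, based on the lead byte.
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong 3-byte forms
        elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes surrogates U+D800..U+DFFF
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlong 4-byte forms
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # excludes code points > U+10FFFF
        else:
            return False                   # 0x80..0xC1 and 0xF5..0xFF never lead
        if i + need >= n:
            return False                   # sequence truncated at end of input
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):       # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

A sequence such as the two bytes C0 AF (an overlong encoding of "/") or
ED A0 80 (an encoded surrogate) fails this check, which is the sort of thing
a looser internal representation may let through.
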
I don't know of a mechanism whereby Lucy's behavior would change between
different versions of Perl. In any case, having invalid UTF-8 in your Perl
scalars is bad news -- it can do things like crash the regex engine. It will
also lead to corrupt Lucy indexes that fail the search-time UTF-8 validity
check.

How are you getting raw data into Perl?

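If the data arrives as raw bytes from files or sockets, one way to narrow
things down is to check it at the boundary, before it reaches the indexer,
and report where the first bad byte sits. A minimal sketch of that idea, in
Python for brevity; first_invalid_utf8 is a hypothetical helper name, and the
sample inputs are invented for illustration.

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration: it relies on Python's strict
    UTF-8 decoder rather than on anything in Lucy or Perl.
    """
    try:
        data.decode("utf-8")        # errors="strict" is the default
    except UnicodeDecodeError as err:
        return err.start            # offset where decoding failed
    return None


if __name__ == "__main__":
    # Latin-1 text mislabeled as UTF-8 is a common source of bad input.
    samples = [b"plain ascii", "na\u00efve".encode("latin-1")]
    for raw in samples:
        print(raw, first_invalid_utf8(raw))
```

The same approach should carry over to Perl by decoding input with Encode's
strict "UTF-8" encoding and a croaking fallback, rather than flagging bytes
as UTF-8 without checking them.
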
> Anyway, I like the idea of rolling my own perl to be absolutely sure
> of coherence across my machines.

+1

Marvin Humphrey