Posted to commits@lucy.apache.org by nw...@apache.org on 2011/12/12 16:31:04 UTC

[lucy-commits] svn commit: r1213280 - /incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c

Author: nwellnhof
Date: Mon Dec 12 15:31:03 2011
New Revision: 1213280

URL: http://svn.apache.org/viewvc?rev=1213280&view=rev
Log:
Don't read past end of input buffer

Even if there is invalid UTF-8.

Modified:
    incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c

Modified: incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
URL: http://svn.apache.org/viewvc/incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c?rev=1213280&r1=1213279&r2=1213280&view=diff
==============================================================================
--- incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c (original)
+++ incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c Mon Dec 12 15:31:03 2011
@@ -104,6 +104,12 @@ StandardTokenizer_transform_text(Standar
 void
 StandardTokenizer_tokenize_str(StandardTokenizer *self, const char *text,
                                size_t len, Inversion *inversion) {
+    if (len >= 1 && (uint8_t)text[len-1] >= 0xC0
+    ||  len >= 2 && (uint8_t)text[len-2] >= 0xE0
+    ||  len >= 3 && (uint8_t)text[len-3] >= 0xF0) {
+        THROW(ERR, "Invalid UTF-8 sequence");
+    }
+
     lucy_StringIter iter = { 0, 0 };
 
     while (iter.byte_pos < len) {
@@ -231,6 +237,9 @@ S_wb_lookup(const char *ptr) {
     size_t plane_id, row_index;
 
     if (start < 0xE0) {
+        if (start < 0xC0) {
+            THROW(ERR, "Invalid UTF-8 sequence");
+        }
         // two byte sequence
         // 110rrrrr 10cccccc
         plane_id  = 0;

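For readers who want to try the check outside of Lucy, here is a minimal
sketch in C. The helper names utf8_tail_is_truncated and utf8_seq_len are
hypothetical and do not appear in StandardTokenizer.c. The first mirrors the
three comparisons the patch adds to StandardTokenizer_tokenize_str; the
second assumes the usual UTF-8 lead-byte ranges and returns 0 where the patch
to S_wb_lookup would THROW. Neither is a full validator: overlong forms,
surrogates and lead bytes above 0xF4 are not rejected. As I read the patch,
the point is only that a decoder which trusts the lead byte stays inside the
buffer once the last three bytes have been checked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the buffer ends in the middle of a
 * multi-byte UTF-8 sequence, judged from the last three bytes only. */
static bool
utf8_tail_is_truncated(const char *text, size_t len) {
    const uint8_t *p = (const uint8_t *)text;
    /* Last byte is a lead byte: at least one continuation byte is missing. */
    if (len >= 1 && p[len - 1] >= 0xC0) { return true; }
    /* Second-to-last byte starts a 3- or 4-byte sequence. */
    if (len >= 2 && p[len - 2] >= 0xE0) { return true; }
    /* Third-to-last byte starts a 4-byte sequence. */
    if (len >= 3 && p[len - 3] >= 0xF0) { return true; }
    return false;
}

/* Hypothetical helper: sequence length implied by a lead byte, or 0 for a
 * stray continuation byte (0x80..0xBF), which cannot start a sequence. */
static int
utf8_seq_len(uint8_t start) {
    if (start < 0x80) { return 1; }  /* ASCII */
    if (start < 0xC0) { return 0; }  /* continuation byte, invalid as lead */
    if (start < 0xE0) { return 2; }  /* 110xxxxx */
    if (start < 0xF0) { return 3; }  /* 1110xxxx */
    return 4;                        /* 11110xxx (not range-checked here) */
}
```

A caller would reject the input when utf8_tail_is_truncated returns true and
only then start advancing by utf8_seq_len, treating a return value of 0 as an
error.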


Re: [lucy-dev] Re: [lucy-commits] svn commit: r1213280 - /incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c

Posted by Nick Wellnhofer <we...@aevum.de>.
On 12/12/11 19:47, Marvin Humphrey wrote:
> On Mon, Dec 12, 2011 at 03:31:04PM -0000, nwellnhof@apache.org wrote:
>> Log:
>> Don't read past end of input buffer
>>
>> Even if there is invalid UTF-8.
>>
>> Modified:
>>      incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
>                                ^^^^^^^^^^^^^^^^^^^^^^
>
> Did you mean to commit this to trunk?

This one already got committed to trunk with the merge. I just forgot to 
commit it to the branch before. I could have simply deleted the branch 
but I decided to make the commit and keep the branch for review.

Nick


[lucy-dev] Re: [lucy-commits] svn commit: r1213280 - /incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Dec 12, 2011 at 03:31:04PM -0000, nwellnhof@apache.org wrote:
> Log:
> Don't read past end of input buffer
> 
> Even if there is invalid UTF-8.
> 
> Modified:
>     incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
                              ^^^^^^^^^^^^^^^^^^^^^^

Did you mean to commit this to trunk?

Marvin Humphrey