Posted to commits@lucy.apache.org by nw...@apache.org on 2011/12/12 16:31:04 UTC
[lucy-commits] svn commit: r1213280 - /incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
Author: nwellnhof
Date: Mon Dec 12 15:31:03 2011
New Revision: 1213280
URL: http://svn.apache.org/viewvc?rev=1213280&view=rev
Log:
Don't read past end of input buffer
Even if there is invalid UTF-8.
Modified:
incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
Modified: incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
URL: http://svn.apache.org/viewvc/incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c?rev=1213280&r1=1213279&r2=1213280&view=diff
==============================================================================
--- incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c (original)
+++ incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c Mon Dec 12 15:31:03 2011
@@ -104,6 +104,12 @@ StandardTokenizer_transform_text(Standar
 void
 StandardTokenizer_tokenize_str(StandardTokenizer *self, const char *text,
                                size_t len, Inversion *inversion) {
+    if (len >= 1 && (uint8_t)text[len-1] >= 0xC0
+        || len >= 2 && (uint8_t)text[len-2] >= 0xE0
+        || len >= 3 && (uint8_t)text[len-3] >= 0xF0) {
+        THROW(ERR, "Invalid UTF-8 sequence");
+    }
+
     lucy_StringIter iter = { 0, 0 };
     while (iter.byte_pos < len) {
@@ -231,6 +237,9 @@ S_wb_lookup(const char *ptr) {
     size_t plane_id, row_index;
     if (start < 0xE0) {
+        if (start < 0xC0) {
+            THROW(ERR, "Invalid UTF-8 sequence");
+        }
         // two byte sequence
         // 110rrrrr 10cccccc
         plane_id = 0;
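
For readers following the first hunk: the check looks only at the last three bytes of the buffer and rejects any lead byte that announces more continuation bytes than remain, so a decoder that trusts the lead byte cannot step past `len`. Below is a minimal standalone sketch of the same test. The helper name `utf8_tail_overruns` is hypothetical and not part of Lucy, and it returns a flag where the commit calls THROW.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Lucy API): returns true if a lead byte near
 * the end of the buffer announces more continuation bytes than the
 * buffer still holds. */
static bool
utf8_tail_overruns(const char *text, size_t len) {
    const uint8_t *p = (const uint8_t *)text;

    /* Last byte is a lead byte (>= 0xC0): at least 1 more byte needed. */
    if (len >= 1 && p[len - 1] >= 0xC0) { return true; }

    /* Second-to-last byte starts a 3- or 4-byte sequence (>= 0xE0):
     * at least 2 more bytes needed, only 1 available. */
    if (len >= 2 && p[len - 2] >= 0xE0) { return true; }

    /* Third-to-last byte starts a 4-byte sequence (>= 0xF0):
     * 3 more bytes needed, only 2 available. */
    if (len >= 3 && p[len - 3] >= 0xF0) { return true; }

    return false;
}
```

This only guards against truncation at the end of the input. It does not validate the rest of the buffer, which is presumably why the second hunk adds a separate lead-byte check in S_wb_lookup.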
Re: [lucy-dev] Re: [lucy-commits] svn commit: r1213280 - /incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
Posted by Nick Wellnhofer <we...@aevum.de>.
On 12/12/11 19:47, Marvin Humphrey wrote:
> On Mon, Dec 12, 2011 at 03:31:04PM -0000, nwellnhof@apache.org wrote:
>> Log:
>> Don't read past end of input buffer
>>
>> Even if there is invalid UTF-8.
>>
>> Modified:
>> incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
> ^^^^^^^^^^^^^^^^^^^^^^
>
> Did you mean to commit this to trunk?
This one already got committed to trunk with the merge. I just forgot to
commit it to the branch before. I could have simply deleted the branch
but I decided to make the commit and keep the branch for review.
Nick
[lucy-dev] Re: [lucy-commits] svn commit: r1213280 - /incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Dec 12, 2011 at 03:31:04PM -0000, nwellnhof@apache.org wrote:
> Log:
> Don't read past end of input buffer
>
> Even if there is invalid UTF-8.
>
> Modified:
> incubator/lucy/branches/LUCY-196-uax-tokenizer/core/Lucy/Analysis/StandardTokenizer.c
^^^^^^^^^^^^^^^^^^^^^^
Did you mean to commit this to trunk?
Marvin Humphrey