Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2011/11/22 23:10:59 UTC

[lucy-dev] Implementing a tokenizer in core

Currently, Lucy only provides the RegexTokenizer, which is implemented on
top of the Perl regex engine. With the help of utf8proc we could
implement a simple but more efficient tokenizer in core, without external
dependencies. Most importantly, we'd have to implement something
similar to the \w regex character class. The Unicode standard [1,2]
recommends that \w be equivalent to
[\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is, the Unicode categories
Letter, Mark, Decimal_Number, Letter_Number, and Connector_Punctuation
plus circled letters. That's exactly how Perl implements \w. Other
implementations like .NET seem to differ slightly [3]. So we could
look up Unicode categories with utf8proc, and then a Perl-compatible check
for a word character would be as easy as
(cat <= 10 || cat == 12 || (c >= 0x24b6 && c <= 0x24e9)).
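
As a rough sketch (not a settled API), the check on top of utf8proc might
look like this, assuming utf8proc's numeric category ordering where 1-10
cover the Letter, Mark, Nd and Nl categories and 12 is Pc; is_word_char()
is just an invented helper name:

    #include <stdbool.h>
    #include <stdint.h>
    #include "utf8proc.h"

    /* Hypothetical helper: Perl-compatible \w test for one code point.
     * Relies on utf8proc's category numbering, where 1-10 cover Letter,
     * Mark, Nd and Nl, and 12 is Pc. */
    static bool
    is_word_char(int32_t c) {
        int cat = utf8proc_get_property(c)->category;
        return (cat >= 1 && cat <= 10)
            || cat == 12
            || (c >= 0x24B6 && c <= 0x24E9);  /* circled letters */
    }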

The default regex in RegexTokenizer also handles apostrophes which I 
don't find very useful personally. But this could also be implemented in 
the core tokenizer.

I'm wondering what other kinds of regexes people are using with
RegexTokenizer, and whether this simple core tokenizer should be 
customizable for some of these use cases.

Nick

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
[3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter

Re: [lucy-dev] Implementing a tokenizer in core

Posted by Peter Karman <pe...@peknet.com>.
Nathan Kurz wrote on 11/25/11 4:35 PM:
>  I'd like to discourage a quoted search for "Proper Name"
> from matching "is that proper?<br>\nName your price," and I think the
> easiest way to do this is by indexing some things that would normally
> be ignored.

The easiest way is to use the libswish3 parser, which automatically bumps the
token position based on HTML constructs like that.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Implementing a tokenizer in core

Posted by Nathan Kurz <na...@verse.com>.
On Tue, Nov 22, 2011 at 6:50 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> I don't think we need to worry much about making this tokenizer flexible.  We
> already offer a certain amount of flexibility via RegexTokenizer.

I agree with this.  I think the number of people that need an
extremely efficient tokenizer that is also extremely flexible is low.
Keep RegexTokenizer as the flexible option, and write this alternative
for greater performance.  Rather than making it completely
configurable, put the emphasis on making it clear, simple, and
independent of the inner workings of Lucy.   Maybe put it in LucyX
(API dogfood), and let it serve as an example for anyone who wants to
write their own.

My tokenizing needs are theoretical at this point, but the areas that
I care about involve tokenizing white space, capitalization, and
markup.   I'd like to discourage a quoted search for "Proper Name"
from matching "is that proper?<br>\nName your price," and I think the
easiest way to do this is by indexing some things that would normally
be ignored.   I also care about punctuation such as Marvin's "Maggie's
Farm" apostrophe example, as well as things like  like
"hyphenated-compound", "C++", "U.S.A.".

--nate

Re: [lucy-dev] Implementing a tokenizer in core

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 30, 2011 at 11:29:26PM +0100, Nick Wellnhofer wrote:
> OK, things are getting a little more complicated. I'd also like to  
> generate some #defines along with the tables, so I could either generate  
> a separate .h file, or I could simply create a single .c file that gets  
> included by another .c file. This is not very tasteful but it would  
> simplify things.

All of those sound fine to me.  Sounds like you like the .h file option best,
so +1 to that.

> Another question: The perl script that generates the tables uses text  
> files from http://www.unicode.org/Public/UNIDATA/. Should we bundle  
> these files with Lucy?

How about we provide a link in the script's docs to the monolithic archive of
the version of those files we want to use?  For instance:

    http://www.unicode.org/Public/6.0.0/ucd/UCD.zip

Then the script can just take an arg to the expanded directory.

    perl devel/bin/gen_uniprops.pl /path/to/UCD

We can also bundle if you prefer (the license allows it) -- it's just a little
more work and a little more bandwidth.

Marvin Humphrey


Re: [lucy-dev] Implementing a tokenizer in core

Posted by Nick Wellnhofer <we...@aevum.de>.
On 30/11/11 17:04, Marvin Humphrey wrote:
> The script likely belongs in trunk/devel/bin.
>
> The file with the generated tables could arguably go in a few different
> places.  I would suggest either trunk/core/Lucy/Analysis/WordBreakTables.c
> if the tables are specialized, or trunk/core/Lucy/Util/UnicodeProperties.c if
> we anticipate adding more tables in the future.

OK, things are getting a little more complicated. I'd also like to 
generate some #defines along with the tables, so I could either generate 
a separate .h file, or I could simply create a single .c file that gets 
included by another .c file. This is not very tasteful but it would 
simplify things.
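
Just to make that concrete, the generated header might look roughly like
this -- all file and symbol names below are placeholders, nothing is
settled yet:

    /* WordBreak.h -- generated by a devel/bin script.  DO NOT EDIT. */

    #include <stdint.h>

    #define WB_OTHER    0
    #define WB_ALETTER  1
    #define WB_NUMERIC  2
    /* ... one #define per Word_Break property value ... */

    extern const uint8_t  wb_pages[][256];   /* property values per page  */
    extern const uint16_t wb_page_index[];   /* code point >> 8 -> page   */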

Another question: The perl script that generates the tables uses text 
files from http://www.unicode.org/Public/UNIDATA/. Should we bundle 
these files with Lucy?

Nick

Re: [lucy-dev] Implementing a tokenizer in core

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 30, 2011 at 08:04:29AM -0800, Marvin Humphrey wrote:
> Lastly, we will need to add adapt LICENSE and NOTICE to accommodate the new
> data.  I'll start a new thread for that, as recent conversations on
> general@incubator indicated the need for additional changes to our LICENSE and
> NOTICE files.

OK, I've dealt with updating NOTICE.  I'm now of the opinion that your
generated files won't require an entry in NOTICE.

We'll still need an entry in LICENSE.  In my opinion, we should just ignore
the existing Unicode license which applies to utf8proc, and add a dedicated
entry.

I suggest modifying LICENSE per the patch below.

Marvin Humphrey


diff --git a/LICENSE b/LICENSE
index 85054d2..790c467 100644
--- a/LICENSE
+++ b/LICENSE
@@ -232,6 +232,47 @@ license for those materials:
     OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
     OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+This product contains materials derived from files licensed by the Unicode
+consortium.  Here is the license for those materials:
+
+    COPYRIGHT AND PERMISSION NOTICE
+
+    Copyright (c) 1991-2011 Unicode, Inc. All rights reserved. Distributed
+    under the Terms of Use in http://www.unicode.org/copyright.html.
+
+    Permission is hereby granted, free of charge, to any person obtaining a
+    copy of the Unicode data files and any associated documentation (the "Data
+    Files") or Unicode software and any associated documentation (the
+    "Software") to deal in the Data Files or Software without restriction,
+    including without limitation the rights to use, copy, modify, merge,
+    publish, distribute, and/or sell copies of the Data Files or Software, and
+    to permit persons to whom the Data Files or Software are furnished to do
+    so, provided that (a) the above copyright notice(s) and this permission
+    notice appear with all copies of the Data Files or Software, (b) both the
+    above copyright notice(s) and this permission notice appear in associated
+    documentation, and (c) there is clear notice in each modified Data File or
+    in the Software as well as in the documentation associated with the Data
+    File(s) or Software that the data or software has been modified.
+
+    THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
+    KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+    MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF
+    THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS
+    INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR
+    CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF
+    USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
+    TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
+    PERFORMANCE OF THE DATA FILES OR SOFTWARE.
+
+    Except as contained in this notice, the name of a copyright holder shall
+    not be used in advertising or otherwise to promote the sale, use or other
+    dealings in these Data Files or Software without prior written
+    authorization of the copyright holder.
+
+    Unicode and the Unicode logo are trademarks of Unicode, Inc., and may be
+    registered in some jurisdictions. All other trademarks and registered
+    trademarks mentioned herein are the property of their respective owners.
+
 Portions of the utf8proc library are bundled with this distribution under
 modules/unicode/utf8proc.  The utf8proc library also contains materials
 derived from files licensed by the Unicode consortium.  Here are the licenses




Re: [lucy-dev] Implementing a tokenizer in core

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 30, 2011 at 04:40:00PM +0100, Nick Wellnhofer wrote:
> I had a closer look at the word boundary rules in UAX #29, and they  
> shouldn't be too hard to implement without using an external library. I  
> started with an initial prototype and it looks very promising.

Fantastic!

> In order to lookup the Word_Break property values, we have to precompute  
> a few tables. I would write a Perl script for that. The tables can be  
> generated once and shipped with the source code much like the tables for  
> utf8proc. I'm not sure where to put that script and the generated  
> tables, though.

The script likely belongs in trunk/devel/bin.

The file with the generated tables could arguably go in a few different
places.  I would suggest either trunk/core/Lucy/Analysis/WordBreakTables.c
if the tables are specialized, or trunk/core/Lucy/Util/UnicodeProperties.c if
we anticipate adding more tables in the future.

The generated file will need to embed the Unicode license, and should not have
an ALv2 license header.  We will also need to add an entry in
trunk/devel/conf/rat-excludes so that the generated file doesn't get flagged
by the Apache RAT[1] license header check run by buildbot at
<http://ci.apache.org/projects/lucy/rat-output.html>[2].
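
For instance, assuming the Analysis location suggested above, the
rat-excludes entry would presumably just be the generated file's path on a
line of its own:

    core/Lucy/Analysis/WordBreakTables.c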

Lastly, we will need to adapt LICENSE and NOTICE to accommodate the new
data.  I'll start a new thread for that, as recent conversations on
general@incubator indicated the need for additional changes to our LICENSE and
NOTICE files.

Marvin Humphrey

[1] http://incubator.apache.org/rat/
[2] Whoops, this is failing right now.  We need to deal with this before
    0.3.0.


Re: [lucy-dev] Implementing a tokenizer in core

Posted by Nick Wellnhofer <we...@aevum.de>.
On 24/11/2011 22:41, Marvin Humphrey wrote:
> On Wed, Nov 23, 2011 at 10:53:54PM +0100, Nick Wellnhofer wrote:
>> On 23/11/11 03:50, Marvin Humphrey wrote:
>>> How about making this tokenizer implement the word break rules described in
>>> the Unicode standard annex on Text Segmentation?  That's what the Lucene
>>> StandardTokenizer does (as of 3.1).
>>
>> That would certainly be a nice choice for the default tokenizer. It
>> would be easy to implement with ICU but utf8proc doesn't buy us much
>> here.
>
> Hmm, that's unfortunate.  I think this would be a very nice feature to offer.

I had a closer look at the word boundary rules in UAX #29, and they 
shouldn't be too hard to implement without using an external library. I 
started with an initial prototype and it looks very promising.

In order to look up the Word_Break property values, we have to precompute 
a few tables. I would write a Perl script for that. The tables can be 
generated once and shipped with the source code much like the tables for 
utf8proc. I'm not sure where to put that script and the generated 
tables, though.
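
To give an idea of the kind of tables I mean, one common layout is a
two-stage lookup: a page index keyed on the high bits of the code point,
plus per-page arrays of Word_Break values.  Everything below is just a
sketch with placeholder names and data:

    #include <stdint.h>

    enum { WB_OTHER = 0, WB_ALETTER = 1 /* ..., one per Word_Break value */ };

    /* Placeholder tables; the real contents would be generated from the UCD. */
    static const uint8_t  wb_pages[][256]       = { { WB_OTHER } };
    static const uint16_t wb_page_index[0x1100] = { 0 };  /* 0x110000 >> 8 pages */

    static int
    word_break_prop(int32_t c) {
        if (c < 0 || c > 0x10FFFF) { return WB_OTHER; }
        return wb_pages[wb_page_index[c >> 8]][c & 0xFF];
    }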

Nick

Re: [lucy-dev] Implementing a tokenizer in core

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Nov 23, 2011 at 10:53:54PM +0100, Nick Wellnhofer wrote:
> On 23/11/11 03:50, Marvin Humphrey wrote:
>> How about making this tokenizer implement the word break rules described in
>> the Unicode standard annex on Text Segmentation?  That's what the Lucene
>> StandardTokenizer does (as of 3.1).
>
> That would certainly be a nice choice for the default tokenizer. It  
> would be easy to implement with ICU but utf8proc doesn't buy us much 
> here.

Hmm, that's unfortunate.  I think this would be a very nice feature to offer.

>> I don't think we need to worry much about making this tokenizer flexible.  We
>> already offer a certain amount of flexibility via RegexTokenizer.
>
> Yes, making this tokenizer customizable probably isn't worth the effort.  
> I'd be happy with a simple tokenizer that extracts \w+ tokens. I can  
> offer to implement such a tokenizer if it's deemed useful.

A straight-up \w+ tokenizer wouldn't be optimal for English, at least.  It
would break on apostrophes, resulting in a large number of solitary 's' tokens
thanks to possessives and contractions -- e.g. "maggie's farm" would tokenize
as ["maggie", "s", "farm"] instead of ["maggie's", "farm"].

Marvin Humphrey


Re: [lucy-dev] Implementing a tokenizer in core

Posted by Nick Wellnhofer <we...@aevum.de>.
On 23/11/11 03:50, Marvin Humphrey wrote:
> How about making this tokenizer implement the word break rules described in
> the Unicode standard annex on Text Segmentation?  That's what the Lucene
> StandardTokenizer does (as of 3.1).

That would certainly be a nice choice for the default tokenizer. It 
would be easy to implement with ICU but utf8proc doesn't buy us much here.

> I don't think we need to worry much about making this tokenizer flexible.  We
> already offer a certain amount of flexibility via RegexTokenizer.

Yes, making this tokenizer customizable probably isn't worth the effort. 
I'd be happy with a simple tokenizer that extracts \w+ tokens. I can 
offer to implement such a tokenizer if it's deemed useful.

Nick

Re: [lucy-dev] Implementing a tokenizer in core

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Nov 22, 2011 at 11:10:59PM +0100, Nick Wellnhofer wrote:
> With the help of utf8proc we could implement a simple but more efficient
> tokenizer without external dependencies in core.

I like the idea.  It would be less flexible, but that's not a problem if we
continue to offer RegexTokenizer in addition to this one.

> Most important, we'd have to implement something similar to the \w regex
> character class.

Just a passing thought: I wonder if we could abuse the Lemon parser generator
for this.  It's a parser, not a lexer, but...

It would potentially be easier for sophisticated users to hack a grammar file
than a hand-coded lexer.

It would also be nice to use Lemon as much as we can so that more people get
familiar with it, and thus able to maintain all parts of Lucy that use it.

> The default regex in RegexTokenizer also handles apostrophes which I  
> don't find very useful personally. But this could also be implemented in  
> the core tokenizer.

How about making this tokenizer implement the word break rules described in
the Unicode standard annex on Text Segmentation?  That's what the Lucene
StandardTokenizer does (as of 3.1).

    http://unicode.org/reports/tr29/

I don't think we need to worry much about making this tokenizer flexible.  We
already offer a certain amount of flexibility via RegexTokenizer.

Marvin Humphrey


Re: [lucy-dev] Implementing a tokenizer in core

Posted by Dan Markham <dm...@gmail.com>.
A quick grep in my code base turns up these:

'[^¡]+'                     --- crazy Unicode char chosen to be unique
'[^\x{1}]+'                 --- another crazy unique char
'\S+'                       --- we use this a lot so we don't get hit by strings with hyphens in them
'\w+(?:[\'\x{2019}]\w+)*'   --- the default



-Dan




On Nov 22, 2011, at 2:10 PM, Nick Wellnhofer wrote:

> Currently, Lucy only provides the RegexTokenizer which is implemented on top of the perl regex engine. With the help of utf8proc we could implement a simple but more efficient tokenizer without external dependencies in core. Most important, we'd have to implement something similar to the \w regex character class. The Unicode standard [1,2] recommends that \w is equivalent to [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is Unicode categories Letter, Mark, Decimal_Number, Letter_Number, and Connector_Punctuation plus circled letters. That's exactly how perl implements \w. Other implementations like .NET seem to differ slightly [3]. So we could lookup Unicode categories with utf8proc and then a perl-compatible check for a word character would be as easy as (cat <= 10 || cat == 12 || c >= 0x24b6 && c <= 0x24e9).
> 
> The default regex in RegexTokenizer also handles apostrophes which I don't find very useful personally. But this could also be implemented in the core tokenizer.
> 
> I'm wondering what other kind of regexes people are using with RegexTokenizer, and whether this simple core tokenizer should be customizable for some of these use cases.
> 
> Nick
> 
> [1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
> [2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
> [3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter


Re: [lucy-dev] Implementing a tokenizer in core

Posted by Peter Karman <pe...@peknet.com>.
Nick Wellnhofer wrote on 11/22/11 4:10 PM:
> Currently, Lucy only provides the RegexTokenizer which is implemented on top of
> the perl regex engine. With the help of utf8proc we could implement a simple but
> more efficient tokenizer without external dependencies in core. Most important,
> we'd have to implement something similar to the \w regex character class. The
> Unicode standard [1,2] recommends that \w is equivalent to
> [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is Unicode categories Letter,
> Mark, Decimal_Number, Letter_Number, and Connector_Punctuation plus circled
> letters. That's exactly how perl implements \w. Other implementations like .NET
> seem to differ slightly [3]. So we could lookup Unicode categories with utf8proc
> and then a perl-compatible check for a word character would be as easy as (cat
> <= 10 || cat == 12 || c >= 0x24b6 && c <= 0x24e9).
> 
> The default regex in RegexTokenizer also handles apostrophes which I don't find
> very useful personally. But this could also be implemented in the core tokenizer.
> 
> I'm wondering what other kind of regexes people are using with RegexTokenizer,
> and whether this simple core tokenizer should be customizable for some of these
> use cases.

When I use Lucy I use the default regex. That's mostly because I know my
collections are en_US. AFAIK, a language|locale-aware tokenizer would need to
discriminate "word" boundaries, for which \w might be too blunt an instrument.

I agree that a core tokenizer would be a Good Thing.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com