Posted to java-user@lucene.apache.org by "Igal @ getRailo.org" <ig...@getrailo.org> on 2012/11/04 00:35:29 UTC

using CharFilter to inject a space

hi,

I want to make sure that every comma (,) and semi-colon (;) is followed 
by a space prior to tokenizing.

the idea is to then use a WhitespaceTokenizer which will keep commas but 
still split the phrase in a case like:

     "I bought red apples,green pears,and yellow oranges"

I'm thinking of extending CharFilter to "inject" a space after the 
comma.  my questions are:

     1) does it make sense or am I completely off here?

     2) are there any code examples of CharFilter implementations with 
injection of a char?

TIA
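
For readers following along, here is a minimal plain-Java sketch of the intended effect. It does not use Lucene: String.replace stands in for the char filter and a split on whitespace stands in for WhitespaceTokenizer. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: plain-Java stand-ins, not Lucene classes.
final class CommaSpaceDemo {

    // Stand-in for the char filter: put a space after every ',' and ';'.
    static String injectSpaces(String text) {
        return text.replace(",", ", ").replace(";", "; ");
    }

    // Stand-in for a whitespace tokenizer: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String phrase = "I bought red apples,green pears,and yellow oranges";
        // Commas stay attached to the preceding word, but the words are split.
        System.out.println(whitespaceTokens(injectSpaces(phrase)));
    }
}
```

With the phrase above, the tokens come out as "I", "bought", "red", "apples,", "green", "pears,", "and", "yellow", "oranges".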

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: using CharFilter to inject a space

Posted by Jack Krupansky <ja...@basetechnology.com>.
I still think that we're looking at an "XY Problem" here, haggling over a 
"solution" when the problem has not been clearly and fully stated.

In particular, rather than parsing straight natural language text, the data 
appears to have a structured form. Until the structure is fully defined, 
detailing a parser, especially by playing games such as "injecting spaces" 
is an exercise in futility. I mean, you MIGHT come up with a solution that 
SEEMS to work (at least for SOME cases), and MAY make you happy, but I would 
hate to see other Lucene users adopt such an approach to problem solving.

Tell us the full problem and then we can focus on legitimate "solutions".

-- Jack Krupansky

-----Original Message----- 
From: Erick Erickson
Sent: Sunday, November 04, 2012 8:06 AM
To: java-user
Subject: Re: using CharFilter to inject a space

Ahh, I don't know of a better way. I can imagine complex solutions
involving something akin to WordDelimiterFilter... and I can imagine that
that would be ridiculously expensive to maintain when there are really
simple solutions like you're looking at.

Mostly I was curious about your use-case....

Erick




Re: using CharFilter to inject a space

Posted by Erick Erickson <er...@gmail.com>.
Ahh, I don't know of a better way. I can imagine complex solutions
involving something akin to WordDelimiterFilter... and I can imagine that
that would be ridiculously expensive to maintain when there are really
simple solutions like you're looking at.

Mostly I was curious about your use-case....

Erick


On Sat, Nov 3, 2012 at 11:35 PM, Igal @ getRailo.org <ig...@getrailo.org> wrote:

> well, my main goal is to use a ShingleFilter that will only take shingles
> that are not separated by commas etc.
>
> for example, the phrase:
>
>     "red apples, green tomatoes, and brown potatoes"
>
> should yield the shingles "red apples", "green tomatoes", "and brown",
> "brown potatoes"; but not "apples green" and not "tomatoes and" as those
> are separated by commas.
>
> the problem with the common tokenizers is that they get rid of the commas
> so if I use a ShingleFilter after them there's no way to tell if there was
> a comma there or not.
>
> (another option I consider is to add an Attribute to specify if there was
> a comma before or after a token)
>
> if there's a better way -- I'm open to suggestions,
>
>
> Igal

Re: using CharFilter to inject a space

Posted by "Igal @ getRailo.org" <ig...@getrailo.org>.
well, my main goal is to use a ShingleFilter that will only take 
shingles that are not separated by commas etc.

for example, the phrase:

     "red apples, green tomatoes, and brown potatoes"

should yield the shingles "red apples", "green tomatoes", "and brown", 
"brown potatoes"; but not "apples green" and not "tomatoes and" as those 
are separated by commas.

the problem with the common tokenizers is that they get rid of the 
commas so if I use a ShingleFilter after them there's no way to tell if 
there was a comma there or not.

(another option I consider is to add an Attribute to specify if there 
was a comma before or after a token)

if there's a better way -- I'm open to suggestions,


Igal
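
To make the target behaviour concrete, here is a small plain-Java sketch of shingling that never crosses a comma or semicolon. It is not Lucene's ShingleFilter and does not build a token stream; the class and method names are hypothetical, and splitting the text into segments first is just one simple way to get the result described above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, not Lucene's ShingleFilter.
final class ClauseShingles {

    // Two-word shingles that never span a ',' or ';'.
    static List<String> bigrams(String text) {
        List<String> shingles = new ArrayList<>();
        // Each comma- or semicolon-delimited segment is shingled on its own.
        for (String segment : text.split("[,;]")) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] words = trimmed.split("\\s+");
            for (int i = 0; i + 1 < words.length; i++) {
                shingles.add(words[i] + " " + words[i + 1]);
            }
        }
        return shingles;
    }

    public static void main(String[] args) {
        for (String shingle : bigrams("red apples, green tomatoes, and brown potatoes")) {
            System.out.println(shingle);
        }
    }
}
```

For the example phrase this yields "red apples", "green tomatoes", "and brown" and "brown potatoes", and neither "apples green" nor "tomatoes and".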


On 11/3/2012 8:10 PM, Erick Erickson wrote:
> So I've gotta ask... _why_ do you want to inject the spaces?
> If it's just to break this up into tokens,  wouldn't something like
> LetterTokenizer do? Assuming you aren't interested in
> leaving in numbers.... Or even StandardTokenizer unless you have
> e-mail & etc.
>
> Or what about PatternReplaceCharFilter?
>
> FWIW,
> Erick


Re: using CharFilter to inject a space

Posted by Erick Erickson <er...@gmail.com>.
So I've gotta ask... _why_ do you want to inject the spaces?
If it's just to break this up into tokens,  wouldn't something like
LetterTokenizer do? Assuming you aren't interested in
leaving in numbers.... Or even StandardTokenizer unless you have
e-mail & etc.

Or what about PatternReplaceCharFilter?

FWIW,
Erick
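
For reference, a sketch of the kind of pattern one might hand to a regex-based char filter such as PatternReplaceCharFilter. Only java.util.regex is used here, so the example runs without Lucene; the pattern, the class name and the method name are assumptions for illustration, not something taken from this thread.

```java
import java.util.regex.Pattern;

// Illustration only: java.util.regex, no Lucene classes involved.
final class PunctuationSpacing {

    // A ',' or ';' that is immediately followed by a non-whitespace character.
    private static final Pattern NEEDS_SPACE = Pattern.compile("([,;])(?=\\S)");

    // Insert a single space after each match; punctuation that is already
    // followed by whitespace (or that ends the text) is left alone.
    static String ensureSpaceAfter(String text) {
        return NEEDS_SPACE.matcher(text).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        System.out.println(ensureSpaceAfter("red apples,green pears, and;yellow oranges"));
        // prints "red apples, green pears, and; yellow oranges"
    }
}
```

Unlike a plain "," to ", " mapping, the lookahead avoids doubling the space where one is already present.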



On Sat, Nov 3, 2012 at 9:22 PM, Igal Sapir <ig...@getrailo.org> wrote:

> You're right.  I'm not sure what I was thinking.
>
> Thanks for all your help,
>
> Igal

Re: using CharFilter to inject a space

Posted by Igal Sapir <ig...@getrailo.org>.
You're right.  I'm not sure what I was thinking.

Thanks for all your help,

Igal
 On Nov 3, 2012 5:44 PM, "Robert Muir" <rc...@gmail.com> wrote:

> On Sat, Nov 3, 2012 at 8:32 PM, Igal @ getRailo.org <ig...@getrailo.org> wrote:
> > hi Robert,
> >
> > thank you for your replies.
> >
> > I couldn't find much documentation/examples of this, but this is what I
> > came up with (below).  is that the way I'm supposed to use the
> > MappingCharFilter?
> >
>
> You don't need to extend anything.
> You also don't want to create a normalizecharmap for each reader
> (thats way too heavy)
>
> Just build the NormalizeCharMap once, and pass it to
> MappingCharFilter's Constructor.

Re: using CharFilter to inject a space

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Nov 3, 2012 at 8:32 PM, Igal @ getRailo.org <ig...@getrailo.org> wrote:
> hi Robert,
>
> thank you for your replies.
>
> I couldn't find much documentation/examples of this, but this is what I came
> up with (below).  is that the way I'm supposed to use the MappingCharFilter?
>

You don't need to extend anything.
You also don't want to create a NormalizeCharMap for each reader
(that's way too heavy).

Just build the NormalizeCharMap once, and pass it to
MappingCharFilter's constructor.
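
A minimal sketch of that advice, assuming Lucene 4.0 or later with the analyzers-common module on the classpath (where MappingCharFilter is itself a Reader). The class name CommaSpaceMapping and its helper methods are hypothetical; only NormalizeCharMap.Builder, its add() and build() methods, and the MappingCharFilter constructor come from Lucene.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

// Hypothetical helper class; not part of Lucene.
final class CommaSpaceMapping {

    // Built once and shared by every reader that gets wrapped.
    private static final NormalizeCharMap MAP = buildMap();

    private static NormalizeCharMap buildMap() {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(",", ", ");
        builder.add(";", "; ");
        return builder.build();
    }

    // No subclass needed: pass the shared map to MappingCharFilter's constructor.
    static Reader wrap(Reader input) {
        return new MappingCharFilter(MAP, input);
    }

    // Demo helper: run a string through the filter and collect the result.
    static String apply(String text) {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[256];
        try (Reader filtered = wrap(new StringReader(text))) {
            int n;
            while ((n = filtered.read(buffer, 0, buffer.length)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("red apples,green pears;plums"));
    }
}
```

In an analyzer, the wrap() call would typically go wherever the reader is prepared before tokenizing; the exact hook depends on the Lucene version in use.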



Re: using CharFilter to inject a space

Posted by "Igal @ getRailo.org" <ig...@getrailo.org>.
hi Robert,

thank you for your replies.

I couldn't find much documentation/examples of this, but this is what I 
came up with (below).  is that the way I'm supposed to use the 
MappingCharFilter?

also, if that is the correct way, wouldn't it make sense to return a 
reference to "this" from NormalizeCharMap.Builder.add() so that we can 
chain the calls to add() like so:

     builder.add( ",", ", " ).add( ";", "; " ).build()

thanks,

Igal


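     import java.io.Reader;

     import org.apache.lucene.analysis.charfilter.MappingCharFilter;
     import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

     public class CommaSpaceCharFilter extends MappingCharFilter {

         public CommaSpaceCharFilter( Reader input ) {

             // note: getMap() runs for every new filter, so the map is rebuilt each time
             super( getMap(), input );
         }

         // maps "," to ", " and ";" to "; "
         final static NormalizeCharMap getMap() {

             NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();

             builder.add( ",", ", " );
             builder.add( ";", "; " );

             NormalizeCharMap ncm = builder.build();

             return ncm;
         }
     }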
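
The chaining idea asked about above is the usual fluent-builder pattern. A sketch in plain Java, with a hypothetical MappingBuilder standing in for NormalizeCharMap.Builder; this is not Lucene's class and makes no claim about what Lucene's add() returns.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not a Lucene class.
final class MappingBuilder {
    private final Map<String, String> mappings = new LinkedHashMap<>();

    // Returning "this" is what makes the calls chainable.
    MappingBuilder add(String match, String replacement) {
        mappings.put(match, replacement);
        return this;
    }

    // Hand out a read-only copy so later add() calls cannot change it.
    Map<String, String> build() {
        return Collections.unmodifiableMap(new LinkedHashMap<>(mappings));
    }

    public static void main(String[] args) {
        Map<String, String> map = new MappingBuilder().add(",", ", ").add(";", "; ").build();
        System.out.println(map.size());
    }
}
```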



On 11/3/2012 5:13 PM, Robert Muir wrote:
> On Sat, Nov 3, 2012 at 7:47 PM, Igal @ getRailo.org <ig...@getrailo.org> wrote:
>> I considered it, and it's definitely an option.
>>
>> but I read in the book "Lucene In Action" that MappingCharFilter is
>> inefficient and I'm not sure that I need that.  if implementing my own
>> involves a lot of coding then I might resort to it as I don't have large
>> data sets to index at this time.
> Also I think (dont remember off the top of my head) that this note in
> Lucene in Action refers to the fact that its base class
> (BaseCharFilter) corrected offsets in O(n) at the time.
>
> We fixed this to be O(log(N)) here as of 3.1:
> https://issues.apache.org/jira/browse/LUCENE-2098
>
> So I think its worth giving it a try before trying to code something yourself!


Re: using CharFilter to inject a space

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Nov 3, 2012 at 7:47 PM, Igal @ getRailo.org <ig...@getrailo.org> wrote:
> I considered it, and it's definitely an option.
>
> but I read in the book "Lucene In Action" that MappingCharFilter is
> inefficient and I'm not sure that I need that.  if implementing my own
> involves a lot of coding then I might resort to it as I don't have large
> data sets to index at this time.

Also, I think (I don't remember off the top of my head) that this note in
Lucene in Action refers to the fact that its base class
(BaseCharFilter) corrected offsets in O(n) at the time.

We fixed this to be O(log(N)) here as of 3.1:
https://issues.apache.org/jira/browse/LUCENE-2098

So I think it's worth giving it a try before trying to code something yourself!
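
To illustrate what "correcting offsets" means and how a lookup can be O(log N), here is a plain-Java sketch. It is a hypothetical helper, not Lucene's BaseCharFilter code: it records, for each injected space, the output offset from which a new cumulative shift applies, and uses a binary search over those records to map an offset in the filtered text back to the original text.

```java
import java.util.Arrays;

// Illustration only: a hypothetical helper, not Lucene's BaseCharFilter.
final class OffsetCorrector {
    private int[] offsets = new int[8]; // output offsets, in ascending order
    private int[] diffs = new int[8];   // cumulative shift that applies from that offset on
    private int size = 0;

    // Record that from outputOffset onward, inputOffset = outputOffset + cumulativeDiff.
    void record(int outputOffset, int cumulativeDiff) {
        if (size == offsets.length) {
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = outputOffset;
        diffs[size] = cumulativeDiff;
        size++;
    }

    // Binary search for the last recorded offset <= outputOffset: O(log N).
    int correct(int outputOffset) {
        int lo = 0;
        int hi = size - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offsets[mid] <= outputOffset) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found < 0 ? outputOffset : outputOffset + diffs[found];
    }

    // Copy the text, adding a space after each ',' or ';' and recording the shift.
    static String inject(String input, OffsetCorrector corrector) {
        StringBuilder out = new StringBuilder();
        int injected = 0;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            out.append(ch);
            if (ch == ',' || ch == ';') {
                out.append(' ');
                injected++;
                // Everything after the injected space is shifted by one more char.
                corrector.record(out.length(), -injected);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        OffsetCorrector corrector = new OffsetCorrector();
        String filtered = inject("a,b;c", corrector);
        // 'c' sits at offset 6 in the filtered text and at offset 4 in the original.
        System.out.println(filtered + " -> " + corrector.correct(filtered.indexOf('c')));
    }
}
```

A linear scan over the same records would give the O(n) behaviour; keeping them sorted is what allows the binary search.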



Re: using CharFilter to inject a space

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Nov 3, 2012 at 7:47 PM, Igal @ getRailo.org <ig...@getrailo.org> wrote:
> I considered it, and it's definitely an option.
>
> but I read in the book "Lucene In Action" that MappingCharFilter is
> inefficient and I'm not sure that I need that.  if implementing my own
> involves a lot of coding then I might resort to it as I don't have large
> data sets to index at this time.
>

The implementation changed for 4.0:
https://issues.apache.org/jira/browse/LUCENE-3830



Re: using CharFilter to inject a space

Posted by "Igal @ getRailo.org" <ig...@getrailo.org>.
I considered it, and it's definitely an option.

but I read in the book "Lucene In Action" that MappingCharFilter is 
inefficient and I'm not sure that I need that.  if implementing my own 
involves a lot of coding then I might resort to it as I don't have large 
data sets to index at this time.

thanks for your answer,


Igal


On 11/3/2012 4:42 PM, Robert Muir wrote:
> On Sat, Nov 3, 2012 at 7:35 PM, Igal @ getRailo.org <ig...@getrailo.org> wrote:
>> hi,
>>
>> I want to make sure that every comma (,) and semi-colon (;) is followed by a
>> space prior to tokenizing.
>>
>> the idea is to then use a WhitespaceTokenizer which will keep commas but
>> still split the phrase in a case like:
>>
>>      "I bought red apples,green pears,and yellow oranges"
>>
>> I'm thinking of extending CharFilter to "inject" a space after the comma.
>> my questions are:
>>
>>      1) does it make sense or am I completely off here?
>>
>>      2) are there any code examples of CharFilter implementations with
>> injection of a char?
> Can't you just use something like MappingCharFilter with a single
> mapping of "," to ", " ?


Re: using CharFilter to inject a space

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Nov 3, 2012 at 7:35 PM, Igal @ getRailo.org <ig...@getrailo.org> wrote:
> hi,
>
> I want to make sure that every comma (,) and semi-colon (;) is followed by a
> space prior to tokenizing.
>
> the idea is to then use a WhitespaceTokenizer which will keep commas but
> still split the phrase in a case like:
>
>     "I bought red apples,green pears,and yellow oranges"
>
> I'm thinking of extending CharFilter to "inject" a space after the comma.
> my questions are:
>
>     1) does it make sense or am I completely off here?
>
>     2) are there any code examples of CharFilter implementations with
> injection of a char?

Can't you just use something like MappingCharFilter with a single
mapping of "," to ", " ?
