Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2011/01/20 14:19:37 UTC

Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch

I am trying to extend MappingCharFilter so that it only changes a token if
the length of the token matches the length of singleMatch in
NormalizeCharMap (currently the singleMatch just has to be found in the
token; I want it to match the whole token). Can this be done? It sounds
simple enough, but I cannot make any headway understanding the
MappingCharFilter source code.

thanks Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch

Posted by Paul Taylor <pa...@fastmail.fm>.
On 29/01/2011 01:45, Koji Sekiguchi wrote:
> (11/01/25 2:14), Paul Taylor wrote:
>> On 22/01/2011 15:43, Koji Sekiguchi wrote:
>>> (11/01/20 22:19), Paul Taylor wrote:
>>>> Trying to extend MappingCharFilter so that it only changes a token 
>>>> if the length of the token
>>>> matches the length of singleMatch in NormalizeCharMap (currently 
>>>> the singleMatch just has to be
>>>> found in the token; I want it to match the whole token). Can this be
>>>> done it sounds simple enough but
>>>> I cannot make any headway understanding the MappingCharFilter 
>>>> source code
>>>>
>>>> thanks Paul
>>>
>>> Paul,
>>>
>>> Can you give us a concrete input/output (you wanted) with mapping table
>>> so that I can understand what you want?
>>>
>>> Thanks,
>>>
>>> Koji
>> Sure
>>
>> charConvertMap.add("!!!","ApostropheApostropheApostrophe");
>> charConvertMap.add("*** ***","StarStarStar");
>> charConvertMap.add("!","Apostrophe");
>>
>> Normally, punctuation gets removed during index and searching which 
>> is what I want for good search
>> results but when the token only contains specific punctuation strings 
>> I don't want to remove the
>> punctuation because it would make it impossible to match, so I 
>> convert it to a textual representation.
>>
>> As it stands in the 3rd case '!' will be preserved wherever it is 
>> found, so to get a good match on
>> 'Wow!' you would have to search for 'Wow!'. But I want you to be able
>> to search for 'Wow' and it
>> return 'Wow!' which is the case if "!" isn't in the char convert map, 
>> but if you searched for '!' I
>> want it to return the token which is just '!' which is only the case 
>> if the value is added to the map.
>>
>> I need to do this because the text we are indexing and searching are 
>> short strings representing an
>> music artist name (there is an artist called !!!)
>>
>> thanks Paul
>>
>>
> Hi Paul,
>
> Still I'm not sure I understand your issue correctly, but if you want:
>
> query="Wow!" result="Wow!"
> query="Wow"  result="Wow!"
> query="!"    result="Wow!"
> query="!!!"  result="!!!"
>
> do the following mappings solve your problem?
> (I assume you use Whitespace-type-Tokenizer here)
>
> charConvertMap.add("!!!","ApostropheApostropheApostrophe");
> charConvertMap.add("!"," Apostrophe");  // note the leading space in " Apostrophe"
>
> Koji
No, your solution would convert every occurrence of '!', which is not
what I want. Also, the list of names I need to do this for is much larger
than the two examples I give here, so you cannot rely on the order they
are added in.

Is it possible to help me with the original question: how do I subclass
MappingCharFilter so that it only changes complete matching tokens?

i.e. if my charConvertMap contained

charConvertMap.add("!!!","ApostropheApostropheApostrophe");

it would convert a token of '!!!' to 'ApostropheApostropheApostrophe', but
a token of 'Hello!!!' would become 'Hello'.

Paul
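
The whole-token behaviour asked for above can be sketched in plain Java,
independent of Lucene. This is only an illustration of the matching rule,
not a MappingCharFilter subclass: the class name WholeTokenMapper, the
method mapWholeTokens and the helper exampleMap are hypothetical, and a
"token" is assumed to be a run of text bounded by whitespace or by the
start/end of the input. A key is replaced only when it spans such a run
completely; when several keys qualify at one position the longest wins, so
insertion order does not matter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: replaces a mapped string only
// when it forms a complete whitespace-delimited unit of the input.
class WholeTokenMapper {

    // The mappings from the thread, gathered here for the demonstration.
    static Map<String, String> exampleMap() {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("!!!", "ApostropheApostropheApostrophe");
        map.put("*** ***", "StarStarStar");
        map.put("!", "Apostrophe");
        return map;
    }

    static String mapWholeTokens(String text, Map<String, String> map) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // A match may only begin at the start of the text or after whitespace.
            boolean atStart = i == 0 || Character.isWhitespace(text.charAt(i - 1));
            String best = null;
            if (atStart) {
                for (String key : map.keySet()) {
                    if (key.isEmpty()) {
                        continue; // an empty key would never advance the cursor
                    }
                    int end = i + key.length();
                    // A match must also stop at the end of the text or before whitespace.
                    boolean atEnd = end == text.length()
                            || (end < text.length() && Character.isWhitespace(text.charAt(end)));
                    if (atEnd && text.startsWith(key, i)
                            && (best == null || key.length() > best.length())) {
                        best = key; // prefer the longest qualifying key
                    }
                }
            }
            if (best != null) {
                out.append(map.get(best));
                i += best.length();
            } else {
                out.append(text.charAt(i)); // copy the character unchanged
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = exampleMap();
        System.out.println(mapWholeTokens("!!!", map));      // ApostropheApostropheApostrophe
        System.out.println(mapWholeTokens("Hello!!!", map)); // Hello!!!
        System.out.println(mapWholeTokens("*** ***", map));  // StarStarStar
    }
}
```

In this sketch 'Hello!!!' passes through untouched, so a later analysis
step that strips punctuation would still reduce it to 'Hello', while a
bare '!!!' is converted to its textual form. Applying such a rule to the
text before it reaches the analyzer is one possible way to get the effect
without modifying MappingCharFilter itself.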



Re: Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(11/01/25 2:14), Paul Taylor wrote:
> On 22/01/2011 15:43, Koji Sekiguchi wrote:
>> (11/01/20 22:19), Paul Taylor wrote:
>>> Trying to extend MappingCharFilter so that it only changes a token if the length of the token
>>> matches the length of singleMatch in NormalizeCharMap (currently the singleMatch just has to be
>>> found in the token; I want it to match the whole token). Can this be done it sounds simple enough but
>>> I cannot make any headway understanding the MappingCharFilter source code
>>>
>>> thanks Paul
>>
>> Paul,
>>
>> Can you give us a concrete input/output (you wanted) with mapping table
>> so that I can understand what you want?
>>
>> Thanks,
>>
>> Koji
> Sure
>
> charConvertMap.add("!!!","ApostropheApostropheApostrophe");
> charConvertMap.add("*** ***","StarStarStar");
> charConvertMap.add("!","Apostrophe");
>
> Normally, punctuation gets removed during index and searching which is what I want for good search
> results but when the token only contains specific punctuation strings I don't want to remove the
> punctuation because it would make it impossible to match, so I convert it to a textual representation.
>
> As it stands in the 3rd case '!' will be preserved wherever it is found, so to get a good match on
> 'Wow!' you would have to search for 'Wow!'. But I want you to be able to search for 'Wow' and it
> return 'Wow!' which is the case if "!" isn't in the char convert map, but if you searched for '!' I
> want it to return the token which is just '!' which is only the case if the value is added to the map.
>
> I need to do this because the text we are indexing and searching are short strings representing an
> music artist name (there is an artist called !!!)
>
> thanks Paul
>
>
Hi Paul,

Still I'm not sure I understand your issue correctly, but if you want:

query="Wow!" result="Wow!"
query="Wow"  result="Wow!"
query="!"    result="Wow!"
query="!!!"  result="!!!"

do the following mappings solve your problem?
(I assume you use Whitespace-type-Tokenizer here)

charConvertMap.add("!!!","ApostropheApostropheApostrophe");
charConvertMap.add("!"," Apostrophe");  // note the leading space in " Apostrophe"

Koji
-- 
http://www.rondhuit.com/en/
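
To see what the two mappings suggested above do to ordinary text, here is
a simplified plain-Java stand-in for substring-based mapping. It is not
Lucene's code: the class SubstringMapper and its methods are hypothetical,
and it assumes that at each position the longest matching key wins and
that keys are matched anywhere in the text, regardless of token
boundaries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for substring-based character mapping; it does not
// use or reproduce any Lucene class.
class SubstringMapper {

    // The two mappings from the suggestion, including the leading space.
    static Map<String, String> suggestedMap() {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("!!!", "ApostropheApostropheApostrophe");
        map.put("!", " Apostrophe");
        return map;
    }

    static String mapSubstrings(String text, Map<String, String> map) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String best = null;
            for (String key : map.keySet()) {
                // Any position may start a match; keep the longest key found.
                if (!key.isEmpty() && text.startsWith(key, i)
                        && (best == null || key.length() > best.length())) {
                    best = key;
                }
            }
            if (best != null) {
                out.append(map.get(best));
                i += best.length();
            } else {
                out.append(text.charAt(i)); // copy the character unchanged
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = suggestedMap();
        System.out.println(mapSubstrings("Wow!", map));     // Wow Apostrophe
        System.out.println(mapSubstrings("!!!", map));      // ApostropheApostropheApostrophe
        System.out.println(mapSubstrings("Hello!!!", map)); // HelloApostropheApostropheApostrophe
    }
}
```

Under these assumptions 'Wow!' is split into 'Wow' plus a separate
'Apostrophe' token by a whitespace tokenizer, but 'Hello!!!' is rewritten
as well, because the match is not tied to token boundaries.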



Re: Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch

Posted by Paul Taylor <pa...@fastmail.fm>.
On 22/01/2011 15:43, Koji Sekiguchi wrote:
> (11/01/20 22:19), Paul Taylor wrote:
>> Trying to extend MappingCharFilter so that it only changes a token if 
>> the length of the token
>> matches the length of singleMatch in NormalizeCharMap (currently the 
>> singleMatch just has to be
>> found in the token; I want it to match the whole token). Can this be
>> done it sounds simple enough but
>> I cannot make any headway understanding the MappingCharFilter source 
>> code
>>
>> thanks Paul
>
> Paul,
>
> Can you give us a concrete input/output (you wanted) with mapping table
> so that I can understand what you want?
>
> Thanks,
>
> Koji
Sure

charConvertMap.add("!!!","ApostropheApostropheApostrophe");
charConvertMap.add("*** ***","StarStarStar");
charConvertMap.add("!","Apostrophe");

Normally, punctuation gets removed during indexing and searching, which is
what I want for good search results. But when the token contains only
specific punctuation strings I don't want to remove the punctuation,
because that would make it impossible to match, so I convert it to a
textual representation.

As it stands, in the 3rd case '!' will be preserved wherever it is found,
so to get a good match on 'Wow!' you would have to search for 'Wow!'. But
I want you to be able to search for 'Wow' and have it return 'Wow!', which
is the case if "!" isn't in the char convert map. If you search for '!',
I want it to return the token which is just '!', which is only the case
if the value is added to the map.

I need to do this because the texts we are indexing and searching are
short strings representing a music artist name (there is an artist
called !!!).

thanks Paul




Re: Trying to extend MappingCharFilter so that it only changes a token if the length of the token matches the length of singleMatch

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(11/01/20 22:19), Paul Taylor wrote:
> Trying to extend MappingCharFilter so that it only changes a token if the length of the token
> matches the length of singleMatch in NormalizeCharMap (currently the singleMatch just has to be
> found in the token; I want it to match the whole token). Can this be done it sounds simple enough but
> I cannot make any headway understanding the MappingCharFilter source code
>
> thanks Paul

Paul,

Can you give us a concrete input/output (you wanted) with mapping table
so that I can understand what you want?

Thanks,

Koji
-- 
http://www.rondhuit.com/en/
