Posted to java-user@lucene.apache.org by Valery <kh...@gmail.com> on 2009/08/20 16:28:08 UTC

Any Tokenizator friendly to C++, C#, .NET, etc ?

Hi all, 

I am trying to tune Lucene to respect tokens like C++, C#, and .NET.

The task is known to the Lucene community, but surprisingly I can't find
any good information on it.

Of course, I tried to re-use Lucene's building blocks for a Tokenizer. Here
we go:

  1) StandardTokenizer -- oh, this option would be just fantastic, but "C++,
C#, .NET" ends up as "c c net". Too bad.

  2) WhitespaceTokenizer gives me a lot of lexemes that actually should
have been chopped into smaller pieces. Example: "C/C++" comes out as a
single lexeme. If I follow this way I end up with "Tokenization of tokens" --
and that sounds a bit odd, doesn't it?

  3) CharTokenizer allows me to add '/' as a token-emitting
char, but then '/' gets immediately lost, just like the whitespace chars. As a
result, "SAP R/3" ends up as "SAP" "R" "3", and one will need to search the
original char stream for the '/' char to re-build the "SAP R/3" term as a
whole.

Do you see any other relevant building blocks that I have missed?

Also, people out there have suggested that such a problem should be solved
with a synonym dictionary. However, this hint sheds no light on which
tokenization strategy would be appropriate *before* the synonym step.

So, it looks like I have to take the class CharTokenizer as the starting
point and write my own Tokenizer anew. This Tokenizer should also react to
delimiting characters and emit the token. However, it should distinguish
between delimiters like whitespace along with ";,?" and delimiters like
"./&".

Indeed, delimiters like whitespace and ";,?" should be thrown away at the
lexeme level, whereas the token-emitting characters like "./&" should be
kept at the lexeme level. A rough sketch of what I have in mind follows.
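
Here is a rough, untested sketch of such a Tokenizer (written against the
old next() API of Lucene 2.4; the class name and the exact delimiter sets
are only for illustration):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class DelimiterAwareTokenizer extends Tokenizer {
  // delimiters that are thrown away (never reach the lexeme level)
  private static final String DROPPED = " \t\r\n;,?";
  // delimiters that end a token but are kept as one-char tokens
  private static final String EMITTED = "./&";

  private int offset = 0;          // offset of the next char to be read
  private int pending = -1;        // kept delimiter waiting to be emitted
  private int pendingOffset = -1;

  public DelimiterAwareTokenizer(Reader input) {
    super(input);
  }

  public Token next() throws IOException {
    if (pending != -1) {           // emit the delimiter buffered last call
      Token t = new Token(String.valueOf((char) pending),
                          pendingOffset, pendingOffset + 1);
      pending = -1;
      return t;
    }
    StringBuilder sb = new StringBuilder();
    int start = -1;
    int c;
    while ((c = input.read()) != -1) {
      int pos = offset++;
      if (DROPPED.indexOf(c) >= 0) {
        if (sb.length() > 0) break;         // token done, delimiter dropped
      } else if (EMITTED.indexOf(c) >= 0) {
        if (sb.length() == 0) {             // the delimiter is a token itself
          return new Token(String.valueOf((char) c), pos, pos + 1);
        }
        pending = c;                        // token done, delimiter kept
        pendingOffset = pos;
        break;
      } else {
        if (sb.length() == 0) start = pos;
        sb.append((char) c);
      }
    }
    if (sb.length() == 0) return null;      // end of input
    return new Token(sb.toString(), start, start + sb.length());
  }
}

With something like this, "SAP R/3" would come out as { "SAP", "R", "/",
"3" }, so a downstream filter can still see the '/'.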

Your comments, gurus?

regards,
Valery



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Valery <kh...@gmail.com>.
Hi Ken, 

thanks for the comments. Well, Terence's ANTLR was and is a good piece of
work.

Do you mean that you use ANTLR to generate a Tokenizer (a lexeme parser),

or

did you proceed even further and use ANTLR to generate higher-level parsers
that replace Lucene's TokenFilters?

or maybe even both?..

regards,
Valery



Ken Krugler wrote:
> 
> Hi Valery,
> 
> From our experience at Krugle, we wound up having to create our own
> tokenizers (actually a kind of specialized parser) for the different
> languages. It didn't seem like a good option to try to twist one of
> the existing tokenizers into something that would work well enough. We
> wound up using ANTLR for this.
> 
> -- Ken

For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Valery,

From our experience at Krugle, we wound up having to create our own
tokenizers (actually a kind of specialized parser) for the different
languages. It didn't seem like a good option to try to twist one of
the existing tokenizers into something that would work well enough. We
wound up using ANTLR for this.

-- Ken



--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378




Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Simon Willnauer <si...@googlemail.com>.
On Fri, Aug 21, 2009 at 2:18 PM, Valery<kh...@gmail.com> wrote:
>
>
> Simon Willnauer wrote:
>>
>> I already responded... again...
>>
> sorry, I was busy answering and saw your post right after sending.
>
>
> Simon Willnauer wrote:
>>
>> Tokenizer splits the input stream into tokens (Token.java) and
>> TokenFilter subclasses operate on those. I expect from a Tokenizer
>> that it provides me a stream of tokens :) - how those tokens are
>> created is the responsibility of the Tokenizer.
>
> According to your requirements:
>
>  * one programmer will write a simplistic Tokenizer that converts the whole
> char input into one huge token.
>
>  * another programmer will write a simplistic Tokenizer that converts each
> single char of the input into a 1-char token. That ends up as a huge
> number of 1-char tokens.
>
> Moreover, both claim the job is done in a brilliant way, because the
> Tokenizer is based on a 1-line statement in Java...
>
> Who did the work better?
>
> That said, I'd love to hear more specific requirements about the Tokenizer
> to avoid the above odd deliveries :)
The answer is again "it depends". If you need two tokenizers, one creating
tokens by dividing at non-letters and another one dividing at whitespace,
then a Tokenizer that outputs every single char is a good common superclass
for those two.
See LetterTokenizer / WhitespaceTokenizer and their common superclass
CharTokenizer. A minimal subclass is sketched below.
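
For instance, a minimal CharTokenizer subclass along those lines might look
like this (a sketch only; the class name is made up, and note that it still
*drops* the delimiters, which is exactly your complaint about CharTokenizer):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

public class NonDelimiterTokenizer extends CharTokenizer {
  public NonDelimiterTokenizer(Reader input) {
    super(input);
  }
  // every char that is not whitespace and not one of ";,?" belongs to a token
  protected boolean isTokenChar(char c) {
    return !Character.isWhitespace(c) && ";,?".indexOf(c) < 0;
  }
}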

Asking the question of who did a better job is not valid without
specifying the requirements. Anyway, does WhitespaceTokenizer solve
your problem?!
As Robert said, have a look at the smartcn stuff; this is the other
extreme. It always depends.

simon
>
> regards
> Valery



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Valery <kh...@gmail.com>.

Simon Willnauer wrote:
> 
> I already responded... again...
> 
sorry, I was busy answering and saw your post right after sending.


Simon Willnauer wrote:
> 
> Tokenizer splits the input stream into tokens (Token.java) and
> TokenFilter subclasses operate on those. I expect from a Tokenizer
> that it provides me a stream of tokens :) - how those tokens are
> created is the responsibility of the Tokenizer.

According to your requirements:

 * one programmer will write a simplistic Tokenizer that converts the whole
char input into one huge token.

 * another programmer will write a simplistic Tokenizer that converts each
single char of the input into a 1-char token. That ends up as a huge
number of 1-char tokens.

Moreover, both claim the job is done in a brilliant way, because the
Tokenizer is based on a 1-line statement in Java...

Who did the work better?

That said, I'd love to hear more specific requirements about the Tokenizer
to avoid the above odd deliveries :)

regards
Valery





Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Simon Willnauer <si...@googlemail.com>.
On Fri, Aug 21, 2009 at 12:51 PM, Valery<kh...@gmail.com> wrote:
>
> Hi John,
>
> (aren't you the same John Byrne who is a key contributor to the great
> OpenSSI project?)
>
>
> John Byrne-3 wrote:
>>
>> I'm inclined to disagree with the idea that a token should not be split
>> again downstream. I think that is actually a much easier way to handle
>> it. I would have the tokenizer return the longest match, and then split
>> it in a token filter. In fact I have done this before and it has worked
>> fine for me.
>>
>
> well, I could soften my position: if the token re-parsing is done by
> looking into the current lexeme's value only, then it might perhaps be
> accepted. In contrast, if during your re-parsing you have to look into the
> upstream character data "several filters backwards", then, IMHO, it is
> rather messy and unacceptable.
>
>
> Regarding this part:
>
> John Byrne-3 wrote:
>>
>> I think you will have to maintain some state within the token filter
>> [...]
>>
>
> I would wait for Simon's answer to the question "What do you expect from the
> Tokenizer?"
>
I already responded...
again...
<snip>
Well, Tokenizer and TokenFilter are both subclasses of TokenStream,
while their inputs differ. A Tokenizer gets the input from
a reader and creates Tokens from this input. A TokenFilter uses the
tokens created by the Tokenizer and does further processing. For
instance, an Analyzer that uses WhitespaceTokenizer as the input for
LowerCaseFilter would produce the following:


Input:  C# or .NET

WhitespaceTokenizer:
[Tokenstring: "C#"; offset: 0->2; pos: 1]
[Tokenstring: "or"; offset: 3->5; pos: 2]
[Tokenstring: ".Net"; offset: 6->10; pos: 3]
LowerCaseFilter:
[Tokenstring: "c#"; offset: 0->2; pos: 1]
[Tokenstring: "or"; offset: 3->5; pos: 2]
[Tokenstring: ".net"; offset: 6->10; pos: 3]

if you want to do any further processing with those tokens, you can add
your own TokenFilter and modify the tokens as you need. You could do
the whole job in a Tokenizer, but this would not be a good separation
of concerns, right!?
</snip>

Tokenizer splits the input stream into tokens (Token.java) and
TokenFilter subclasses operate on those. I expect from a Tokenizer
that it provides me a stream of tokens :) - how those tokens are
created is the responsibility of the Tokenizer. To lowercase, remove
stopwords, add payloads etc. is the job of the TokenFilter.

simon

> Then I will give my 2 cents on this, and perhaps then I can sum up all
> opinions and draw a common conclusion.
> :)
>
> regards
> Valery



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by John Byrne <jo...@propylon.com>.
Valery wrote:
> Hi John, 
>
> (aren't you the same John Byrne who is a key contributor to the great
> OpenSSI project?)
>   
Nope, never heard of him! But with a great name like that I'm sure he'll 
go a long way :)
>
> John Byrne-3 wrote:
>   
>> I'm inclined to disagree with the idea that a token should not be split 
>> again downstream. I think that is actually a much easier way to handle 
>> it. I would have the tokenizer return the longest match, and then split 
>> it in a token filter. In fact I have done this before and it has worked
>> fine for me.
>>
>>     
>
> well, I could soften my position: if the token re-parsing is done by
> looking into the current lexeme's value only, then it might perhaps be
> accepted. In contrast, if during your re-parsing you have to look into the
> upstream character data "several filters backwards", then, IMHO, it is
> rather messy and unacceptable.
>   
If I understand you correctly, that's pretty much what I meant. By 
having the first tokenizer pass larger tokens, and splitting them in the 
filter, you never have to look upstream while storing state. You only 
look upstream for a new token after you are finished splitting the last 
one and sending the parts downstream.

>
> Regarding this part:
>
> John Byrne-3 wrote:
>   
>> I think you will have to maintain some state within the token filter 
>> [...]
>>
>>     
>
> I would wait for Simon's answer to the question "What do you expect from the
> Tokenizer?"
>
> Then I will give my 2cents on this and perhaps then I could sum up all
> opinions and adopt a common conclusion.
> :)
>
> regards
> Valery
>




Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Valery <kh...@gmail.com>.
Hi John, 

(aren't you the same John Byrne who is a key contributor to the great
OpenSSI project?)


John Byrne-3 wrote:
> 
> I'm inclined to disagree with the idea that a token should not be split 
> again downstream. I think that is actually a much easier way to handle 
> it. I would have the tokenizer return the longest match, and then split 
> it in a token filter. In fact I have done this before and it has worked
> fine for me.
> 

well, I could soften my position: if the token re-parsing is done by
looking into the current lexeme's value only, then it might perhaps be
accepted. In contrast, if during your re-parsing you have to look into the
upstream character data "several filters backwards", then, IMHO, it is
rather messy and unacceptable.


Regarding this part:

John Byrne-3 wrote:
> 
> I think you will have to maintain some state within the token filter 
> [...]
> 

I would wait for Simon's answer to the question "What do you expect from the
Tokenizer?"

Then I will give my 2 cents on this, and perhaps then I can sum up all
opinions and draw a common conclusion.
:)

regards
Valery





Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by John Byrne <jo...@propylon.com>.
Hi Valery,

I'm inclined to disagree with the idea that a token should not be split
again downstream. I think that is actually a much easier way to handle
it. I would have the tokenizer return the longest match, and then split
it in a token filter. In fact I have done this before and it has worked
fine for me.

I think you will have to maintain some state within the token filter
either way - think about how you'd do that in each case:

- To split longer tokens, you just have to split the token into a
list of sub-tokens, then temporarily store these, and return them on
subsequent calls to the filter. When the list is empty, you get another
token from the tokenizer.

- To join up shorter tokens, you would have to basically do what the
original tokenizer did - try to match sequences of characters to patterns.

The second way sounds harder to me, and it's the job of the original
tokenizer anyway. Do the simplest thing that could possibly work! A rough
sketch of the splitting filter follows.
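
Here is a rough, untested sketch of that splitting filter (using the old
next() API; the class name is made up, and a real version would also have
to handle position increments):

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SlashSplitFilter extends TokenFilter {
  // sub-tokens of the last split token, returned on subsequent calls
  private final LinkedList<Token> parts = new LinkedList<Token>();

  public SlashSplitFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (!parts.isEmpty()) {
      return parts.removeFirst();      // the state kept between calls
    }
    Token token = input.next();
    if (token == null) return null;    // end of stream
    String text = token.term();
    if (text.indexOf('/') < 0) {
      return token;                    // nothing to split
    }
    int base = token.startOffset();
    int pos = 0;
    for (String piece : text.split("/")) {
      if (piece.length() > 0) {
        parts.add(new Token(piece, base + pos, base + pos + piece.length()));
      }
      pos += piece.length() + 1;       // +1 skips the '/' itself
    }
    return parts.isEmpty() ? null : parts.removeFirst();
  }
}

So for "C/C++" the tokenizer passes the whole thing through, and this
filter then emits "C" and "C++" on successive calls.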

Regards,
-John

Valery wrote:
> Hi Robert, 
>
> so, would you expect a Tokenizer to consider '/' in both cases as a separate
> Token?
>
> Personally, I see no problem if the Tokenizer would do the following job:
>
> "C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"} 
> and come up with "C" and "C++" tokens after processing through the
> downstream TokenFilters.
>
> Similarly:
>
> "SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"} 
> and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.
>
> I try to follow the principle that a token (or its lexeme) usually should
> never be parsed again. One can build more complex (compound) things from
> the tokens. However, usually one never chops a lexeme into smaller pieces.
>
> What do you think, Robert?
>
> regards,
> Valery
>




Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Robert Muir <rc...@gmail.com>.
Valery,

FWIW, to answer this question: I think the answer is still "it depends".
I agree with John; I think it is much easier for your tokenizer to
create tokens that contain all the context you need for the downstream
filters to do their job.
I don't think you can put an exact specification on what this is; it
really does depend on a lot of things, mostly the way that you need to
handle text for your application.

For an extreme example, the Tokenizer in lucene contrib's
SmartChineseAnalyzer (SentenceTokenizer) actually outputs entire
phrases as tokens.
This is because the downstream tokenfilter (WordTokenFilter) needs
that kind of context to subdivide the phrases into words.

On Fri, Aug 21, 2009 at 7:45 AM, Valery<kh...@gmail.com> wrote:
>
>
> Simon Willnauer wrote:
>>
>> you could do
>> the whole job in a Tokenizer but this would not be a good separation
>> of concerns right!?
>>
>
> right, it wouldn't be a good separation of concerns.
> That's why I wanted to know what you consider to be the "Tokenizer's job".



-- 
Robert Muir
rcmuir@gmail.com



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Valery <kh...@gmail.com>.

Simon Willnauer wrote:
> 
> you could do
> the whole job in a Tokenizer but this would not be a good separation
> of concerns right!?
> 

right, it wouldn't be a good separation of concerns.
That's why I wanted to know what you consider to be the "Tokenizer's job".






Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Simon Willnauer <si...@googlemail.com>.
On Fri, Aug 21, 2009 at 10:26 AM, Valery<kh...@gmail.com> wrote:
>
> Hi Simon,
>
>
> Simon Willnauer wrote:
>>
>> Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and
>> [...]?!
>>
>> simon
>>
>
> yes, I did, please find the info in the initial message. Here are the
> excerpts:
>
>
> Valery wrote:
>>
>>   2) WhitespaceTokenizer gives me a lot of lexemes that actually should
>> have been chopped into smaller pieces. Example: "C/C++" comes out as a
>> single lexeme. If I follow this way I end up with "Tokenization of tokens"
>> -- and that sounds a bit odd, doesn't it?
>>
>>   3) CharTokenizer allows me to add '/' as a token-emitting
>> char, but then '/' gets immediately lost, just like the whitespace chars.
>> As a result, "SAP R/3" ends up as "SAP" "R" "3", and one will need to
>> search the original char stream for the '/' char to re-build the
>> "SAP R/3" term as a whole.
>
> regarding this part:
>
>
> Simon Willnauer wrote:
>>
>> Valery, have you tried to [...] and do any further processing in a custom
>> TokenFilter?!
>> simon
>>
>
> yes, and that's why I have sent the initial post "Any Tokenizator friendly
> to C++, C#, .NET, etc ?"
> Actually, I am a bit confused about doing a Tokenizer's job in filters and
> re-parsing the char stream.
>
> Simon, what do you expect from the Tokenizer?
> (In other words, what job is exclusively the "Tokenizer's job" and should
> not be done in downstream filters?)

Well, Tokenizer and TokenFilter are both subclasses of TokenStream,
while their inputs differ. A Tokenizer gets the input from
a reader and creates Tokens from this input. A TokenFilter uses the
tokens created by the Tokenizer and does further processing. For
instance, an Analyzer that uses WhitespaceTokenizer as the input for
LowerCaseFilter would produce the following:


Input:  C# or .NET

WhitespaceTokenizer:
[Tokenstring: "C#"; offset: 0->2; pos: 1]
[Tokenstring: "or"; offset: 3->5; pos: 2]
[Tokenstring: ".Net"; offset: 6->10; pos: 3]
LowerCaseFilter:
[Tokenstring: "c#"; offset: 0->2; pos: 1]
[Tokenstring: "or"; offset: 3->5; pos: 2]
[Tokenstring: ".net"; offset: 6->10; pos: 3]

if you want to do any further processing with those tokens, you can add
your own TokenFilter and modify the tokens as you need. You could do
the whole job in a Tokenizer, but this would not be a good separation
of concerns, right!? A minimal Analyzer wiring this chain is sketched below.
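
In code, that chain is just something like this (a minimal sketch; the
class name is made up):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class WhitespaceLowerCaseAnalyzer extends Analyzer {
  // the Tokenizer creates the tokens, the TokenFilter post-processes them
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }
}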

simon

>
> regards,
> Valery



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Valery <kh...@gmail.com>.
Hi Simon,


Simon Willnauer wrote:
> 
> Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and
> [...]?!
> 
> simon
> 

yes, I did, please find the info in the initial message. Here are the
excerpts:


Valery wrote:
> 
>   2) WhitespaceTokenizer gives me a lot of lexemes that actually should
> have been chopped into smaller pieces. Example: "C/C++" comes out as a
> single lexeme. If I follow this way I end up with "Tokenization of tokens"
> -- and that sounds a bit odd, doesn't it?
> 
>   3) CharTokenizer allows me to add '/' as a token-emitting
> char, but then '/' gets immediately lost, just like the whitespace chars.
> As a result, "SAP R/3" ends up as "SAP" "R" "3", and one will need to
> search the original char stream for the '/' char to re-build the
> "SAP R/3" term as a whole.

regards,
Valery



Simon Willnauer wrote:
> 
> Valery, have you tried to [...] and do any further processing in a custom
> TokenFilter?!
> simon
> 

yes, and that's why I have sent the initial post "Any Tokenizator friendly
to C++, C#, .NET, etc ?"

Simon, what do you expect from the Tokenizer?
(In other words, what job is exclusively the "Tokenizer's job" and should
not be done in downstream filters?)

regards, 
Valery





Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Simon Willnauer <si...@googlemail.com>.
Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and
do any further processing in a custom TokenFilter?!

simon

On Thu, Aug 20, 2009 at 8:48 PM, Robert Muir<rc...@gmail.com> wrote:
> Valery, I think it all depends on how you want your search to work.
>
> When I say this, I mean, for example: if a document only contains "C++",
> do you want searches on just "C" to match or not?
>
> Another thing I would suggest is to take a look at the capabilities of
> Solr: it has some analysis stuff that might be beneficial for your
> needs.
> The wiki page is here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Robert Muir <rc...@gmail.com>.
Valery, I think it all depends on how you want your search to work.

When I say this, I mean, for example: if a document only contains "C++",
do you want searches on just "C" to match or not?

Another thing I would suggest is to take a look at the capabilities of
Solr: it has some analysis stuff that might be beneficial for your
needs.
The wiki page is here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


On Thu, Aug 20, 2009 at 1:46 PM, Valery<kh...@gmail.com> wrote:
>
> Hi Robert,
>
> so, would you expect a Tokenizer to consider '/' in both cases as a separate
> Token?
>
> Personally, I see no problem if the Tokenizer would do the following job:
>
> "C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"}
> and come up with "C" and "C++" tokens after processing through the
> downstream TokenFilters.
>
> Similarly:
>
> "SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"}
> and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.
>
> I try to follow the principle that a token (or its lexeme) usually should
> never be parsed again. One can build more complex (compound) things from
> the tokens. However, usually one never chops a lexeme into smaller pieces.
>
> What do you think, Robert?
>
> regards,
> Valery



-- 
Robert Muir
rcmuir@gmail.com



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Valery <kh...@gmail.com>.
Hi Robert, 

so, would you expect a Tokenizer to consider '/' in both cases as a separate
Token?

Personally, I see no problem if the Tokenizer would do the following job:

"C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"} 
and come up with "C" and "C++" tokens after processing through the
downstream TokenFilters.

Similarly:

"SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"} 
and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.

I try to follow the principle that a token (or its lexeme) usually should
never be parsed again. One can build more complex (compound) things from
the tokens. However, usually one never chops a lexeme into smaller pieces.
A rough sketch of a recombining filter follows.
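
For illustration, here is a rough, untested sketch of that recombining step
(old next() API; the class name is made up). It glues a token to
immediately following "+" or "#" tokens when their offsets are adjacent,
rebuilding compounds like "C++" and "C#":

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class CompoundJoinFilter extends TokenFilter {
  private Token buffered;    // lookahead token not yet handed out

  public CompoundJoinFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token token = (buffered != null) ? buffered : input.next();
    buffered = null;
    if (token == null) return null;
    StringBuilder sb = new StringBuilder(token.term());
    int start = token.startOffset();
    int end = token.endOffset();
    Token following;
    // absorb directly adjacent '+' / '#' tokens into the current token
    while ((following = input.next()) != null
        && following.startOffset() == end
        && ("+".equals(following.term()) || "#".equals(following.term()))) {
      sb.append(following.term());
      end = following.endOffset();
    }
    buffered = following;    // first token that was not part of the compound
    return new Token(sb.toString(), start, end);
  }
}

With the tokenizer emitting { "C", "/", "C", "+", "+" } for "C/C++", this
filter passes "C" and "/" through unchanged and folds the trailing pieces
into "C++".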

What do you think, Robert?

regards,
Valery





Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Robert Muir <rc...@gmail.com>.
Valery, oh I think there might be other ways to solve this.

But you provided some examples such as C/C++ and SAP R/3.
In these two examples you want the "/" to behave differently depending
upon context, so my first thought was that a grammar might be a good
way to ensure it does what you want.

On Thu, Aug 20, 2009 at 11:09 AM, Valery<kh...@gmail.com> wrote:
>
> Hi Robert,
>
> thanks for the hint.
>
> Indeed, a natural way to go. Especially if one builds a Tokenizer of the
> same level of quality as StandardTokenizer's.
>
> OTOH, you mean that the out-of-the-box stuff is indeed not customizable for
> this task?..
>
> regards
> Valery



-- 
Robert Muir
rcmuir@gmail.com



Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Valery <kh...@gmail.com>.
Hi Robert, 

thanks for the hint. 

Indeed, a natural way to go. Especially if one builds a Tokenizer of the
same level of quality as StandardTokenizer's.

OTOH, you mean that the out-of-the-box stuff is indeed not customizable for
this task?..

regards
Valery



Robert Muir wrote:
> 
> Valery,
> 
> One thing you could try would be to create a JFlex-based tokenizer,
> specifying a grammar with the rules you want.
> You could use the source code & grammar of StandardTokenizer as a
> starting point.
> 
> 





Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Posted by Robert Muir <rc...@gmail.com>.
Valery,

One thing you could try would be to create a JFlex-based tokenizer,
specifying a grammar with the rules you want.
You could use the source code & grammar of StandardTokenizer as a
starting point.
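
For example, the rules might look something like this (a sketch only, not
the real StandardTokenizer grammar; the macro names, token-type constants,
and all of the surrounding JFlex boilerplate are made up or omitted):

LETTER   = [a-zA-Z]
DIGIT    = [0-9]
// C++, C#, F#, ... : a word immediately followed by "++" or "#"
PROGLANG = {LETTER}+ ("++" | "#")
// .NET, .com, ... : a dot immediately followed by a word
DOTWORD  = "." {LETTER}+
ALPHANUM = ({LETTER} | {DIGIT})+

%%

{PROGLANG} { return PROGLANG_TYPE; }
{DOTWORD}  { return DOTWORD_TYPE; }
{ALPHANUM} { return ALPHANUM_TYPE; }

Because JFlex prefers the longest match, "C++" would match PROGLANG as a
whole rather than being split at the '+'.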





-- 
Robert Muir
rcmuir@gmail.com
