Posted to java-user@lucene.apache.org by Madhu Satyanarayana Panitini <Ma...@pass-consulting.com> on 2005/09/13 11:45:42 UTC

Splitting of words

Hi all,

I want to know how Lucene splits text before indexing. Does it split
only wherever there is a space between words, or is there some other
pattern for splitting the words of a text document? In which source
file can I find the code that does the splitting?
 
Madhu
 
Madhu Satyanarayana. Panitini
PASS GCA Solution Centre Pvt Ltd.
601 Aditya Trade Centre, Ameerpet, 
Hyderabad, India. 
www.pass-consulting.com 


 

Re: Splitting of words

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Sep 27, 2005, at 6:29 AM, Endre Stølsvik wrote:

> On Thu, 22 Sep 2005, Erik Hatcher wrote:
>
> | On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:
> |
> | > | The StandardTokenizer is the most sophisticated one built into Lucene.  You
> | > | can see the types of tokens it emits by looking at the javadoc here:
> | > |    <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
> | > |
> | > | It recognizes e-mail addresses, interior apostrophe words (like o'clock),
> | > | hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.
> | >
> | > It would be great if it also split "UpperCamelCase" and
> | > "lowerCamelCase" words into both the separate words and the one long word.
> | > A run of several uppercase letters followed by lowercase would probably be
> | > best handled like HTTPUnit -> http unit.
> | >  This is, for my part, of course due to Java language influence. But I
> | > believe it is the custom in many programming languages to use lowerCamelCase
> | > for e.g. variable names. Filenames too.
> |
> | I strongly disagree.  It would not be good at all for StandardTokenizer to do
> | this.
>
> ...
>
> | It is important to design filters and tokenizers in the most single-purpose
> | way to allow them to be combined for various scenarios.
>
> Okay. Why? Just wondering what the reasoning behind this is? What is the
> logic behind the StandardTokenizer as it stands? (Note: There are strong
> reasons to believe that I'm just not quite up to speed on how this all
> fits together..!)

The StandardTokenizer is a general purpose tokenizer designed to  
split text not just at whitespace boundaries, but also to keep CJK  
characters separate, e-mail addresses as a unit, and to deal with  
common things like part numbers where alphabetic and numeric  
characters are mixed.  It's a good tokenizer to start with and evolve  
from there.

The StandardTokenizer does more than certain scenarios demand, and
less than other situations require - it's a nice happy medium.

     Erik

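As a rough illustration of what Erik describes - going beyond a whitespace split so that e-mail addresses and dotted hostnames survive as single tokens - here is a plain-Java sketch. The regex and class name are illustrative only; they are not Lucene's actual tokenizer grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {
    // Alternation order matters: try the "special" token shapes
    // (e-mail address, dotted hostname) before plain words.
    private static final Pattern TOKEN = Pattern.compile(
        "\\w+(?:[.\\-]\\w+)*@\\w+(?:\\.\\w+)+"  // e-mail address
        + "|\\w+(?:\\.\\w+)+"                    // hostname like lucene.apache.org
        + "|\\w+(?:'\\w+)*"                      // word, incl. interior apostrophe (o'clock)
    );

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Keeps erik@example.com, lucene.apache.org, and o'clock whole.
        System.out.println(tokenize(
            "Mail erik@example.com about lucene.apache.org at 5 o'clock."));
    }
}
```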

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Splitting of words

Posted by Endre Stølsvik <En...@Stolsvik.com>.
On Thu, 22 Sep 2005, Erik Hatcher wrote:

| 
| On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:
| 
| > 
| > | The StandardTokenizer is the most sophisticated one built into Lucene.
| > You
| > | can see the types of tokens it emits by looking at the javadoc here:
| > |
| > <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
| > |
| > | It recognizes e-mail addresses, interior apostrophe words (like o'clock),
| > | hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK
| > characters.
| > 
| > It would be great if it also split "UpperCamelCase" and
| > "lowerCamelCase" words into both the separate words and the one long word.
| > A run of several uppercase letters followed by lowercase would probably be
| > best handled like HTTPUnit -> http unit.
| >  This is, for my part, of course due to Java language influence. But I
| > believe it is the custom in many programming languages to use lowerCamelCase
| > for e.g. variable names. Filenames too.
| 
| I strongly disagree.  It would not be good at all for StandardTokenizer to do
| this. 

...

|
| It is important to design filters and tokenizers in the most single-purpose
| way to allow them to be combined for various scenarios.

Okay. Why? Just wondering what the reasoning behind this is? What is the 
logic behind the StandardTokenizer as it stands? (Note: There are strong 
reasons to believe that I'm just not quite up to speed on how this all 
fits together..!)

| It would be easy to write a CamelCaseSplitFilter that could be used in 
| conjunction with any tokenizer.

Thanks for the tip!

Regards,
Endre

Re: Splitting of words

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:

>
> | The StandardTokenizer is the most sophisticated one built into Lucene.  You
> | can see the types of tokens it emits by looking at the javadoc here:
> |    <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
> |
> | It recognizes e-mail addresses, interior apostrophe words (like o'clock),
> | hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.
>
> It would be great if it also split "UpperCamelCase" and
> "lowerCamelCase" words into both the separate words and the one long word.
> A run of several uppercase letters followed by lowercase would probably be
> best handled like HTTPUnit -> http unit.
>   This is, for my part, of course due to Java language influence. But I
> believe it is the custom in many programming languages to use lowerCamelCase
> for e.g. variable names. Filenames too.

I strongly disagree.  It would not be good at all for StandardTokenizer
to do this.  It would be easy to write a CamelCaseSplitFilter that could
be used in conjunction with any tokenizer.

It is important to design filters and tokenizers in the most
single-purpose way to allow them to be combined for various scenarios.

If such a filter is contributed, I'd happily add it to
contrib/analyzers - seems useful to have around.

     Erik

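The CamelCaseSplitFilter Erik mentions is hypothetical - he is proposing it, not describing something in Lucene. The core splitting rule it would need could be sketched in plain Java like this (class name and regex are illustrative only; a real Lucene filter would also have to emit tokens through the TokenFilter API):

```java
import java.util.ArrayList;
import java.util.List;

public class CamelCaseSplit {
    // Split before an uppercase letter that follows a lowercase letter
    // (fooBar -> foo|Bar), or before the last uppercase letter of an
    // uppercase run that is followed by lowercase (HTTPUnit -> HTTP|Unit),
    // then lowercase the parts, as in Endre's "HTTPUnit -> http unit".
    public static List<String> split(String token) {
        String[] parts = token.split(
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");
        List<String> result = new ArrayList<>();
        for (String p : parts) {
            result.add(p.toLowerCase());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("HTTPUnit"));       // [http, unit]
        System.out.println(split("lowerCamelCase")); // [lower, camel, case]
    }
}
```

A full filter would likely also keep the original unsplit token, as Endre suggests ("both the separate words and the one long word").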



Re: Splitting of words

Posted by Endre Stølsvik <En...@Stolsvik.com>.
| The StandardTokenizer is the most sophisticated one built into Lucene.  You
| can see the types of tokens it emits by looking at the javadoc here:
|    <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
| 
| It recognizes e-mail addresses, interior apostrophe words (like o'clock),
| hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.

It would be great if it also split "UpperCamelCase" and
"lowerCamelCase" words into both the separate words and the one long word.
A run of several uppercase letters followed by lowercase would probably be
best handled like HTTPUnit -> http unit.
  This is, for my part, of course due to Java language influence. But I
believe it is the custom in many programming languages to use lowerCamelCase
for e.g. variable names. Filenames too.

Regards,
Endre.



Re: Splitting of words

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Sep 13, 2005, at 7:24 AM, Madhu Satyanarayana Panitini wrote:

> Hi Paul,
>
> I agree with you: "Analyzer is the magic word."
> Let's look at it in depth. I would consider three parts in the
> analyzer:
>
> 1. Tokenization (splitting of words)
> 2. Stopword removal (depends upon the language)
> 3. Stemming of the words (depends upon the language)
>
> To start the analysis we have to split the text. For example, I like to
> split the text wherever I find the following non-alphabetic characters:
> "\s+|;|:|<|>|\^|~|=|--+|\+|\?|!|&|\$|@|\#|\'|`|"|_|\%|\*|,|\."
> That means I would like to split the text wherever I find a
> space, :, ;, ", ', <, >, ?, etc.
>
> Then we remove the stopwords, and then stemming follows.
>
> Coming to my question, which is clear now: how does Lucene split the
> text? Only wherever it encounters a space between words, or does it
> consider non-alphabetic characters as well?
>
> What is the whole grammar the StandardAnalyzer uses to split the words?

Madhu - you'd do well to try out the AnalyzerDemo that comes with the
"Lucene in Action" code.  You can download it from
http://www.lucenebook.com - here's an example run:

$ ant AnalyzerDemo

     ...

AnalyzerDemo:
      [echo]
      [echo]       Demonstrates analysis of sample text.
      [echo]
      [echo]       Refer to the "Analysis" chapter for much more on this
      [echo]       extremely crucial topic.
      [echo]
     [input] Press return to continue...

     [input] String to analyze: [This string will be analyzed.]

      [echo] Running lia.analysis.AnalyzerDemo...
      [java] Analyzing "This string will be analyzed."
      [java]   WhitespaceAnalyzer:
      [java]     [This] [string] [will] [be] [analyzed.]

      [java]   SimpleAnalyzer:
      [java]     [this] [string] [will] [be] [analyzed]

      [java]   StopAnalyzer:
      [java]     [string] [analyzed]

      [java]   StandardAnalyzer:
      [java]     [this] [string] [will] [be] [analyzed]

      [java]   SnowballAnalyzer:
      [java]     [this] [string] [will] [be] [analyz]

      [java]   SnowballAnalyzer:
      [java]     [this] [string] [wil] [be] [analyzed]

      [java]   SnowballAnalyzer:
      [java]     [thi] [string] [will] [be] [analyz]


BUILD SUCCESSFUL
Total time: 13 seconds
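The StopAnalyzer line in the output above (only [string] [analyzed] survive) can be mimicked outside Lucene. Here is a rough plain-Java sketch, assuming a tiny stopword list - Lucene's real default English list is longer, and its actual tokenization rules differ:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopAnalyzerSketch {
    // A small illustrative stopword list, not Lucene's default set.
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("this", "will", "be", "a", "the"));

    // Lowercase, split on non-letters, drop stopwords - roughly the
    // behavior shown by the StopAnalyzer line in the demo output.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This string will be analyzed."));
        // prints [string, analyzed]
    }
}
```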

The StandardTokenizer is the most sophisticated one built into
Lucene.  You can see the types of tokens it emits by looking at the
javadoc here:
     <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>

It recognizes e-mail addresses, interior apostrophe words (like
o'clock), hostnames/IP addresses (like lucene.apache.org), acronyms,
and CJK characters.

     Erik




RE: Splitting of words

Posted by Madhu Satyanarayana Panitini <Ma...@pass-consulting.com>.
Hi Paul,

I agree with you: "Analyzer is the magic word."
Let's look at it in depth. I would consider three parts in the
analyzer:

1. Tokenization (splitting of words)
2. Stopword removal (depends upon the language)
3. Stemming of the words (depends upon the language)

To start the analysis we have to split the text. For example, I like to
split the text wherever I find the following non-alphabetic characters:
"\s+|;|:|<|>|\^|~|=|--+|\+|\?|!|&|\$|@|\#|\'|`|"|_|\%|\*|,|\."
That means I would like to split the text wherever I find a
space, :, ;, ", ', <, >, ?, etc.

Then we remove the stopwords, and then stemming follows.

Coming to my question, which is clear now: how does Lucene split the
text? Only wherever it encounters a space between words, or does it
consider non-alphabetic characters as well?

What is the whole grammar the StandardAnalyzer uses to split the words?
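The delimiter pattern above can be tried directly with Java's String.split. Here is a sketch with the delimiters folded into a single character class - a simplification: it also splits on single hyphens, unlike the --+ alternative in the original pattern, and it is not what Lucene itself does.

```java
public class SplitDemo {
    // Split wherever one or more delimiter characters occur.  The class
    // below is a simplified, folded form of the pattern quoted above.
    public static String[] split(String text) {
        return text.split("[\\s;:<>^~=+?!&$@#'`\"_%*,.-]+");
    }

    public static void main(String[] args) {
        // Splits on the comma, colon, dashes, and spaces; Java's split
        // discards the trailing empty string after the "!".
        for (String t : split("Hello, world: this--is a test!")) {
            System.out.println("[" + t + "]");
        }
    }
}
```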

Madhu






Madhu Satyanarayana. Panitini
PASS GCA Solution Centre Pvt Ltd.
601 Aditya Trade Centre, Ameerpet, 
Hyderabad, India. 
www.pass-consulting.com 



-----Original Message-----
From: Paul Libbrecht [mailto:paul@activemath.org] 
Sent: Tuesday, September 13, 2005 3:40 PM
To: java-user@lucene.apache.org
Subject: Re: Splitting of words

Madhu,

Analyzer is the magic word here.

Lucene's StandardAnalyzer has a whole grammar for splitting text into
tokens. There are many more analyzers, most of which are language-specific
(e.g. based on the Snowball or Porter stemmers; see the contrib modules or
the core javadoc).

For which language do you wish to use it?

paul


On 13 Sep 05, at 11:45, Madhu Satyanarayana Panitini wrote:

> Hi all,
>
> I want to know how Lucene splits text before indexing. Does it split
> only wherever there is a space between words, or is there some other
> pattern for splitting the words of a text document? In which source
> file can I find the code that does the splitting?
>
> Madhu
>
> Madhu Satyanarayana. Panitini
> PASS GCA Solution Centre Pvt Ltd.
> 601 Aditya Trade Centre, Ameerpet,
> Hyderabad, India.
> www.pass-consulting.com
>
>
>




Re: Splitting of words

Posted by Paul Libbrecht <pa...@activemath.org>.
Madhu,

Analyzer is the magic word here.

Lucene's StandardAnalyzer has a whole grammar for splitting text into
tokens. There are many more analyzers, most of which are language-specific
(e.g. based on the Snowball or Porter stemmers; see the contrib modules or
the core javadoc).

For which language do you wish to use it?

paul
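The stemming step Paul alludes to can be illustrated crudely. Real Porter/Snowball stemmers apply ordered rule sets with conditions on the remaining stem; this toy sketch only strips a few suffixes, but it reproduces the kind of output shown later in the AnalyzerDemo run (analyzed -> analyz):

```java
public class StemSketch {
    // A toy stemmer: strips the first matching suffix, provided at
    // least three characters of stem remain.  Illustrative only.
    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ingly", "edly", "ing", "ed", "ly", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("analyzed"));  // analyz
        System.out.println(stem("words"));     // word
    }
}
```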


On 13 Sep 05, at 11:45, Madhu Satyanarayana Panitini wrote:

> Hi all,
>
> I want to know how Lucene splits text before indexing. Does it split
> only wherever there is a space between words, or is there some other
> pattern for splitting the words of a text document? In which source
> file can I find the code that does the splitting?
>
> Madhu
>
> Madhu Satyanarayana. Panitini
> PASS GCA Solution Centre Pvt Ltd.
> 601 Aditya Trade Centre, Ameerpet,
> Hyderabad, India.
> www.pass-consulting.com
>
>
>

