Posted to java-user@lucene.apache.org by Beady Geraghty <be...@gmail.com> on 2005/12/07 19:36:21 UTC

words with more than 1 hyphen ?

I am back to doing something with Lucene after a short break from it.

I am trying to index/search hyphenated words,
and retrieve them from a token stream.

1. I modified the StandardTokenizer.jj file.

   Essentially, I added the following to StandardTokenizer.jj
 | <HYPHENWORD1: (<LETTER>)+"-"(<LETTER>)+("-"<LETTER>)*>

2. I used JavaCC to get a set of .java files including a
    tokenizer.

3. I modified the generated tokenizer to use the
   org.apache.lucene.analysis.standard classes, such as Token and CharStream,
   instead of the ones provided by JavaCC.

4. I was able to index and retrieve words like
merry-go-round (as opposed to merry go round).  So, I
was quite happy.
Now I want to get "merry-go-round" from the token
stream.  And that doesn't seem to work.
Note that retrieving words with 1 hyphen seems to work,
but 2 hyphens seem to be a problem.

In getting the tokens from the stream, I get
"Merry-go-r" and "ound"  instead of "Merry-go-round"
"editor-in-c" and "hief"  instead of "editor-in-chief".
This behaviour is so strange: I don't know how
the indexer and query processing know about "merry-go-round"
when the TokenStream doesn't.

"green-monster" would work.  But not words with more than
one hyphen.

I tried two snippets of code; neither
returned the desired result:

Snippet 1:
 // rdr is a java.io.Reader over the document text, opened elsewhere
 MyStandardAnalyzer bsa = new MyStandardAnalyzer();
 TokenStream ts = bsa.tokenStream("content", rdr);
 // next() returns null once the stream is exhausted
 while (true) {
  org.apache.lucene.analysis.Token t = ts.next();
  if (t == null)
   break;
  System.out.println(t.termText());
 }

   MyStandardAnalyzer contains my special tokenizer generated from the new
.jj file.
   Essentially, it replaces StandardTokenizer with MyStandardTokenizer.
   Merry-go-round becomes "Merry-go-r" and "ound".


Snippet 2:
 StandardAnalyzer sa1 = new StandardAnalyzer();
 ts = sa1.tokenStream("content",
   new StringReader("this is a merry-go-round with 3 children"));
 while (true) {
  org.apache.lucene.analysis.Token t = ts.next();
  if (t == null)
   break;
  System.out.println(t.termText());
 }


   merry-go-round becomes 3 tokens: merry, go, round


Could someone give me some suggestions?
The reason I need the tokens is so that I can get the words before and after
the selected words to form some context.
(By the way, currently, I convert a hyphenated word into a phrase,
but to me, that seems like special casing hyphenated words, and I
just want to stay away from special casing.  People have been asking
for all sorts of punctuation, such as _ or / etc.  I thought that if I learn
how to modify the .jj files and produce the right tokens, I am better
off.)
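
For the context use case described above, a rough sketch may help. It assumes the token stream has already been drained into a List<String> (for example by collecting each termText() in the loops above). ContextWindowDemo and its context method are hypothetical names, not part of Lucene:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ContextWindowDemo {
    // Returns up to `window` tokens on each side of the first occurrence of
    // `target`, including the target itself; an empty list if it is absent.
    static List<String> context(List<String> tokens, String target, int window) {
        int hit = tokens.indexOf(target);
        if (hit < 0) {
            return Collections.emptyList();
        }
        int from = Math.max(0, hit - window);
        int to = Math.min(tokens.size(), hit + window + 1);
        return tokens.subList(from, to);
    }

    public static void main(String[] args) {
        // Tokens as they would look if the hyphenated word survives tokenization.
        List<String> tokens = Arrays.asList(
                "this", "is", "a", "merry-go-round", "with", "3", "children");
        System.out.println(context(tokens, "merry-go-round", 2));
    }
}
```

With a window of 2 this prints [is, a, merry-go-round, with, 3]. The lookup only works if the tokenizer emits the hyphenated word as one token, which is why the split into "Merry-go-r" and "ound" matters here.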


Thank you very much in advance.

Re: words with more than 1 hyphen ?

Posted by Beady Geraghty <be...@gmail.com>.

Thanks for the advice.

It is hard to say whether the usability folks want
to distinguish between "/usr/include" and "usr include".
Actually, I am sure that they would, but whether they would
accept "usr include" is the right question to ask :-)
I'll have to sort it out with them :-(

Thanks again.



Re: words with more than 1 hyphen ?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Dec 8, 2005, at 10:15 AM, Beady Geraghty wrote:
> Since someone suggested hyphen, the next request
> is underscore.  I can see more and more of these requests.
> Also, people might like to search for "/usr/include/wchar.h" (hence,
> the slash) and apostrophe etc. There really isn't a set of
> requirements upfront. In fact people want EVERYTHING if
> they could, and full flexibility (even though they don't know
> whether they will need it or not.)
> So it appears that doing something "general" is better.

Keep in mind that even if you tokenize "/usr/include" as "usr" and  
"include" that a query for "/usr/include" will still match if it is  
analyzed in a compatible manner.   It is important to realize that  
just because bits and pieces get eaten during analysis that it  
doesn't make things unfindable.  Sure, if someone literally wants to  
search for "/" only then it is important to keep upfront, but  
generally this is not the case.
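
A small self-contained sketch of that point, in plain Java rather than Lucene: the analyze method below is a hypothetical stand-in for an analyzer that keeps only runs of letters and digits, and matches is a hypothetical phrase check. Because document and query go through the same analysis, the slashes disappear on both sides and the query still hits:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompatibleAnalysisDemo {
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Hypothetical analyzer: keep letter/digit runs, lowercase them, drop the rest.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // True when the query's tokens occur as a consecutive run in the document's tokens.
    static boolean matches(String document, String query) {
        List<String> q = analyze(query);
        return !q.isEmpty() && Collections.indexOfSubList(analyze(document), q) >= 0;
    }

    public static void main(String[] args) {
        String doc = "See /usr/include/wchar.h for the declarations";
        System.out.println(analyze("/usr/include"));       // [usr, include]
        System.out.println(matches(doc, "/usr/include"));  // true
        System.out.println(matches(doc, "/"));             // false: nothing left to search for
    }
}
```

The flip side is that a query for "usr include" matches the same document, since both forms analyze to the same tokens.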

> I have been using StandardAnalyzer for the things you mentioned, like
> email address, and www.google.com or i.b.m.  Those are good things
> for me to have.  Since I've used it now, if I change it now, I
> might break other people's dependencies.

WhitespaceTokenizer followed by a LowerCaseFilter also catches the  
cases you mention.
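
To see roughly what that combination produces, here is a plain-Java approximation that does not use the Lucene classes themselves: it splits on whitespace and lowercases each piece. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WhitespaceLowercaseDemo {
    // Approximates a whitespace tokenizer followed by a lowercase filter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (!piece.isEmpty()) { // split() yields "" when the text starts with whitespace
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "Visit www.google.com or ride the Merry-go-round, then read /usr/include/wchar.h"));
    }
}
```

Host names, paths and hyphenated words come through as single tokens. Punctuation next to a word stays attached, though (the token above is "merry-go-round," with the comma), which is the kind of behavior worth pinning down in tests.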

Again, lots of tests to assert what the users expect is a strong  
recommendation I have with tokenization.

> If you do have a list of pitfalls from javaCC, could you point me  
> to it,
> that way, I can think about some of the potential issues and decide
> whether I should just abandon using javaCC ?

It all really depends.  JavaCC is complex, that is my only  
reservation for recommending it.  If you can get by with simpler  
analysis techniques, then I say go for those instead.

But, JavaCC is powerful.  It simply isn't necessary for the bulk of  
analysis cases I've come across.  Works great for parsing query  
expressions though.

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: words with more than 1 hyphen ?

Posted by Beady Geraghty <be...@gmail.com>.

Thank you for your answer.

I would rather not ask you such a "general" question, so that I can
understand more.
But I get random requests from people.  For example,
this request for hyphens originated from a colleague who is French,
and she believes that hyphens are important, though I don't
know whether existing users use hyphens or not.
It depends on who the users are.

Since someone suggested hyphen, the next request
is underscore.  I can see more and more of these requests.
Also, people might like to search for "/usr/include/wchar.h" (hence,
the slash) and apostrophe etc. There really isn't a set of requirements
upfront. In fact people want EVERYTHING if
they could, and full flexibility (even though they don't know
whether they will need it or not.)
So it appears that doing something "general" is better.

I have been using StandardAnalyzer for the things you mentioned, like
email address, and www.google.com or i.b.m.  Those are good things
for me to have.  Since I've used it now, if I change it now, I might
break other people's dependencies.

If you do have a list of pitfalls from javaCC, could you point me to it,
that way, I can think about some of the potential issues and decide
whether I should just abandon using javaCC ?

Thanks,


Re: words with more than 1 hyphen ?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Dec 7, 2005, at 9:08 PM, Beady Geraghty wrote:
> In general, do the rules in javaCC work pretty well.

In general, all answers would be too general to be useful :)

JavaCC is great - I'm using it for a custom query parser myself.  But
it's not for the faint of heart.  It may be more than you need, it
all depends.  The main thing StandardTokenizer does is keep e-mail  
addresses intact, and a few other fiddly things.

If you provide us with some sample text and how you want that  
tokenized, I'm sure we could offer suggestions.

>   Since
> there may be more requests to include punctuation
> in the search terms, I will have to keep modifying this .jj file.
> I wonder if there are things that I should watch out for before
> it gets overly complicated and I get stuck somewhere down the
> road ?

There are many pitfalls with JavaCC grammars.  It takes practice and  
unit tests to get this stuff right.  The same could be said of any  
style of tokenization.  Make lots of tests to ensure you don't break  
expected behavior as you tweak.
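
One lightweight way to set up such tests, sketched here without JUnit or Lucene so it runs on its own: keep a table of inputs and the tokens users expect, and run every candidate tokenizer against it. All names below are hypothetical, and the whitespace tokenizer is only a stand-in for whatever analyzer is actually under test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class TokenizerRegressionCheck {
    // Expected tokens per input; the cases are illustrative, taken from this thread.
    static Map<String, List<String>> expectations() {
        Map<String, List<String>> cases = new LinkedHashMap<>();
        cases.put("merry-go-round", Arrays.asList("merry-go-round"));
        cases.put("green-monster", Arrays.asList("green-monster"));
        cases.put("merry go round", Arrays.asList("merry", "go", "round"));
        return cases;
    }

    // Runs the tokenizer over every case and describes each mismatch.
    static List<String> failures(Function<String, List<String>> tokenizer) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : expectations().entrySet()) {
            List<String> actual = tokenizer.apply(e.getKey());
            if (!actual.equals(e.getValue())) {
                failures.add(e.getKey() + ": expected " + e.getValue() + " but got " + actual);
            }
        }
        return failures;
    }

    // Stand-in tokenizer so the sketch is runnable: split on whitespace only.
    static List<String> whitespaceTokenizer(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> failures = failures(TokenizerRegressionCheck::whitespaceTokenizer);
        System.out.println(failures.isEmpty() ? "all cases pass" : String.join("\n", failures));
    }
}
```

Each new punctuation request then becomes one more row in the table, and a grammar tweak that breaks an older case shows up immediately.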

	Erik


Re: words with more than 1 hyphen ?

Posted by Beady Geraghty <be...@gmail.com>.

Hi Erik,

Thank you so much for pointing out the error :-)
It should have been

 | <HYPHENWORD1: (<LETTER>)+"-"(<LETTER>)+("-"(<LETTER>)+)*>

I missed a pair of brackets for the 3rd LETTER (and a +).
I wonder how my indexer and query parser worked before,
but not the token stream.  Anyhow, it now seems to work with both
indexing/query parsing and reading off the token stream.
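
The difference between the two rules can be reproduced outside JavaCC. The sketch below translates them into java.util.regex patterns, assuming LETTER is roughly [A-Za-z] and that plain alphanumeric runs are the fallback token. It only approximates how the generated tokenizer chooses matches, and the class and constant names are invented for the illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HyphenRuleDemo {
    // Original rule: after the first hyphenated pair, each extra part is "-" plus ONE letter.
    static final String BUGGY = "[A-Za-z]+-[A-Za-z]+(?:-[A-Za-z])*";
    // Corrected rule: each extra part is "-" plus one or more letters.
    static final String FIXED = "[A-Za-z]+-[A-Za-z]+(?:-[A-Za-z]+)*";
    // Fallback for plain words and numbers (assumption, standing in for the other tokens).
    static final String PLAIN = "[A-Za-z0-9]+";

    // Scans the text left to right, trying the hyphen rule first, then the fallback.
    static List<String> tokenize(String hyphenRule, String text) {
        Matcher m = Pattern.compile(hyphenRule + "|" + PLAIN).matcher(text);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Merry-go-round editor-in-chief green-monster";
        System.out.println(tokenize(BUGGY, text));
        // [Merry-go-r, ound, editor-in-c, hief, green-monster]
        System.out.println(tokenize(FIXED, text));
        // [Merry-go-round, editor-in-chief, green-monster]
    }
}
```

With the original rule the repeated group ("-" followed by a single letter) consumes only "-r", so the rest of the word is emitted as a separate token. A word with one hyphen never reaches that group, which would explain why "green-monster" worked all along.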

In general, do the rules in javaCC work pretty well?  Since
there may be more requests to include punctuation
in the search terms, I will have to keep modifying this .jj file.
I wonder if there are things that I should watch out for before
it gets overly complicated and I get stuck somewhere down the
road ?

Thanks again.




Re: words with more than 1 hyphen ?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

> 1. I modified the StandardTokenizer.jj file.
>
>    Essentially, I added the following to StandardTokenizer.jj
>  | <HYPHENWORD1: (<LETTER>)+"-"(<LETTER>)+("-"<LETTER>)*>

Is that the only change you made to the .jj file?   Where did you put  
that exactly?

Don't you need a * after the second <LETTER>?

> 4. I was able to index and retrieve words like
> merry-go-round (as opposed to merry go round).  So, I
> was quite happy.
> Now I want to get "merry-go-round" from the token
> stream.  And that doesn't seem to work.
> Note that retrieving words with 1 hyphen seems to work,
> but 2 hyphens seem to be a problem.
>
> In getting the tokens from the stream, I get
> "Merry-go-r" and "ound"  instead of "Merry-go-round"
> "editor-in-c" and "hief"  instead of "editor-in-chief".
> This behaviour is so strange, and I don't know how
> the indexer and query processing knows about "merry-go-round",
> and yet the TokenStream doesn't.

I think the missing * above explains what you're seeing.

> "green-monster" would work.  But not words with more than
> one hyphen.

I'm surprised this one worked - maybe some other token in JavaCC is  
catching that?

JavaCC is perhaps overkill for what you want.  If you don't need any
of the other fancy analysis tricks that StandardTokenizer has, you
could just use WhitespaceAnalyzer plus a LowerCaseFilter, and voila, your
hyphenated tokens would come right out.

> (By the way, currently, I convert a hyphenated word into a phrase,
> but to me, that seems like special casing hyphenated words, and I
> just want to stay away from special casing.  People have been asking
> for all sorts of punctuation, such as _ or / etc.  I thought that
> if I learn how to modify the .jj files and produce the right tokens,
> I am better off.)

Unless you need the other features of StandardTokenizer, you may be
best staying away from JavaCC altogether.  It is its own complex
world that might be more than what you need.

	Erik

