Posted to dev@lucene.apache.org by Thimal Jayasooriya <th...@cs.york.ac.uk> on 2004/03/21 03:29:31 UTC
Token declared final ?
Hi all:
I have a question about the class structure of Tokens and
Tokenizers. Apologies, it's a bit long-winded :)
As part of my Masters research, I'm trying to use Lucene to store
different semantic classes found within documents. For this, I need to
first split sentences and then generate part of speech (POS) information
for each significant word found within a particular document. Through
separate libraries, I've already done the splitting and tagging tasks.
When I looked at the source for Token
(org.apache.lucene.analysis.Token), however, I found that it has been
declared final. I had intended to subclass Token to also keep a POS
marker and use it later within the Analyzer. Could someone please give
me some information on why Token was declared final? I am sure I've
missed something, but I can't see what it is. Alternatively, does it
make more sense to store the POS information elsewhere? I would
probably need it at index time only.
My original intention was to extend the Tokenizer
(org.apache.lucene.analysis.Tokenizer), get POS information, add it to
the token and then do the normal consumption of punctuation and so on
with JavaCC. Punctuation is necessary to recognize some named entities,
so I need to do this before those tokens are consumed. Is there a better
/ more logical place to perform POS tagging ?
Thanks,
Thimal
--
Thimal Jayasooriya,
Department of Computer Science,
The University of York
http://www.cs.york.ac.uk/~thimal/
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Token declared final ?
Posted by Thimal Jayasooriya <th...@cs.york.ac.uk>.
Hi Doug,
That's brilliant :) I didn't want to use an existing field because I
wasn't sure if there was anything that relied explicitly on type
returning the default "word". There might be a few cases where I would
have liked to store multiple tags (for words with slightly ambiguous
meanings), but I can sort that out. Thanks for the pointer and also for
taking the time to explain.
As a general matter, would anyone else be interested in having POS
information for Tokens ? I use one library which isn't open sourced for
tagging (QTag), but I'd be happy to contribute the interface code if
anyone feels they could use it.
More info on the tools I use can be found here :
http://www-users.cs.york.ac.uk/~thimal/tools.php
If you have or know of an open source tagger, I'd be keen on making my
code play nicely with it too :)
Regards,
Thimal
Doug Cutting wrote:
> The 'type' field of Token would be a good place for Part-of-Speech.
> Does that work for you? If not, perhaps we should make Token non-final.
>
> As has been discussed before, Lucene uses final for two reasons. The
> first is historical: long ago it used to make things faster by
> permitting javac to inline things. The second is that some classes
> are not designed to be subclassed, e.g., subclassing Field or Document
> will generally cause more confusion than it will simplify an
> application. The problem is sometimes determining which case is which.
>
> Doug
>
> Thimal Jayasooriya wrote:
<snipped parts of the original mail>
>> When I looked at the source for Token
>> (org.apache.lucene.analysis.Token), however, I found that it has been
>> declared final. I had intended to subclass Token to also keep a POS
>> marker and use it later within the Analyzer. Could someone please
>> give me some information on why Token was declared final? I am
>> sure I've missed something, but I can't see what it is. Alternatively,
>> does it make more sense to store the POS information elsewhere? I
>> would probably need it at index time only.
>>
--
Thimal Jayasooriya,
Department of Computer Science,
The University of York
http://www.cs.york.ac.uk/~thimal/
Re: Token declared final ?
Posted by Incze Lajos <in...@mail.matav.hu>.
On Tue, Mar 23, 2004 at 09:11:36AM -0800, Doug Cutting wrote:
> The 'type' field of Token would be a good place for Part-of-Speech.
> Does that work for you? If not, perhaps we should make Token non-final.
>
> As has been discussed before, Lucene uses final for two reasons. The
> first is historical: long ago it used to make things faster by
> permitting javac to inline things. The second is that some classes are
> not designed to be subclassed, e.g., subclassing Field or Document will
> generally cause more confusion than it will simplify an application.
> The problem is sometimes determining which case is which.
>
> Doug
Wouldn't it be worth defining a general-purpose "Object data"
free field for the Token? I'm using type to hold some "A = B"
style properties, but in general this is neither convenient nor
does it scale well.
incze
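
To make the "A = B" workaround concrete, here is a minimal sketch of packing several properties into a single type string and unpacking them again. The class name, the method names and the ';' and '=' separators are all assumptions made for illustration; nothing here is part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: packs key/value properties into one string that
// could be stored in a token's type field, and unpacks them again.
class TypePropertiesSketch {

    // Joins the entries as "key=value" pairs separated by ';'.
    // Assumes keys and values contain neither ';' nor '='.
    static String encode(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Splits the string back into a map, preserving the original order.
    static Map<String, String> decode(String type) {
        Map<String, String> props = new LinkedHashMap<>();
        if (type.isEmpty()) {
            return props;
        }
        for (String pair : type.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments that are not key=value pairs
            }
            props.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("pos", "NN");
        props.put("sense", "2");
        String packed = encode(props);
        System.out.println(packed);
        System.out.println(decode(packed));
    }
}
```

Every consumer has to parse the string again, and any value containing a separator breaks the scheme, which is the inconvenience described above.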
Re: Token declared final ?
Posted by Doug Cutting <cu...@apache.org>.
The 'type' field of Token would be a good place for Part-of-Speech.
Does that work for you? If not, perhaps we should make Token non-final.
As has been discussed before, Lucene uses final for two reasons. The
first is historical: long ago it used to make things faster by
permitting javac to inline things. The second is that some classes are
not designed to be subclassed, e.g., subclassing Field or Document will
generally cause more confusion than it will simplify an application.
The problem is sometimes determining which case is which.
Doug
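
A minimal sketch of the suggestion above: carry the POS tag in the token's type string instead of subclassing. SketchToken is a hypothetical stand-in that only mirrors the idea of a final token with text, offsets and a type, so the example runs without Lucene on the classpath. The method names, the Penn-style tags and the single-space offset rule are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a final token class: text, offsets and a
// free-form type string. Not Lucene's actual Token.
final class SketchToken {
    private final String text;
    private final int start;
    private final int end;
    private final String type;

    SketchToken(String text, int start, int end, String type) {
        this.text = text;
        this.start = start;
        this.end = end;
        this.type = type;
    }

    String termText() { return text; }
    int startOffset() { return start; }
    int endOffset() { return end; }
    String type() { return type; }
}

class PosTypeSketch {

    // Pairs each word with the tag an external tagger produced for it and
    // stores that tag in the token's type. Offsets assume the words were
    // separated by exactly one space in the original text.
    static List<SketchToken> tagTokens(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("one tag per word is required");
        }
        List<SketchToken> tokens = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i < words.length; i++) {
            int end = offset + words[i].length();
            tokens.add(new SketchToken(words[i], offset, end, tags[i]));
            offset = end + 1; // skip the assumed single space
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] words = {"Lucene", "indexes", "documents"};
        String[] tags = {"NNP", "VBZ", "NNS"};
        // Prints one word/TAG pair per line.
        for (SketchToken t : tagTokens(words, tags)) {
            System.out.println(t.termText() + "/" + t.type());
        }
    }
}
```

A later stage in the analysis chain can then branch on type() without needing a Token subclass. Code that expects the default "word" type would have to be checked, as raised elsewhere in this thread.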