You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2009/11/22 09:36:52 UTC

Problematic documentation in o.a.l.analysis.package.html

Hi

I've read the analysis package.html and I found two issues:

1) The code sample under Invoking the Analyzer is broken. It calls
incrementToken() but inside the while it prints 'ts' (which is TokenStream)
and then do "t = ts.next()", which no longer works. That's an easy fix, so I
don't think a JIRA issue is needed.

2) The documentation specifies that "Even consumers of TokenStreams should
normally call addAttribute() instead of getAttribute(), because it would not
fail if the TokenStream does not have this Attribute". IMO this is wrong and
will give the wrong impression about how this API should be used. What if
the TokenStream does not care about this attribute? It will not fill it with
any information. The example with LengthFilter which calls addAttribute
instead of has/getAttribute is a good one regarding why you shouldn't just
call addAttribute. LegthFilter relies on the given TokenStream to fill
TermAttribute with some information, so that it can later filter out terms
of length < threshold. But what if I create a LengthFilter and give it a
TokenStream which creates just PartOfSpeechAttribute? Or output terms that
are not TermAttribute? Obviously it would be silly for me to do it, but no
one restricts me from doing so. LengthFilter should either document that it
expects TermAttribute to be returned from the input TokenStream, or better
yet, enforce it in the constructor --> if you pass a TokenStream that does
not return TermAttribute, throw an IllegalArgumentException.

But anyway, the current documentation is, IMO, wrong and may lead to wrong
impression. I don't know if this warrants a larger issue to investigate all
the current TokenFilters and validate the input TokenStream. In my filters,
I enforce the existence of a certain attribute. If I've misunderstood
something, please correct me.

3) I think it would help if there will be some documentation/example about
how TokenFilters are expected to process an Attribute before they return it.
For example, if I have a TokenFilter which processes a certain TermAttribute
by returning two other TermAttributes, then according to my understanding,
upon calling incrementToken() it should:
3.1) If first call, clone the TokenStream's TermAttribute in an instance
variable. Then process it and store in the TokenStream's TermAttribute the
first TA it should return.
3.2) If second call, process it again and store the second TA in the
TokenStream's TermAttribute.
That's because the consumer will call incrementToken and then getAttribute.
That getAttribute will return the TokenStream's attribute and not the
filter's. I think I've read it somewhere, but it doesn't appear in this
package.html.

Shai

RE: Problematic documentation in o.a.l.analysis.package.html

Posted by Uwe Schindler <uw...@thetaphi.de>.
Abvout clearAttributes: Just the warning if your clear the attributes one by
one, you have two problems:

-          you can only clear attributes you know about. E.g. most
Tokenizers just set TermAttribute and OffsetAttribute (because only these
two attributes are interesting). The PosIncr attribute does not need to be
touched, as it defaults to 1, which is correct for generating Tokens from a
Reader. If you only clear those two attributes in the Tokenizer, the posIncr
attribute is not resetet. If you add a StopFilter or Sysnonymfilter after
this tokenizer,  who uses the PosIncr attribute, it would not see this
attribute reseted. This leads to the bug of the SmartChineseAnalyzer, that
exactly missed to reset *all* attributes, and therefore the StopFilter
incremented the PosIncr each time, and suddenly overflowing with invalid
values. So please, whenever you inject new Tokens, call clearAttributes on
the TokenStream.

-          If you clear the attributes one by one, you may clear them
multiple times. If all 6 attributes are implemented by Token, and you clear
them one-by one, Token.clear() is called 6 times. Not very effective.
Because of that, no attribute interface contains copyTo() and clone() or
equals/hashcode.

 

I know the documentation can be improved. Sorry about that, but you are
tyring to use expert features. Normal TokenFilters just add some attributes
in the ctor and use them. Token may be the implementation behind, but you
cannot rely on that!!! If the Tokenizer already added attributes to it, the
Token impl added by the Filter will never be added (because the interfaces
are still available). So Just for "keeping" Token, do *not* try to cast all
attributes to Token, this will fail (see next mail).

 

Uwe

 

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Sunday, November 22, 2009 12:06 PM
To: java-dev@lucene.apache.org
Subject: Re: Problematic documentation in o.a.l.analysis.package.html

 

Perhaps copyTo works for me because I reference Token, but like I said it's
working for me ...

Thanks for the tip regarding clearAttributes(). I assume I'll get the same
behavior if I clear the attributes one by one, defaulting their values to
whatever are my defaults.

Well ... IMO as a user, there is a semantic difference. I understand that
eventually both are the same. And even though I don't understand the new TS
API as much as you do, I think I have a fair understanding of it. And I also
think that naive users will be confused ... In Java, when you call
object.dosomething vs. dosomething, you understand that the former is
performed on object and the latter on your instance. That that you (in the
new API code) made sure both are the same is invisible to the user, unless
he/she carefully reads the javadocs. The current impl is smart and useful, I
just think some better documentation can help new users.

Shai

On Sun, Nov 22, 2009 at 12:51 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

About (3): CopyTo is also only availabe for AttributeImpls, but not for the
interfaces (you have to cast first) and then you are warned. If you copyTo()
on a TernmAttribute, it may also copy other attributes with it, if
TermAttribute and PosIncr. Attribute are all implemented by the same
AttributeImpl. With captureState you can be sure to have copyied all
attributes.

 

There is one important thing: *always* ever call
AttributeSource.clearAttributes when you generate new tokens in
incrementToken. This affects mainly Tokenizers (they always have to
clearAttributes, see docs), but if you inject tokens inside a TokenFilter,
always call clearAttributes. If you do not do this and e.g. only Touch the
TermAttribute, the PosIncr may stay the same as before and does not get the
default, so e.g. StopFilter gets broken when you inject Zokens (the
SmartChinese analyzer had that problem).

 

About (2). There is no semantic difference, the call to input.addAttribute()
does exactly the same like this.addAttribute() because all TokenFilters in a
chain share the same attributes. It is exactly the same effect (you get the
same instance of the attribute) and so they are semantically and in reality
the same. If you think this is different, you have not understood the whole
attributes API.

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Sunday, November 22, 2009 11:39 AM


To: java-dev@lucene.apache.org

Subject: Re: Problematic documentation in o.a.l.analysis.package.html

 

Thanks Uwe.

About (3), I use copyTo, not clone. I used the word 'clone' just out of
habit. I'll read more about captureState, but I think copyTo works fine for
me.

Abour (2), I still think it's confusing. When I read addAttribute, I get an
impression as if by calling this method, it is guaranteed for me that that
attribute will be processed somehow by the input TokenStream. And this may
not be true. There is a semantic difference between: addAttribute, and
input.addAttribute. The former means I add the attribute to *my* TS
instance, while the latter to the input TS. Even though both are the same
(as TokenFilter attributes share the same instance of TokenStream
attributes), these are sematically different. I'd expect to do the former
just to ensure this attribute is in the 'attributes' map, while the latter
to ensure the input TS will process it.

BTW, remember that whatever looks clear to you, the implementer of this, may
not be cleared to users. I still think that the current documentation may
confuse people, and adding an extra "NOTE: this does not ensure the input TS
will do anything with the passed Attribute. If need to, call
input.hasAttribute to determine if that attribute is handled by the inpu TS"
won't hurt.

Shai

On Sun, Nov 22, 2009 at 12:12 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

Hi Shai,

 

Thanks for the suggestions!

 

About your points: 

1)       This is really wrong, we can easily fix it for 3.1. Lucene 3.0 is
already in the vote phase and 2.9x is also already out.

2)       Maybe the explanation is not so good. This text comes especially
from the 2.9 old to new TS migration. The current indexer normally uses all
attributes and the default Token implementation implements all six (in 2.9).
We had a lot of TokenStreams in core that used getAttribute without
checking, if the attribute is really there. With Token/TokenWrapper as
implementation, this was always ok, but as soon as you switched to new API
only, the attributes were implemented by single instances, so after adding a
TermAttribute you were not be sure, to also have a PosIncr attribute. Even
the indexer of Lucene had that problem. So for the basic six attributes you
can always use addAttribute(), because if you not add the attribute in the
instantiation of the stream, all attributes are added by the indexer. This
applies to TermAttribute, PosIncrementAttribute and OffsetAttribute (if TV
are enabled). So if the chain does not add them, they are added by the
indexer (consumer). Because all attributes have a default value, this is no
problem and the indexer code is even faster (e.g. it does not make sense to
every time check for a PositionIncrementAttribute, as reading the default
int value of 1 is read in the same speed, even faster without the extra null
check. And with current index format it is always tracked and indexed). If
you have own attribute, it may correctly make sense to check with
hasAttribute() and throw IAE. But after this check, you can also use
addAttribute() to get the reference to the attribute (there were a
discussion about removing getAttribute at all). getAttribute and
addAttribute have the same speed when fetching an already existing
attribute. So it is up to yours, how you would implement your tokenstreams,
that are just hints for beginners. The thing with hasAttribute() and
addAttribute() is also mentioned in the docs a few lines down at the
"optimization" part.

3)       Attribute interfaces cannot be cloned (the Attribute interface has
no clone()). You can only clone implementations (AttributeImpl), but it may
happen that you not only clone the TermAttribute, but some other attributes
also (e.g. if you use Token as impl, like 2.9 does by default for backwards
compatinility). Because of that, attribute instances cannot be simply
cloned, only their implementing classes. What you generally should do (see
caching TokenFilter): use AttributeSource.captureState() to capture all
current attributes in the filter chain. In incrementToken(), you can restore
these captured states again into the existing attributes. Such states can be
handles in local instance variables or Lists<State> etc.
The second point I do not understand: Consumers always call addAttribute
(maybe hasAttribute/getAttribute) exactly one time before consuming! So they
only have the current attribute instance, which can never change during the
lifetime of a TokenStream chain. Also all
TokenFilters/TokenStreams/Tokenizers use the same instances! Because of this
the TS chain works. If every TokenStream would have its own attributes, they
could not communicate! The important thing is, that you cannot relay on the
fact that a chain may contain attributes or not, because even if you created
a the Tokenizer and one TokenFilter in the chain and checked for
hasAttribute() in the ctor of the filter, it may happen, that the third
TokenFilter adds a new Attribute, which is then suddenly available (but the
second filter does not see it - ok, it may not need to know about it,
because the Tokenizer will never set it). Because of that its always best to
use addAttribute, if you are "interested" on an attribute. If it does not
exist, you just get default values (like termLength()==0 for TermAttribute,
posIncr==1,.).

 

Hope that explains your questions.

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Sunday, November 22, 2009 9:37 AM
To: java-dev@lucene.apache.org
Subject: Problematic documentation in o.a.l.analysis.package.html

 

Hi

I've read the analysis package.html and I found two issues:

1) The code sample under Invoking the Analyzer is broken. It calls
incrementToken() but inside the while it prints 'ts' (which is TokenStream)
and then do "t = ts.next()", which no longer works. That's an easy fix, so I
don't think a JIRA issue is needed.

2) The documentation specifies that "Even consumers of TokenStreams should
normally call addAttribute() instead of getAttribute(), because it would not
fail if the TokenStream does not have this Attribute". IMO this is wrong and
will give the wrong impression about how this API should be used. What if
the TokenStream does not care about this attribute? It will not fill it with
any information. The example with LengthFilter which calls addAttribute
instead of has/getAttribute is a good one regarding why you shouldn't just
call addAttribute. LegthFilter relies on the given TokenStream to fill
TermAttribute with some information, so that it can later filter out terms
of length < threshold. But what if I create a LengthFilter and give it a
TokenStream which creates just PartOfSpeechAttribute? Or output terms that
are not TermAttribute? Obviously it would be silly for me to do it, but no
one restricts me from doing so. LengthFilter should either document that it
expects TermAttribute to be returned from the input TokenStream, or better
yet, enforce it in the constructor --> if you pass a TokenStream that does
not return TermAttribute, throw an IllegalArgumentException.

But anyway, the current documentation is, IMO, wrong and may lead to wrong
impression. I don't know if this warrants a larger issue to investigate all
the current TokenFilters and validate the input TokenStream. In my filters,
I enforce the existence of a certain attribute. If I've misunderstood
something, please correct me.

3) I think it would help if there will be some documentation/example about
how TokenFilters are expected to process an Attribute before they return it.
For example, if I have a TokenFilter which processes a certain TermAttribute
by returning two other TermAttributes, then according to my understanding,
upon calling incrementToken() it should:
3.1) If first call, clone the TokenStream's TermAttribute in an instance
variable. Then process it and store in the TokenStream's TermAttribute the
first TA it should return.
3.2) If second call, process it again and store the second TA in the
TokenStream's TermAttribute.
That's because the consumer will call incrementToken and then getAttribute.
That getAttribute will return the TokenStream's attribute and not the
filter's. I think I've read it somewhere, but it doesn't appear in this
package.html.

Shai

 

 


Re: Problematic documentation in o.a.l.analysis.package.html

Posted by Shai Erera <se...@gmail.com>.
Perhaps copyTo works for me because I reference Token, but like I said it's
working for me ...

Thanks for the tip regarding clearAttributes(). I assume I'll get the same
behavior if I clear the attributes one by one, defaulting their values to
whatever are my defaults.

Well ... IMO as a user, there is a semantic difference. I understand that
eventually both are the same. And even though I don't understand the new TS
API as much as you do, I think I have a fair understanding of it. And I also
think that naive users will be confused ... In Java, when you call
object.dosomething vs. dosomething, you understand that the former is
performed on object and the latter on your instance. That that you (in the
new API code) made sure both are the same is invisible to the user, unless
he/she carefully reads the javadocs. The current impl is smart and useful, I
just think some better documentation can help new users.

Shai

On Sun, Nov 22, 2009 at 12:51 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

>  About (3): CopyTo is also only availabe for AttributeImpls, but not for
> the interfaces (you have to cast first) and then you are warned. If you
> copyTo() on a TernmAttribute, it may also copy other attributes with it, if
> TermAttribute and PosIncr. Attribute are all implemented by the same
> AttributeImpl. With captureState you can be sure to have copyied all
> attributes.
>
>
>
> There is one important thing: **always** ever call
> AttributeSource.clearAttributes when you generate new tokens in
> incrementToken. This affects mainly Tokenizers (they always have to
> clearAttributes, see docs), but if you inject tokens inside a TokenFilter,
> always call clearAttributes. If you do not do this and e.g. only Touch the
> TermAttribute, the PosIncr may stay the same as before and does not get the
> default, so e.g. StopFilter gets broken when you inject Zokens (the
> SmartChinese analyzer had that problem).
>
>
>
> About (2). There is no semantic difference, the call to
> input.addAttribute() does exactly the same like this.addAttribute() because
> all TokenFilters in a chain share the same attributes. It is exactly the
> same effect (you get the same instance of the attribute) and so they are
> semantically and in reality the same. If you think this is different, you
> have not understood the whole attributes API.
>
>
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Sunday, November 22, 2009 11:39 AM
>
> *To:* java-dev@lucene.apache.org
> *Subject:* Re: Problematic documentation in o.a.l.analysis.package.html
>
>
>
> Thanks Uwe.
>
> About (3), I use copyTo, not clone. I used the word 'clone' just out of
> habit. I'll read more about captureState, but I think copyTo works fine for
> me.
>
> Abour (2), I still think it's confusing. When I read addAttribute, I get an
> impression as if by calling this method, it is guaranteed for me that that
> attribute will be processed somehow by the input TokenStream. And this may
> not be true. There is a semantic difference between: addAttribute, and
> input.addAttribute. The former means I add the attribute to *my* TS
> instance, while the latter to the input TS. Even though both are the same
> (as TokenFilter attributes share the same instance of TokenStream
> attributes), these are sematically different. I'd expect to do the former
> just to ensure this attribute is in the 'attributes' map, while the latter
> to ensure the input TS will process it.
>
> BTW, remember that whatever looks clear to you, the implementer of this,
> may not be cleared to users. I still think that the current documentation
> may confuse people, and adding an extra "NOTE: this does not ensure the
> input TS will do anything with the passed Attribute. If need to, call
> input.hasAttribute to determine if that attribute is handled by the inpu TS"
> won't hurt.
>
> Shai
>
> On Sun, Nov 22, 2009 at 12:12 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>
> Hi Shai,
>
>
>
> Thanks for the suggestions!
>
>
>
> About your points:
>
> 1)       This is really wrong, we can easily fix it for 3.1. Lucene 3.0 is
> already in the vote phase and 2.9x is also already out.
>
> 2)       Maybe the explanation is not so good. This text comes especially
> from the 2.9 old to new TS migration. The current indexer normally uses all
> attributes and the default Token implementation implements all six (in 2.9).
> We had a lot of TokenStreams in core that used getAttribute without
> checking, if the attribute is really there. With Token/TokenWrapper as
> implementation, this was always ok, but as soon as you switched to new API
> only, the attributes were implemented by single instances, so after adding a
> TermAttribute you were not be sure, to also have a PosIncr attribute. Even
> the indexer of Lucene had that problem. So for the basic six attributes you
> can always use addAttribute(), because if you not add the attribute in the
> instantiation of the stream, all attributes are added by the indexer. This
> applies to TermAttribute, PosIncrementAttribute and OffsetAttribute (if TV
> are enabled). So if the chain does not add them, they are added by the
> indexer (consumer). Because all attributes have a default value, this is no
> problem and the indexer code is even faster (e.g. it does not make sense to
> every time check for a PositionIncrementAttribute, as reading the default
> int value of 1 is read in the same speed, even faster without the extra null
> check. And with current index format it is always tracked and indexed). If
> you have own attribute, it may correctly make sense to check with
> hasAttribute() and throw IAE. But after this check, you can also use
> addAttribute() to get the reference to the attribute (there were a
> discussion about removing getAttribute at all). getAttribute and
> addAttribute have the same speed when fetching an already existing
> attribute. So it is up to yours, how you would implement your tokenstreams,
> that are just hints for beginners. The thing with hasAttribute() and
> addAttribute() is also mentioned in the docs a few lines down at the
> “optimization” part.
>
> 3)       Attribute interfaces cannot be cloned (the Attribute interface
> has no clone()). You can only clone implementations (AttributeImpl), but it
> may happen that you not only clone the TermAttribute, but some other
> attributes also (e.g. if you use Token as impl, like 2.9 does by default for
> backwards compatinility). Because of that, attribute instances cannot be
> simply cloned, only their implementing classes. What you generally should do
> (see caching TokenFilter): use AttributeSource.captureState() to capture all
> current attributes in the filter chain. In incrementToken(), you can restore
> these captured states again into the existing attributes. Such states can be
> handles in local instance variables or Lists<State> etc.
> The second point I do not understand: Consumers always call addAttribute
> (maybe hasAttribute/getAttribute) exactly one time before consuming! So they
> only have the current attribute instance, which can never change during the
> lifetime of a TokenStream chain. Also all
> TokenFilters/TokenStreams/Tokenizers use the same instances! Because of this
> the TS chain works. If every TokenStream would have its own attributes, they
> could not communicate! The important thing is, that you cannot relay on the
> fact that a chain may contain attributes or not, because even if you created
> a the Tokenizer and one TokenFilter in the chain and checked for
> hasAttribute() in the ctor of the filter, it may happen, that the third
> TokenFilter adds a new Attribute, which is then suddenly available (but the
> second filter does not see it – ok, it may not need to know about it,
> because the Tokenizer will never set it). Because of that its always best to
> use addAttribute, if you are “interested” on an attribute. If it does not
> exist, you just get default values (like termLength()==0 for TermAttribute,
> posIncr==1,…).
>
>
>
> Hope that explains your questions.
>
>
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Sunday, November 22, 2009 9:37 AM
> *To:* java-dev@lucene.apache.org
> *Subject:* Problematic documentation in o.a.l.analysis.package.html
>
>
>
> Hi
>
> I've read the analysis package.html and I found two issues:
>
> 1) The code sample under Invoking the Analyzer is broken. It calls
> incrementToken() but inside the while it prints 'ts' (which is TokenStream)
> and then do "t = ts.next()", which no longer works. That's an easy fix, so I
> don't think a JIRA issue is needed.
>
> 2) The documentation specifies that "Even consumers of TokenStreams should
> normally call addAttribute() instead of getAttribute(), because it would not
> fail if the TokenStream does not have this Attribute". IMO this is wrong and
> will give the wrong impression about how this API should be used. What if
> the TokenStream does not care about this attribute? It will not fill it with
> any information. The example with LengthFilter which calls addAttribute
> instead of has/getAttribute is a good one regarding why you shouldn't just
> call addAttribute. LegthFilter relies on the given TokenStream to fill
> TermAttribute with some information, so that it can later filter out terms
> of length < threshold. But what if I create a LengthFilter and give it a
> TokenStream which creates just PartOfSpeechAttribute? Or output terms that
> are not TermAttribute? Obviously it would be silly for me to do it, but no
> one restricts me from doing so. LengthFilter should either document that it
> expects TermAttribute to be returned from the input TokenStream, or better
> yet, enforce it in the constructor --> if you pass a TokenStream that does
> not return TermAttribute, throw an IllegalArgumentException.
>
> But anyway, the current documentation is, IMO, wrong and may lead to wrong
> impression. I don't know if this warrants a larger issue to investigate all
> the current TokenFilters and validate the input TokenStream. In my filters,
> I enforce the existence of a certain attribute. If I've misunderstood
> something, please correct me.
>
> 3) I think it would help if there will be some documentation/example about
> how TokenFilters are expected to process an Attribute before they return it.
> For example, if I have a TokenFilter which processes a certain TermAttribute
> by returning two other TermAttributes, then according to my understanding,
> upon calling incrementToken() it should:
> 3.1) If first call, clone the TokenStream's TermAttribute in an instance
> variable. Then process it and store in the TokenStream's TermAttribute the
> first TA it should return.
> 3.2) If second call, process it again and store the second TA in the
> TokenStream's TermAttribute.
> That's because the consumer will call incrementToken and then getAttribute.
> That getAttribute will return the TokenStream's attribute and not the
> filter's. I think I've read it somewhere, but it doesn't appear in this
> package.html.
>
> Shai
>
>
>

RE: Problematic documentation in o.a.l.analysis.package.html

Posted by Uwe Schindler <uw...@thetaphi.de>.
About (3): CopyTo is also only availabe for AttributeImpls, but not for the
interfaces (you have to cast first) and then you are warned. If you copyTo()
on a TernmAttribute, it may also copy other attributes with it, if
TermAttribute and PosIncr. Attribute are all implemented by the same
AttributeImpl. With captureState you can be sure to have copyied all
attributes.

 

There is one important thing: *always* ever call
AttributeSource.clearAttributes when you generate new tokens in
incrementToken. This affects mainly Tokenizers (they always have to
clearAttributes, see docs), but if you inject tokens inside a TokenFilter,
always call clearAttributes. If you do not do this and e.g. only Touch the
TermAttribute, the PosIncr may stay the same as before and does not get the
default, so e.g. StopFilter gets broken when you inject Zokens (the
SmartChinese analyzer had that problem).

 

About (2). There is no semantic difference, the call to input.addAttribute()
does exactly the same like this.addAttribute() because all TokenFilters in a
chain share the same attributes. It is exactly the same effect (you get the
same instance of the attribute) and so they are semantically and in reality
the same. If you think this is different, you have not understood the whole
attributes API.

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Sunday, November 22, 2009 11:39 AM
To: java-dev@lucene.apache.org
Subject: Re: Problematic documentation in o.a.l.analysis.package.html

 

Thanks Uwe.

About (3), I use copyTo, not clone. I used the word 'clone' just out of
habit. I'll read more about captureState, but I think copyTo works fine for
me.

Abour (2), I still think it's confusing. When I read addAttribute, I get an
impression as if by calling this method, it is guaranteed for me that that
attribute will be processed somehow by the input TokenStream. And this may
not be true. There is a semantic difference between: addAttribute, and
input.addAttribute. The former means I add the attribute to *my* TS
instance, while the latter to the input TS. Even though both are the same
(as TokenFilter attributes share the same instance of TokenStream
attributes), these are sematically different. I'd expect to do the former
just to ensure this attribute is in the 'attributes' map, while the latter
to ensure the input TS will process it.

BTW, remember that whatever looks clear to you, the implementer of this, may
not be cleared to users. I still think that the current documentation may
confuse people, and adding an extra "NOTE: this does not ensure the input TS
will do anything with the passed Attribute. If need to, call
input.hasAttribute to determine if that attribute is handled by the inpu TS"
won't hurt.

Shai

On Sun, Nov 22, 2009 at 12:12 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

Hi Shai,

 

Thanks for the suggestions!

 

About your points: 

1)       This is really wrong, we can easily fix it for 3.1. Lucene 3.0 is
already in the vote phase and 2.9x is also already out.

2)       Maybe the explanation is not so good. This text comes especially
from the 2.9 old to new TS migration. The current indexer normally uses all
attributes and the default Token implementation implements all six (in 2.9).
We had a lot of TokenStreams in core that used getAttribute without
checking, if the attribute is really there. With Token/TokenWrapper as
implementation, this was always ok, but as soon as you switched to new API
only, the attributes were implemented by single instances, so after adding a
TermAttribute you were not be sure, to also have a PosIncr attribute. Even
the indexer of Lucene had that problem. So for the basic six attributes you
can always use addAttribute(), because if you not add the attribute in the
instantiation of the stream, all attributes are added by the indexer. This
applies to TermAttribute, PosIncrementAttribute and OffsetAttribute (if TV
are enabled). So if the chain does not add them, they are added by the
indexer (consumer). Because all attributes have a default value, this is no
problem and the indexer code is even faster (e.g. it does not make sense to
every time check for a PositionIncrementAttribute, as reading the default
int value of 1 is read in the same speed, even faster without the extra null
check. And with current index format it is always tracked and indexed). If
you have own attribute, it may correctly make sense to check with
hasAttribute() and throw IAE. But after this check, you can also use
addAttribute() to get the reference to the attribute (there were a
discussion about removing getAttribute at all). getAttribute and
addAttribute have the same speed when fetching an already existing
attribute. So it is up to yours, how you would implement your tokenstreams,
that are just hints for beginners. The thing with hasAttribute() and
addAttribute() is also mentioned in the docs a few lines down at the
"optimization" part.

3)       Attribute interfaces cannot be cloned (the Attribute interface has
no clone()). You can only clone implementations (AttributeImpl), but it may
happen that you not only clone the TermAttribute, but some other attributes
also (e.g. if you use Token as impl, like 2.9 does by default for backwards
compatinility). Because of that, attribute instances cannot be simply
cloned, only their implementing classes. What you generally should do (see
caching TokenFilter): use AttributeSource.captureState() to capture all
current attributes in the filter chain. In incrementToken(), you can restore
these captured states again into the existing attributes. Such states can be
handles in local instance variables or Lists<State> etc.
The second point I do not understand: Consumers always call addAttribute
(maybe hasAttribute/getAttribute) exactly one time before consuming! So they
only have the current attribute instance, which can never change during the
lifetime of a TokenStream chain. Also all
TokenFilters/TokenStreams/Tokenizers use the same instances! Because of this
the TS chain works. If every TokenStream would have its own attributes, they
could not communicate! The important thing is, that you cannot relay on the
fact that a chain may contain attributes or not, because even if you created
a the Tokenizer and one TokenFilter in the chain and checked for
hasAttribute() in the ctor of the filter, it may happen, that the third
TokenFilter adds a new Attribute, which is then suddenly available (but the
second filter does not see it - ok, it may not need to know about it,
because the Tokenizer will never set it). Because of that its always best to
use addAttribute, if you are "interested" on an attribute. If it does not
exist, you just get default values (like termLength()==0 for TermAttribute,
posIncr==1,.).

 

Hope that explains your questions.

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Sunday, November 22, 2009 9:37 AM
To: java-dev@lucene.apache.org
Subject: Problematic documentation in o.a.l.analysis.package.html

 

Hi

I've read the analysis package.html and I found two issues:

1) The code sample under Invoking the Analyzer is broken. It calls
incrementToken() but inside the while it prints 'ts' (which is TokenStream)
and then do "t = ts.next()", which no longer works. That's an easy fix, so I
don't think a JIRA issue is needed.

2) The documentation specifies that "Even consumers of TokenStreams should
normally call addAttribute() instead of getAttribute(), because it would not
fail if the TokenStream does not have this Attribute". IMO this is wrong and
will give the wrong impression about how this API should be used. What if
the TokenStream does not care about this attribute? It will not fill it with
any information. The example with LengthFilter which calls addAttribute
instead of has/getAttribute is a good one regarding why you shouldn't just
call addAttribute. LegthFilter relies on the given TokenStream to fill
TermAttribute with some information, so that it can later filter out terms
of length < threshold. But what if I create a LengthFilter and give it a
TokenStream which creates just PartOfSpeechAttribute? Or output terms that
are not TermAttribute? Obviously it would be silly for me to do it, but no
one restricts me from doing so. LengthFilter should either document that it
expects TermAttribute to be returned from the input TokenStream, or better
yet, enforce it in the constructor --> if you pass a TokenStream that does
not return TermAttribute, throw an IllegalArgumentException.

But anyway, the current documentation is, IMO, wrong and may lead to wrong
impression. I don't know if this warrants a larger issue to investigate all
the current TokenFilters and validate the input TokenStream. In my filters,
I enforce the existence of a certain attribute. If I've misunderstood
something, please correct me.

3) I think it would help if there will be some documentation/example about
how TokenFilters are expected to process an Attribute before they return it.
For example, if I have a TokenFilter which processes a certain TermAttribute
by returning two other TermAttributes, then according to my understanding,
upon calling incrementToken() it should:
3.1) If first call, clone the TokenStream's TermAttribute in an instance
variable. Then process it and store in the TokenStream's TermAttribute the
first TA it should return.
3.2) If second call, process it again and store the second TA in the
TokenStream's TermAttribute.
That's because the consumer will call incrementToken and then getAttribute.
That getAttribute will return the TokenStream's attribute and not the
filter's. I think I've read it somewhere, but it doesn't appear in this
package.html.

Shai

 


Re: Problematic documentation in o.a.l.analysis.package.html

Posted by Shai Erera <se...@gmail.com>.
Thanks Uwe.

About (3), I use copyTo, not clone. I used the word 'clone' just out of
habit. I'll read more about captureState, but I think copyTo works fine for
me.

Abour (2), I still think it's confusing. When I read addAttribute, I get an
impression as if by calling this method, it is guaranteed for me that that
attribute will be processed somehow by the input TokenStream. And this may
not be true. There is a semantic difference between: addAttribute, and
input.addAttribute. The former means I add the attribute to *my* TS
instance, while the latter to the input TS. Even though both are the same
(as TokenFilter attributes share the same instance of TokenStream
attributes), these are sematically different. I'd expect to do the former
just to ensure this attribute is in the 'attributes' map, while the latter
to ensure the input TS will process it.

BTW, remember that whatever looks clear to you, the implementer of this, may
not be cleared to users. I still think that the current documentation may
confuse people, and adding an extra "NOTE: this does not ensure the input TS
will do anything with the passed Attribute. If need to, call
input.hasAttribute to determine if that attribute is handled by the inpu TS"
won't hurt.

Shai

On Sun, Nov 22, 2009 at 12:12 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

>  Hi Shai,
>
>
>
> Thanks for the suggestions!
>
>
>
> About your points:
>
> 1)       This is really wrong, we can easily fix it for 3.1. Lucene 3.0 is
> already in the vote phase and 2.9x is also already out.
>
> 2)       Maybe the explanation is not so good. This text comes especially
> from the 2.9 old to new TS migration. The current indexer normally uses all
> attributes and the default Token implementation implements all six (in 2.9).
> We had a lot of TokenStreams in core that used getAttribute without
> checking, if the attribute is really there. With Token/TokenWrapper as
> implementation, this was always ok, but as soon as you switched to new API
> only, the attributes were implemented by single instances, so after adding a
> TermAttribute you were not be sure, to also have a PosIncr attribute. Even
> the indexer of Lucene had that problem. So for the basic six attributes you
> can always use addAttribute(), because if you not add the attribute in the
> instantiation of the stream, all attributes are added by the indexer. This
> applies to TermAttribute, PosIncrementAttribute and OffsetAttribute (if TV
> are enabled). So if the chain does not add them, they are added by the
> indexer (consumer). Because all attributes have a default value, this is no
> problem and the indexer code is even faster (e.g. it does not make sense to
> every time check for a PositionIncrementAttribute, as reading the default
> int value of 1 is read in the same speed, even faster without the extra null
> check. And with current index format it is always tracked and indexed). If
> you have own attribute, it may correctly make sense to check with
> hasAttribute() and throw IAE. But after this check, you can also use
> addAttribute() to get the reference to the attribute (there were a
> discussion about removing getAttribute at all). getAttribute and
> addAttribute have the same speed when fetching an already existing
> attribute. So it is up to yours, how you would implement your tokenstreams,
> that are just hints for beginners. The thing with hasAttribute() and
> addAttribute() is also mentioned in the docs a few lines down at the
> “optimization” part.
>
> 3)       Attribute interfaces cannot be cloned (the Attribute interface
> has no clone()). You can only clone implementations (AttributeImpl), but it
> may happen that you not only clone the TermAttribute, but some other
> attributes also (e.g. if you use Token as impl, like 2.9 does by default for
> backwards compatinility). Because of that, attribute instances cannot be
> simply cloned, only their implementing classes. What you generally should do
> (see caching TokenFilter): use AttributeSource.captureState() to capture all
> current attributes in the filter chain. In incrementToken(), you can restore
> these captured states again into the existing attributes. Such states can be
> handles in local instance variables or Lists<State> etc.
> The second point I do not understand: Consumers always call addAttribute
> (maybe hasAttribute/getAttribute) exactly one time before consuming! So they
> only have the current attribute instance, which can never change during the
> lifetime of a TokenStream chain. Also all
> TokenFilters/TokenStreams/Tokenizers use the same instances! Because of this
> the TS chain works. If every TokenStream would have its own attributes, they
> could not communicate! The important thing is, that you cannot relay on the
> fact that a chain may contain attributes or not, because even if you created
> a the Tokenizer and one TokenFilter in the chain and checked for
> hasAttribute() in the ctor of the filter, it may happen, that the third
> TokenFilter adds a new Attribute, which is then suddenly available (but the
> second filter does not see it – ok, it may not need to know about it,
> because the Tokenizer will never set it). Because of that its always best to
> use addAttribute, if you are “interested” on an attribute. If it does not
> exist, you just get default values (like termLength()==0 for TermAttribute,
> posIncr==1,…).
>
>
>
> Hope that explains your questions.
>
>
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Sunday, November 22, 2009 9:37 AM
> *To:* java-dev@lucene.apache.org
> *Subject:* Problematic documentation in o.a.l.analysis.package.html
>
>
>
> Hi
>
> I've read the analysis package.html and I found two issues:
>
> 1) The code sample under Invoking the Analyzer is broken. It calls
> incrementToken() but inside the while it prints 'ts' (which is TokenStream)
> and then do "t = ts.next()", which no longer works. That's an easy fix, so I
> don't think a JIRA issue is needed.
>
> 2) The documentation specifies that "Even consumers of TokenStreams should
> normally call addAttribute() instead of getAttribute(), because it would not
> fail if the TokenStream does not have this Attribute". IMO this is wrong and
> will give the wrong impression about how this API should be used. What if
> the TokenStream does not care about this attribute? It will not fill it with
> any information. The example with LengthFilter which calls addAttribute
> instead of has/getAttribute is a good one regarding why you shouldn't just
> call addAttribute. LegthFilter relies on the given TokenStream to fill
> TermAttribute with some information, so that it can later filter out terms
> of length < threshold. But what if I create a LengthFilter and give it a
> TokenStream which creates just PartOfSpeechAttribute? Or output terms that
> are not TermAttribute? Obviously it would be silly for me to do it, but no
> one restricts me from doing so. LengthFilter should either document that it
> expects TermAttribute to be returned from the input TokenStream, or better
> yet, enforce it in the constructor --> if you pass a TokenStream that does
> not return TermAttribute, throw an IllegalArgumentException.
>
> But anyway, the current documentation is, IMO, wrong and may lead to wrong
> impression. I don't know if this warrants a larger issue to investigate all
> the current TokenFilters and validate the input TokenStream. In my filters,
> I enforce the existence of a certain attribute. If I've misunderstood
> something, please correct me.
>
> 3) I think it would help if there will be some documentation/example about
> how TokenFilters are expected to process an Attribute before they return it.
> For example, if I have a TokenFilter which processes a certain TermAttribute
> by returning two other TermAttributes, then according to my understanding,
> upon calling incrementToken() it should:
> 3.1) If first call, clone the TokenStream's TermAttribute in an instance
> variable. Then process it and store in the TokenStream's TermAttribute the
> first TA it should return.
> 3.2) If second call, process it again and store the second TA in the
> TokenStream's TermAttribute.
> That's because the consumer will call incrementToken and then getAttribute.
> That getAttribute will return the TokenStream's attribute and not the
> filter's. I think I've read it somewhere, but it doesn't appear in this
> package.html.
>
> Shai
>

RE: Problematic documentation in o.a.l.analysis.package.html

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Shai,

 

Thanks for the suggestions!

 

About your points: 

1)       This is really wrong, we can easily fix it for 3.1. Lucene 3.0 is
already in the vote phase and 2.9x is also already out.

2)       Maybe the explanation is not so good. This text comes especially
from the 2.9 old to new TS migration. The current indexer normally uses all
attributes and the default Token implementation implements all six (in 2.9).
We had a lot of TokenStreams in core that used getAttribute without
checking, if the attribute is really there. With Token/TokenWrapper as
implementation, this was always ok, but as soon as you switched to new API
only, the attributes were implemented by single instances, so after adding a
TermAttribute you were not be sure, to also have a PosIncr attribute. Even
the indexer of Lucene had that problem. So for the basic six attributes you
can always use addAttribute(), because if you not add the attribute in the
instantiation of the stream, all attributes are added by the indexer. This
applies to TermAttribute, PosIncrementAttribute and OffsetAttribute (if TV
are enabled). So if the chain does not add them, they are added by the
indexer (consumer). Because all attributes have a default value, this is no
problem and the indexer code is even faster (e.g. it does not make sense to
every time check for a PositionIncrementAttribute, as reading the default
int value of 1 is read in the same speed, even faster without the extra null
check. And with current index format it is always tracked and indexed). If
you have own attribute, it may correctly make sense to check with
hasAttribute() and throw IAE. But after this check, you can also use
addAttribute() to get the reference to the attribute (there were a
discussion about removing getAttribute at all). getAttribute and
addAttribute have the same speed when fetching an already existing
attribute. So it is up to yours, how you would implement your tokenstreams,
that are just hints for beginners. The thing with hasAttribute() and
addAttribute() is also mentioned in the docs a few lines down at the
"optimization" part.

3)       Attribute interfaces cannot be cloned (the Attribute interface has
no clone()). You can only clone implementations (AttributeImpl), but it may
happen that you not only clone the TermAttribute, but some other attributes
also (e.g. if you use Token as impl, like 2.9 does by default for backwards
compatinility). Because of that, attribute instances cannot be simply
cloned, only their implementing classes. What you generally should do (see
caching TokenFilter): use AttributeSource.captureState() to capture all
current attributes in the filter chain. In incrementToken(), you can restore
these captured states again into the existing attributes. Such states can be
handles in local instance variables or Lists<State> etc.
The second point I do not understand: Consumers always call addAttribute
(maybe hasAttribute/getAttribute) exactly one time before consuming! So they
only have the current attribute instance, which can never change during the
lifetime of a TokenStream chain. Also all
TokenFilters/TokenStreams/Tokenizers use the same instances! Because of this
the TS chain works. If every TokenStream would have its own attributes, they
could not communicate! The important thing is, that you cannot relay on the
fact that a chain may contain attributes or not, because even if you created
a the Tokenizer and one TokenFilter in the chain and checked for
hasAttribute() in the ctor of the filter, it may happen, that the third
TokenFilter adds a new Attribute, which is then suddenly available (but the
second filter does not see it - ok, it may not need to know about it,
because the Tokenizer will never set it). Because of that its always best to
use addAttribute, if you are "interested" on an attribute. If it does not
exist, you just get default values (like termLength()==0 for TermAttribute,
posIncr==1,.).

 

Hope that explains your questions.

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Sunday, November 22, 2009 9:37 AM
To: java-dev@lucene.apache.org
Subject: Problematic documentation in o.a.l.analysis.package.html

 

Hi

I've read the analysis package.html and I found two issues:

1) The code sample under Invoking the Analyzer is broken. It calls
incrementToken() but inside the while it prints 'ts' (which is TokenStream)
and then do "t = ts.next()", which no longer works. That's an easy fix, so I
don't think a JIRA issue is needed.

2) The documentation specifies that "Even consumers of TokenStreams should
normally call addAttribute() instead of getAttribute(), because it would not
fail if the TokenStream does not have this Attribute". IMO this is wrong and
will give the wrong impression about how this API should be used. What if
the TokenStream does not care about this attribute? It will not fill it with
any information. The example with LengthFilter which calls addAttribute
instead of has/getAttribute is a good one regarding why you shouldn't just
call addAttribute. LegthFilter relies on the given TokenStream to fill
TermAttribute with some information, so that it can later filter out terms
of length < threshold. But what if I create a LengthFilter and give it a
TokenStream which creates just PartOfSpeechAttribute? Or output terms that
are not TermAttribute? Obviously it would be silly for me to do it, but no
one restricts me from doing so. LengthFilter should either document that it
expects TermAttribute to be returned from the input TokenStream, or better
yet, enforce it in the constructor --> if you pass a TokenStream that does
not return TermAttribute, throw an IllegalArgumentException.

But anyway, the current documentation is, IMO, wrong and may lead to wrong
impression. I don't know if this warrants a larger issue to investigate all
the current TokenFilters and validate the input TokenStream. In my filters,
I enforce the existence of a certain attribute. If I've misunderstood
something, please correct me.

3) I think it would help if there will be some documentation/example about
how TokenFilters are expected to process an Attribute before they return it.
For example, if I have a TokenFilter which processes a certain TermAttribute
by returning two other TermAttributes, then according to my understanding,
upon calling incrementToken() it should:
3.1) If first call, clone the TokenStream's TermAttribute in an instance
variable. Then process it and store in the TokenStream's TermAttribute the
first TA it should return.
3.2) If second call, process it again and store the second TA in the
TokenStream's TermAttribute.
That's because the consumer will call incrementToken and then getAttribute.
That getAttribute will return the TokenStream's attribute and not the
filter's. I think I've read it somewhere, but it doesn't appear in this
package.html.

Shai