You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2008/11/26 22:36:53 UTC

Deprecation of Token constructors in 2.4

Hi

I moved to use Lucene 2.4 and noticed that some of the Token constructors
were marked deprecated. Specifically, I'm talking about the Token(String,
int, int), where the String is the word to populate the token with, and the
two ints are startOffset and endOffset respectively.
That was an amazingly convenient constructor. I understand that it is
discouraged to use it, and that the one that accepts a char[] is better, but
there are cases, during indexing, where all you have at hand is a String and
not a char[].

For example, suppose that you want to add a certain token to a document. You
can do this by adding a Field with a TokenStream, where the TS will create a
new Token(), populate it with the value to add and return it.

Before 2.4, the code was simply:
return new Token(word, start, end);

After Lucene 2.4, the code looks like this:
Token t = new Token();
t.setTermBuffer(word, 0, word.length());
t.setStartOffset(start);
t.setEndOffset(end);
return t;

Instead of a one liner, I now have to write 5 (!) lines of code whenever I
want to do something like this. And ... the fact that I can call
setTermBuffer(String, int, int) (like I do in the 2nd line of the code) does
not prevent me from using Strings at all. The only thing that the
deprecation of the constructor achieves is complicating the code developers
need to write.

IMO, there is a huge difference between removing a convenient, yet
inefficient, method, than simply document that it's discouraged to use it.
After all, if I choose to use it at my own expense, I can do it and face
whatever consequences there will be.

In fact, when I moved to use setTermBuffer, I actually introduced a bug in
my code. The reason is that setTermBuffer accepts to integers which specify
the offset and length of the internal char[] of Token, rather than my start
and end offset I used to use when I had the Token(String, int, int)
constructor. That's confusing.

And if I raise the deprecation issue, the method termText() which returns a
String was also a convenient (eventhough inefficient) method, for all kind
of purposes amongst which are debugging or printing. But not only that -
Java makes use of String in so many places, it's really hard to stay with a
char[] for long, as soon as you start involving Lucene code with other Java
data structures. So instead of calling termText() (and knowing it's
inefficient, and even document it), I now have to write new
String(t.termBuffer(), 0, t.termLength()).

I would like to ask the developers community - what is the strategy of
deprecating methods? Again, we should document when certain methods are
inefficient, rather than deprecating them, and thus forcing people to write
more cumbersome code.

A good example for a justified deprecation is IndexWriter.docCount() method,
which recommends to use maxDoc() or numDocs() (if you want to take into
account deletions). There are two reasons it's justified:
1. Replacing docCount() with maxDoc() does not complicate my code.
2. docCount() is confusing (is it maxDoc() or numDocs()?).

On that notion, the deprecation of TokenStream.next() simply forces people
who want to store the Tokens output TS to call TokenStream.next(new
Token()). Not a huge inconvenience, but IMO an unnecessary deprecation.

I'm not sure this is new stuff to you, and perhaps if was even raised on
this list before. I'm also aware of the efforts to completely change the
tokenization process in Lucene. However, I think that if we could
undeprecate some of the deprecated methods, we'd do a great service to the
developers community.

Shai

Re: Deprecation of Token constructors in 2.4

Posted by Michael McCandless <lu...@mikemccandless.com>.

This was done under LUCENE-1333, in moving Token to the "reuse" API.

The general idea is to deprecate the String based APIs (whose
performance got worse because of the switch to char[]), in favor of
the char[] APIs.

That said, I do agree it's still useful to have String based variants,
at the expense of performance, for those cases that want convenience
and don't care much about performance.

However, we can't silently cause an existing API to have worse
performance (it's a break to back-compatability).  So I think we
should instead deprecate the old method and introduce a new one with a
clear warning about the performance penalty.  EG, we did this with
termText (now deprecated in favor of term()).  This way on upgrading,
people will have to confront the fact that its performance is now
worse than before.

I'm not sure what ctor we could add that could take String, and
doesn't already exist.

Further... the changes for LUCENE-1422 (committed to 2.9-dev) make
this question moot since the whole Token class is now deprecated.

Mike

Shai Erera wrote:

> Hi
>
> I moved to use Lucene 2.4 and noticed that some of the Token  
> constructors were marked deprecated. Specifically, I'm talking about  
> the Token(String, int, int), where the String is the word to  
> populate the token with, and the two ints are startOffset and  
> endOffset respectively.
> That was an amazingly convenient constructor. I understand that it  
> is discouraged to use it, and that the one that accepts a char[] is  
> better, but there are cases, during indexing, where all you have at  
> hand is a String and not a char[].
>
> For example, suppose that you want to add a certain token to a  
> document. You can do this by adding a Field with a TokenStream,  
> where the TS will create a new Token(), populate it with the value  
> to add and return it.
>
> Before 2.4, the code was simply:
> return new Token(word, start, end);
>
> After Lucene 2.4, the code looks like this:
> Token t = new Token();
> t.setTermBuffer(word, 0, word.length());
> t.setStartOffset(start);
> t.setEndOffset(end);
> return t;
>
> Instead of a one liner, I now have to write 5 (!) lines of code  
> whenever I want to do something like this. And ... the fact that I  
> can call setTermBuffer(String, int, int) (like I do in the 2nd line  
> of the code) does not prevent me from using Strings at all. The only  
> thing that the deprecation of the constructor achieves is  
> complicating the code developers need to write.
>
> IMO, there is a huge difference between removing a convenient, yet  
> inefficient, method, than simply document that it's discouraged to  
> use it. After all, if I choose to use it at my own expense, I can do  
> it and face whatever consequences there will be.
>
> In fact, when I moved to use setTermBuffer, I actually introduced a  
> bug in my code. The reason is that setTermBuffer accepts to integers  
> which specify the offset and length of the internal char[] of Token,  
> rather than my start and end offset I used to use when I had the  
> Token(String, int, int) constructor. That's confusing.
>
> And if I raise the deprecation issue, the method termText() which  
> returns a String was also a convenient (eventhough inefficient)  
> method, for all kind of purposes amongst which are debugging or  
> printing. But not only that - Java makes use of String in so many  
> places, it's really hard to stay with a char[] for long, as soon as  
> you start involving Lucene code with other Java data structures. So  
> instead of calling termText() (and knowing it's inefficient, and  
> even document it), I now have to write new String(t.termBuffer(), 0,  
> t.termLength()).
>
> I would like to ask the developers community - what is the strategy  
> of deprecating methods? Again, we should document when certain  
> methods are inefficient, rather than deprecating them, and thus  
> forcing people to write more cumbersome code.
>
> A good example for a justified deprecation is IndexWriter.docCount()  
> method, which recommends to use maxDoc() or numDocs() (if you want  
> to take into account deletions). There are two reasons it's justified:
> 1. Replacing docCount() with maxDoc() does not complicate my code.
> 2. docCount() is confusing (is it maxDoc() or numDocs()?).
>
> On that notion, the deprecation of TokenStream.next() simply forces  
> people who want to store the Tokens output TS to call  
> TokenStream.next(new Token()). Not a huge inconvenience, but IMO an  
> unnecessary deprecation.
>
> I'm not sure this is new stuff to you, and perhaps if was even  
> raised on this list before. I'm also aware of the efforts to  
> completely change the tokenization process in Lucene. However, I  
> think that if we could undeprecate some of the deprecated methods,  
> we'd do a great service to the developers community.
>
> Shai


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org