You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by DM Smith <dm...@gmail.com> on 2008/07/11 14:32:28 UTC

Re: Token implementation

I am now looking at this in depth.
Here is what I am finding: Token and Term are tightly paired, but  
their implementation is not.

Term holds a field and a word. Both of these are Strings. So to get  
the term out of a Token and put it into a Term, one of two constructs  
is used:
t = new Term(field, token.termText());
or
String text = new String(token.termBuffer(), 0, token.termLength());
t = new Term(field, text);

Both of these are awkward. Shouldn't Term have constructors that take  
a Token? Because of reuse, these ctors will have to copy out the term  
text into it's own field.

-- DM

On May 20, 2008, at 5:21 AM, DM Smith wrote:

>
> On May 20, 2008, at 5:01 AM, Michael McCandless wrote:
>
>>
>> DM Smith wrote:
>>>
>>> On May 19, 2008, at 4:33 PM, Michael McCandless wrote:
>>>
>>>> DM Smith <dm...@gmail.com> wrote:
>>>>> Michael McCandless wrote:
>>>>>>
>>>>>> I agree the situation is not ideal, and it's confusing.
>>>>>
>>>>> My problem as a user is that I have to read the code to figure  
>>>>> out how to
>>>>> optimally use the class. The JavaDoc is a bit wanting.
>>>>
>>>> Yeah we should fix the javadocs.
>>>>
>>>>>> This comes back to LUCENE-969.
>>>>>>
>>>>>> At the time, we decided to keep both String & char[] only to  
>>>>>> avoid
>>>>>> performance cost for those analyzer chains that use String tokens
>>>>>> exclusively.
>>>>>>
>>>>>> The idea was to allow Token to keep both text or char[] and  
>>>>>> sometimes
>>>>>> both (if they are storing the same characters, as happens if
>>>>>> termBuffer() is called when it's a String being stored)
>>>>>
>>>>> When termBuffer() is called termText is always null on return.  
>>>>> This is the
>>>>> invariant of initTermBuffer() which is called directly or  
>>>>> indirectly by
>>>>> every method that touches termBuffer.
>>>>
>>>> Sorry I meant termText().
>>>>
>>>>> It is only after calling termText() that one could have both.  
>>>>> The only
>>>>> advantage I see here is that calling it twice without any  
>>>>> intervening call
>>>>> to a termBuffer method would it return the same physical string.
>>>>
>>>> Right.
>>>>
>>>>> After calling setTermText(newText), termBuffer is set to null.
>>>>>
>>>>> I presume the purpose of a filter is to get the token content,  
>>>>> and if
>>>>> modified, set the new content. If so, the result of the setXXX  
>>>>> will be that
>>>>> either termText or termBuffer will be null.
>>>>
>>>> Right, though, if there's no change, and the next filter in the  
>>>> chain
>>>> calls termText(), we save constructing a new String by caching it.
>>>>
>>>>>> Then, in 3.0, we would make the change you are proposing (to only
>>>>>> store char[] internally).  That was the plan, anyway.   
>>>>>> Accelerating
>>>>>> this plan (to store only char[] today) is compelling, but I worry
>>>>>> about the performance hit to legacy analyzer chains...
>>>>>
>>>>> I wonder whether it is all that significant an issue. Today,  
>>>>> some of the
>>>>> Lucene analyzers have been migrated to using char[], while  
>>>>> others, notably
>>>>> contrib, continue to use text.
>>>>>
>>>>> IMHO: Prior to char[], the text was replaceable, but not  
>>>>> modifiable. From a
>>>>> practical perspective, Token reuse minimized the cost of  
>>>>> construction, but
>>>>> not much else. The performance of a Token was predictable, but  
>>>>> the filter
>>>>> was potentially sub-optimal. With char[] and supporting methods,  
>>>>> the text
>>>>> became modifiable, too.
>>>>>
>>>>> When a filter calls setTermText or setTermBuffer, it does not  
>>>>> know how the
>>>>> consumer of the token will work. It could be that it stores it  
>>>>> with
>>>>> setTermText and the next filter calls termBuffer().
>>>>>
>>>>> I may not understand this correctly, but it seems to me that the  
>>>>> following
>>>>> is plausible given a filter chain of Lucene provided filters  
>>>>> (including contrib)
>>>>> If we have a token filter chain of A -> B -> C, which uses  
>>>>> next() in any
>>>>> part of the chain, the flow of a reusable Token is stopped. A  
>>>>> given filter
>>>>> may cache a Token and reuse it. So consider the following  
>>>>> scenario:
>>>>> A overrides next(Token) and reuses the token via char[]
>>>>> B overrides next() and has a cached Token and updates text.
>>>>> C overrides next(Token) and reuses the token via char[].
>>>>>
>>>>> First run:
>>>>> After A is done, the termText in the supplied Token is null and  
>>>>> termBuffer
>>>>> has a value.
>>>>>
>>>>> B provides it's own token so it is not influenced by A.
>>>>>
>>>>> C is given the token that B returns, because the caller is  
>>>>> effectively using
>>>>> "token = input.next(token)", but because the token has text, a  
>>>>> performance
>>>>> hit is taken to put it into char[]. Both text and char[] start  
>>>>> out the same,
>>>>> but because char[] is changed, termText is set to null.
>>>>>
>>>>> Second run:
>>>>> A starts with a Token with a char[] because it is reusing the  
>>>>> token from the
>>>>> last run or because it is using a localToken from the caller. If  
>>>>> it is a
>>>>> localToken, then the scenario is as above. But if it is the end  
>>>>> result of
>>>>> the first run, then A is re-using the token that is cached in B.  
>>>>> Since C
>>>>> last modified it, it is char[].
>>>>>
>>>>> B uses its cached Token, but it was modified by A to be char[]  
>>>>> with null
>>>>> text. Now B takes a performance hit as it creates a new String.
>>>>>
>>>>> C is as in the first run.
>>>>>
>>>>> Another scenario:
>>>>> A, B and C are all legacy. This would only be filter chains that  
>>>>> are not
>>>>> provided by core Lucene as the core filter chains have been  
>>>>> migrated. This
>>>>> would be a performance hit.
>>>>
>>>> Why is this a performance hit?  If they are all legacy they would  
>>>> all
>>>> implemented the non-reuse next() API, and would use setTermText,  
>>>> and
>>>> no conversion to char[] would happen (except inside  
>>>> DocumentsWriter)?
>>>
>>> I was thinking faster than I was typing. I meant to say "not be a  
>>> performance hit".
>>
>> Ahh, ok.
>>
>>>>
>>>>
>>>>> A last scenario:
>>>>> A, B and C are all char[]. This would not take a performance hit.
>>>>>
>>>>> It seems to me that in a mixed chain, that there will always be a
>>>>> performance hit.
>>>>
>>>> Right, though since we cache the String we should save on multiple
>>>> calls to termText().  And in a non-mixed chain (all new or all  
>>>> old) I
>>>> think there wouldn't be a hit.
>>>>
>>>>> But my guess is that maintaining termBuffer once used would
>>>>> be a good thing.
>>>>
>>>> Interesting...
>>>>
>>>>> So, a modified suggestion to maintain the performance but  
>>>>> improve the first
>>>>> scenario. Do you see any problem with the following?
>>>>>
>>>>> If termBuffer is used in a token, then it is maintained and  
>>>>> never set to  null.
>>>>> Note also that resizeTermBuffer(size) maintains the termBuffer.  
>>>>> That is, it
>>>>> copies the text when the array is grown. When it is known that  
>>>>> it is going
>>>>> to be slammed this is unnecessary. One can implement a helper  
>>>>> function that
>>>>> merely grows the array. There are a couple of places this
>>>>>
>>>>> Thus setTokenText would be something like:
>>>>> public void setTermText(String text) {
>>>>> termText = text;
>>>>> if (termBuffer != null) {
>>>>>  growTermBuffer(termText.length());
>>>>>  termText.getChars(0, termText.length(), termBuffer, 0);
>>>>> }
>>>>> }
>>>>>
>>>>> A possible implementation to grow the array would be:
>>>>> private void growTermBuffer(int newSize)
>>>>> {
>>>>> // determine the best size
>>>>> if (newSize < MIN_BUFFER_SIZE) {
>>>>> newSize = MIN_BUFFER_SIZE;
>>>>> }
>>>>>
>>>>> // if the buffer exists and is too small, then determine a  
>>>>> better size.
>>>>> // this is the current doubling algorithm. it could be better.
>>>>> int tbLength = termBuffer == null ? 0 : termBuffer.length;
>>>>> if (tbLength > 0 && newSize > tbLength) {
>>>>> int size = tbLength;
>>>>> while (size < newSize) {
>>>>>   size *= 2;
>>>>> }
>>>>> newSize = size;
>>>>> }
>>>>>
>>>>> // Check to see if the buffer needs to be resized
>>>>> if (newSize > tbLength)
>>>>> {
>>>>> termBuffer = new char[newSize];
>>>>> }
>>>>> }
>>>>
>>>> That looks good.  Though, if Token is 100% re-used (full chain is
>>>> "new") then I would expect growing the char[] to be very low cost
>>>> (happens very few times).
>>>
>>> Without peeking at your Python sample below ;) the doubling  
>>> algorithm is costly as it over allocates too much. But it has the  
>>> nice behavior that reallocations are fewer.
>>>
>>> But, now peeking ahead, a less aggressive allocation policy would  
>>> cause more allocations and the array copy becomes more important.  
>>> Though not significantly. But in a mixed environment with some  
>>> sharing, it could be magnified.
>>>
>>>
>>> I didn't dig into it, but in the core, are reusable Tokens per  
>>> document or per DocumentWriter. If it is the former then the copy  
>>> becomes more significant. If it is the latter, then it is almost  
>>> moot, as you point out.
>>
>> It's actually per DocumentsWriter per thread per field.
>
> This is excellent.
>
>>
>>
>>>>
>>>>
>>>>> More below....
>>>>>>
>>>>>> More responses below:
>>>>>>
>>>>>> DM Smith <dm...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> I think the Token implementation as it stands can use some  
>>>>>>> improvement and
>>>>>>> I'd be willing to do it. I'd like some input, though.  
>>>>>>> Especially because it
>>>>>>> is core to Lucene.
>>>>>>>
>>>>>>> I've been working on eliminating deprecations from my user  
>>>>>>> code and I ran
>>>>>>> across Token.getText() as being deprecated.
>>>>>>>
>>>>>>> This is not about my code, but the code in Token.
>>>>>>>
>>>>>>> In Token, it allows one of two representations of the actual  
>>>>>>> token, either
>>>>>>> an immutable String, or a mutable char[]. One can flip back  
>>>>>>> and forth
>>>>>>> between these all to easily.
>>>>>>>
>>>>>>> termText() is deprecated so termBuffer() is suggested as a  
>>>>>>> replacement.
>>>>>>>
>>>>>>> Calling termBuffer() will potentially copy the text out of the  
>>>>>>> String
>>>>>>> termText and into the newly created buffer and return it.
>>>>>>>
>>>>>>> Calling setTermText(str), which is not deprecated, will drop  
>>>>>>> the buffer and
>>>>>>> save the str in termText.
>>>>>>>
>>>>>>> It appears that the invariant that is trying to be established  
>>>>>>> is
>>>>>>> either termText or termBuffer holds the content, but not both.
>>>>>>> However, termBuffer() potentially violates this by loading  
>>>>>>> termText
>>>>>>> with the termBuffer, but not nulling out the buffer.
>>>>>>>
>>>>>>
>>>>>> Actually, both are allowed to be set, as long as they are the  
>>>>>> same.
>>>>>> So termBuffer() is allowed to leave both non-null.
>>>>>>
>>>>>
>>>>> I'm not sure of the advantage.
>>>>
>>>> Covered above (mixed chains).
>>>>
>>>>>>> Also, in my code, I am not manipulating char[] so getting the  
>>>>>>> buffer, I need
>>>>>>> to immediately convert it to a string to process it. And then  
>>>>>>> when I'm done,
>>>>>>> I have a String possibly of some other length. To stuff it  
>>>>>>> back into the
>>>>>>> termBuffer, requires a call to:
>>>>>>> setTermBuffer(s.toCharArray(), o, s.length())
>>>>>>
>>>>>> It would be better to call Token.resizeTermBuffer(...), then
>>>>>> s.getChars into the Token's term buffer (saves a buffer copy).
>>>>>
>>>>> Many thanks. I saw the comment, but missed getChars. This is  
>>>>> better, but
>>>>> even better would be growTermBuffer (above) since resizeTermBuffer
>>>>> potentially has an unnecessary copy.
>>>>
>>>> True, but remember growing should be very rare in 100% reuse  
>>>> (new) case.
>>>
>>> I'm presuming then that the DocumentWriter is the holder of the  
>>> one reusable Token.
>>>
>>>>
>>>>
>>>>>>> I was looking at this in light of TokenFilter's next(Token)  
>>>>>>> method and how
>>>>>>> it was being used. In looking at the contrib filters, they  
>>>>>>> have not been
>>>>>>> modified. Further, most of them, if they work with the content  
>>>>>>> analysis and
>>>>>>> generation, do their work in strings. Some of these appear to  
>>>>>>> be good
>>>>>>> candidates for using char[] rather than strings, such as the  
>>>>>>> NGram filter.
>>>>>>> But others look like they'd just as well remain with String  
>>>>>>> manipulation.
>>>>>>
>>>>>> It would be great to upgrade all contrib filters to use the re- 
>>>>>> use APIs.
>>>>>
>>>>> I'll be happy to work toward that end. I think it affects my  
>>>>> performance
>>>>> today.
>>>>
>>>> That would be great :)
>>>>
>>>>>>> I'd like to suggest that internally, that Token be changed to  
>>>>>>> only use
>>>>>>> char[] termBuffer and eliminate termText.
>>>>>>
>>>>>> The question is what performance cost we are incurring eg on the
>>>>>> contrib (& other) sources/filters?  Every time setTermText is  
>>>>>> called,
>>>>>> we copy out the chars (instead of holding a reference to the  
>>>>>> String).
>>>>>> Every time getText() is called we create a new String(...) from  
>>>>>> the
>>>>>> char[].  I think it's potentially a high cost, and so maybe we  
>>>>>> should
>>>>>> wait until 3.0 when we drop the deprecated APIs?
>>>>>
>>>>> See above. I'll concede that. But I think that once termBuffer  
>>>>> is used
>>>>> because of a mixed chain, it should be retained.
>>>>
>>>> This may cause problems if a core Tokenizer is used to produce the
>>>> tokens, but then a big chain of old (non-reuse) filters is used  
>>>> after
>>>> that?
>>>
>>> If the reusable token is per DocumentWriter and the chain is per  
>>> Document, then it is an advantage to retain it. he buffer is there  
>>> for the next Document, likely big enough that it does not need to  
>>> be allocated again.
>>
>> Well, remember that a single non-reuse filter in the chain "breaks"  
>> the sharing, going backwards.  IE the new token source will see a  
>> different Token with every next() call.  So the buffer doesn't get  
>> reused.  So I think in this case we would spend alot of extra time  
>> converting the Strings in the non-reuse filters back to char[].
>
> IIRC, I saw somewhere, in an outer driving loop something like the  
> following:
>
> result = input.next(localToken)
>
> If my memory is correct, then reuse is re-established each time  
> through the loop. That localToken is presumably using char[] to set  
> initial content. Or at least once all core and contrib is converted  
> to work with it.
>
>
>>
>>
>>>>
>>>>
>>>>>>> And also, that termText be restored as not deprecated.
>>>>>>
>>>>>> It made me nervous keeping this method because it looks like it  
>>>>>> should
>>>>>> be cheap to call, and in the past it was very cheap to call.   
>>>>>> But,
>>>>>> maybe we could keep it, if we mark very very clearly in the  
>>>>>> javadocs
>>>>>> the performance cost you incur by using this method (it makes a  
>>>>>> new
>>>>>> String() every time)?
>>>>>>
>>>>>>
>>>>>>> But, in TokenFilter, next() should be deprecated, IMHO.
>>>>>>
>>>>>> I think this is a good idea.  After all if people don't want to  
>>>>>> bother
>>>>>> using the passed in Token, they are still allowed to return a new
>>>>>> one.
>>>>>
>>>>> Should the constructors in Token that take a String be  
>>>>> deprecated, too? The
>>>>> comments seem to suggest that.
>>>>
>>>> Actually I think they should.  We should strongly encourage the  
>>>> re-use
>>>> APIs.
>>>>
>>>>>>> I have also used a better algorithm than doubling for resizing  
>>>>>>> an
>>>>>>> array. I'd have to hunt for it.
>>>>>>
>>>>>> That would be great!
>>>>>
>>>>> I'm looking, but still haven't found it. :)
>>>>>
>>>>> I'm looking into digging into this one.
>>>>
>>>> Python has an interesting approach, for its list type.  Here's a
>>>> snippet from listobject.c from Python's sources:
>>>>
>>>> 	/* This over-allocates proportional to the list size, making room
>>>> 	 * for additional growth.  The over-allocation is mild, but is
>>>> 	 * enough to give linear-time amortized behavior over a long
>>>> 	 * sequence of appends() in the presence of a poorly-performing
>>>> 	 * system realloc().
>>>> 	 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72,  
>>>> 88, ...
>>>> 	 */
>>>> 	new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;
>>>
>>> This is a good one. It is computationally simple and it is not too  
>>> aggressive.
>>>
>>>
>>>>
>>>>
>>>> I'm sure Perl has something interesting too :)  But I think this is
>>>> probably overkill for us ...
>>>>
>>>>> Based on the outcome of this discussion, would this be one Jira  
>>>>> issue?
>>>>> Reopening LUCENE-969? A separate issue for the contrib changes?
>>>>
>>>> I would say open a new issue?
>>>
>>> OK. I'm on vacation for the next week and I don't know what kind  
>>> of connectivity I'll have. So it might wait a bit.
>>>
>>>>
>>>>
>>>>> Since this would not be an API change, would there need to be  
>>>>> changes to
>>>>> test cases? If so, what would you suggest?
>>>>
>>>> I think the existing test cases are likely sufficient though I  
>>>> always
>>>> love adding new ones ;)
>>>
>>> Me too ;)
>>>
>>> -- DM
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by Karl Wettin <ka...@gmail.com>.
15 jul 2008 kl. 11.25 skrev Hiroaki Kawai:

> DM Smith <dm...@gmail.com> wrote:
>> In core and contrib, there are times where more than one token is  
>> used
>> at a time. In a few places they are put into collections.
>>
>> So a singleton wouldn't work.
>
> I'd suggest a factory method to get a thread local singleton.
> If one want to new a token, it may be possible, but it is not
> encouraged to do so.
> If there is more than one token, that is not good because
> we are focusing on reducing new object. Such code might be
> refactored to be more simple.

Some token streams require input and output buffers: shingles,  
synonyms, et c.


              karl



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by Hiroaki Kawai <ka...@apache.org>.
DM Smith <dm...@gmail.com> wrote:
> Hiroaki Kawai wrote:
> > DM Smith <dm...@gmail.com> wrote:
> >   
> >> On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
> >>
> >>     
> >>> Another suggestion from me:
> >>> How about making token object as an singleton?
> >>>       
> >> Would that work for a multi-threaded application?
> >>     
> >
> > Of cource. We should make that thread local singleton.
> >   
> 
> In core and contrib, there are times where more than one token is used 
> at a time. In a few places they are put into collections.
> 
> So a singleton wouldn't work.

I'd suggest a factory method to get a thread local singleton.
If one want to new a token, it may be possible, but it is not 
encouraged to do so. 
If there is more than one token, that is not good because 
we are focusing on reducing new object. Such code might be
refactored to be more simple.




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by DM Smith <dm...@gmail.com>.
Hiroaki Kawai wrote:
> DM Smith <dm...@gmail.com> wrote:
>   
>> On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
>>
>>     
>>> Another suggestion from me:
>>> How about making token object as an singleton?
>>>       
>> Would that work for a multi-threaded application?
>>     
>
> Of cource. We should make that thread local singleton.
>   

In core and contrib, there are times where more than one token is used 
at a time. In a few places they are put into collections.

So a singleton wouldn't work.

>
>   
>>>
>>>       
>>>> Maybe we should un-deprecate the termText() method but add javadocs
>>>> explaining that for better performance you should use the char[]  
>>>> reuse
>>>> methods instead?
>>>>
>>>> Mike
>>>>
>>>> DM Smith wrote:
>>>>
>>>>         
>>>>> Michael McCandless wrote:
>>>>>           
>>>>>> DM Smith wrote:
>>>>>>
>>>>>>             
>>>>>>> Shouldn't Term have constructors that take a Token?
>>>>>>>               
>>>>>> I think that makes sense, though normally Token appears during
>>>>>> analysis and Term during searching (I think?) -- how often would
>>>>>> you need to make a Term from a Token?
>>>>>>
>>>>>>             
>>>>> The problem I'm addressing is that tokens are used in contexts that
>>>>> need String and not char[].
>>>>> The call to the deprecated
>>>>> String termText = token.termText();
>>>>> needs to be replaced with:
>>>>> String termText = new String(token.termBuffer(), 0,
>>>>> token.termLength());
>>>>>
>>>>> There are over 170 calls to token.termText(), each of these places
>>>>> have to be modified. In some, perhaps many, of these cases it may be
>>>>> possible to use char[] directly to get a performance gain.
>>>>>
>>>>> In the case of Term changing it to work with char[] buffer, int
>>>>> start, int length, does not seem quite right. I think the ripple
>>>>> would keep getting bigger. But logically, the Term's text is the
>>>>> text of a Token.
>>>>>
>>>>> To me it makes sense to have a method that returns the token as a
>>>>> String, but that method is deprecated and the suggested replacement
>>>>> is to directly use the buffer. So this leads to the above construct.
>>>>> Perhaps it would be good to add a new method and document that as
>>>>> one of two replacements.
>>>>> public String term() {
>>>>> return termText != null ? termText : new String(token.termBuffer(),
>>>>> 0, token.termLength());
>>>>> }
>>>>>
>>>>> Here is an example from QueryParser that has 5 instances, each
>>>>> calling the deprecated t.termText() method. In this example, there
>>>>> is the construction of a query from a token stream.
>>>>> Each of the problem lines are of the pattern:
>>>>> TermQuery currentQuery = new TermQuery(new Term(field,
>>>>> t.termText()));
>>>>>
>>>>> To remove the deprecated call to t.termText(), the Token's buffer
>>>>> needs to be marshalled with something like:
>>>>> String termText = new String(token.termBuffer(), 0,
>>>>> token.termLength());
>>>>> TermQuery currentQuery = new TermQuery(new Term(field, termText)));
>>>>>
>>>>> /**
>>>>> * @exception ParseException throw in overridden method to disallow
>>>>> */
>>>>> protected Query getFieldQuery(String field, String queryText)
>>>>> throws ParseException {
>>>>>  // Use the analyzer to get all the tokens, and then build a
>>>>> TermQuery,
>>>>>  // PhraseQuery, or nothing based on the term count
>>>>>
>>>>>  TokenStream source = analyzer.tokenStream(field, new
>>>>> StringReader(queryText));
>>>>>  Vector v = new Vector();
>>>>>  org.apache.lucene.analysis.Token t;
>>>>>  int positionCount = 0;
>>>>>  boolean severalTokensAtSamePosition = false;
>>>>>
>>>>>  while (true) {
>>>>>    try {
>>>>>      t = source.next();
>>>>>    }
>>>>>    catch (IOException e) {
>>>>>      t = null;
>>>>>    }
>>>>>    if (t == null)
>>>>>      break;
>>>>>    v.addElement(t);
>>>>>    if (t.getPositionIncrement() != 0)
>>>>>      positionCount += t.getPositionIncrement();
>>>>>    else
>>>>>      severalTokensAtSamePosition = true;
>>>>>  }
>>>>>  try {
>>>>>    source.close();
>>>>>  }
>>>>>  catch (IOException e) {
>>>>>    // ignore
>>>>>  }
>>>>>
>>>>>  if (v.size() == 0)
>>>>>    return null;
>>>>>  else if (v.size() == 1) {
>>>>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>>>>    return new TermQuery(new Term(field, t.termText()));
>>>>>  } else {
>>>>>    if (severalTokensAtSamePosition) {
>>>>>      if (positionCount == 1) {
>>>>>        // no phrase query:
>>>>>        BooleanQuery q = new BooleanQuery(true);
>>>>>        for (int i = 0; i < v.size(); i++) {
>>>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>>          TermQuery currentQuery = new TermQuery(
>>>>>              new Term(field, t.termText()));
>>>>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>>>>        }
>>>>>        return q;
>>>>>      }
>>>>>      else {
>>>>>        // phrase query:
>>>>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
>>>>>        mpq.setSlop(phraseSlop);
>>>>>        List multiTerms = new ArrayList();
>>>>>        int position = -1;
>>>>>        for (int i = 0; i < v.size(); i++) {
>>>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>>          if (t.getPositionIncrement() > 0 && multiTerms.size() >  
>>>>> 0) {
>>>>>            if (enablePositionIncrements) {
>>>>>              mpq.add((Term[])multiTerms.toArray(new
>>>>> Term[0]),position);
>>>>>            } else {
>>>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>>>            }
>>>>>            multiTerms.clear();
>>>>>          }
>>>>>          position += t.getPositionIncrement();
>>>>>          multiTerms.add(new Term(field, t.termText()));
>>>>>        }
>>>>>        if (enablePositionIncrements) {
>>>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>>>        } else {
>>>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>>>        }
>>>>>        return mpq;
>>>>>      }
>>>>>    }
>>>>>    else {
>>>>>      PhraseQuery pq = new PhraseQuery();
>>>>>      pq.setSlop(phraseSlop);
>>>>>      int position = -1;
>>>>>      for (int i = 0; i < v.size(); i++) {
>>>>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>>        if (enablePositionIncrements) {
>>>>>          position += t.getPositionIncrement();
>>>>>          pq.add(new Term(field, t.termText()),position);
>>>>>        } else {
>>>>>          pq.add(new Term(field, t.termText()));
>>>>>        }
>>>>>      }
>>>>>      return pq;
>>>>>    }
>>>>>  }
>>>>> }
>>>>>
>>>>>
>>>>> Here is an example that works around the deprecated code:
>>>>> public void testShingleAnalyzerWrapperPhraseQuery() throws  
>>>>> Exception {
>>>>>  Analyzer analyzer = new ShingleAnalyzerWrapper(new
>>>>> WhitespaceAnalyzer(), 2);
>>>>>  searcher = setUpSearcher(analyzer);
>>>>>
>>>>>  PhraseQuery q = new PhraseQuery();
>>>>>
>>>>>  TokenStream ts = analyzer.tokenStream("content",
>>>>>                                        new StringReader("this
>>>>> sentence"));
>>>>>  Token token;
>>>>>  int j = -1;
>>>>>  while ((token = ts.next()) != null) {
>>>>>    j += token.getPositionIncrement();
>>>>>    String termText = new String(token.termBuffer(), 0,
>>>>> token.termLength());
>>>>>    q.add(new Term("content", termText), j);
>>>>>  }
>>>>>
>>>>>  Hits hits = searcher.search(q);
>>>>>  int[] ranks = new int[] { 0 };
>>>>>  compareRanks(hits, ranks);
>>>>> }
>>>>>
>>>>> -- DM
>>>>>
>>>>>           


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by Hiroaki Kawai <ka...@apache.org>.
DM Smith <dm...@gmail.com> wrote:
> On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
> 
> > Another suggestion from me:
> > How about making token object as an singleton?
> 
> Would that work for a multi-threaded application?

Of cource. We should make that thread local singleton.


> >
> >
> >
> >> Maybe we should un-deprecate the termText() method but add javadocs
> >> explaining that for better performance you should use the char[]  
> >> reuse
> >> methods instead?
> >>
> >> Mike
> >>
> >> DM Smith wrote:
> >>
> >>> Michael McCandless wrote:
> >>>>
> >>>> DM Smith wrote:
> >>>>
> >>>>> Shouldn't Term have constructors that take a Token?
> >>>>
> >>>> I think that makes sense, though normally Token appears during
> >>>> analysis and Term during searching (I think?) -- how often would
> >>>> you need to make a Term from a Token?
> >>>>
> >>> The problem I'm addressing is that tokens are used in contexts that
> >>> need String and not char[].
> >>> The call to the deprecated
> >>> String termText = token.termText();
> >>> needs to be replaced with:
> >>> String termText = new String(token.termBuffer(), 0,
> >>> token.termLength());
> >>>
> >>> There are over 170 calls to token.termText(), each of these places
> >>> have to be modified. In some, perhaps many, of these cases it may be
> >>> possible to use char[] directly to get a performance gain.
> >>>
> >>> In the case of Term changing it to work with char[] buffer, int
> >>> start, int length, does not seem quite right. I think the ripple
> >>> would keep getting bigger. But logically, the Term's text is the
> >>> text of a Token.
> >>>
> >>> To me it makes sense to have a method that returns the token as a
> >>> String, but that method is deprecated and the suggested replacement
> >>> is to directly use the buffer. So this leads to the above construct.
> >>> Perhaps it would be good to add a new method and document that as
> >>> one of two replacements.
> >>> public String term() {
> >>> return termText != null ? termText : new String(token.termBuffer(),
> >>> 0, token.termLength());
> >>> }
> >>>
> >>> Here is an example from QueryParser that has 5 instances, each
> >>> calling the deprecated t.termText() method. In this example, there
> >>> is the construction of a query from a token stream.
> >>> Each of the problem lines are of the pattern:
> >>> TermQuery currentQuery = new TermQuery(new Term(field,
> >>> t.termText()));
> >>>
> >>> To remove the deprecated call to t.termText(), the Token's buffer
> >>> needs to be marshalled with something like:
> >>> String termText = new String(token.termBuffer(), 0,
> >>> token.termLength());
> >>> TermQuery currentQuery = new TermQuery(new Term(field, termText)));
> >>>
> >>> /**
> >>> * @exception ParseException throw in overridden method to disallow
> >>> */
> >>> protected Query getFieldQuery(String field, String queryText)
> >>> throws ParseException {
> >>>  // Use the analyzer to get all the tokens, and then build a
> >>> TermQuery,
> >>>  // PhraseQuery, or nothing based on the term count
> >>>
> >>>  TokenStream source = analyzer.tokenStream(field, new
> >>> StringReader(queryText));
> >>>  Vector v = new Vector();
> >>>  org.apache.lucene.analysis.Token t;
> >>>  int positionCount = 0;
> >>>  boolean severalTokensAtSamePosition = false;
> >>>
> >>>  while (true) {
> >>>    try {
> >>>      t = source.next();
> >>>    }
> >>>    catch (IOException e) {
> >>>      t = null;
> >>>    }
> >>>    if (t == null)
> >>>      break;
> >>>    v.addElement(t);
> >>>    if (t.getPositionIncrement() != 0)
> >>>      positionCount += t.getPositionIncrement();
> >>>    else
> >>>      severalTokensAtSamePosition = true;
> >>>  }
> >>>  try {
> >>>    source.close();
> >>>  }
> >>>  catch (IOException e) {
> >>>    // ignore
> >>>  }
> >>>
> >>>  if (v.size() == 0)
> >>>    return null;
> >>>  else if (v.size() == 1) {
> >>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
> >>>    return new TermQuery(new Term(field, t.termText()));
> >>>  } else {
> >>>    if (severalTokensAtSamePosition) {
> >>>      if (positionCount == 1) {
> >>>        // no phrase query:
> >>>        BooleanQuery q = new BooleanQuery(true);
> >>>        for (int i = 0; i < v.size(); i++) {
> >>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>          TermQuery currentQuery = new TermQuery(
> >>>              new Term(field, t.termText()));
> >>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
> >>>        }
> >>>        return q;
> >>>      }
> >>>      else {
> >>>        // phrase query:
> >>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
> >>>        mpq.setSlop(phraseSlop);
> >>>        List multiTerms = new ArrayList();
> >>>        int position = -1;
> >>>        for (int i = 0; i < v.size(); i++) {
> >>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>          if (t.getPositionIncrement() > 0 && multiTerms.size() >  
> >>> 0) {
> >>>            if (enablePositionIncrements) {
> >>>              mpq.add((Term[])multiTerms.toArray(new
> >>> Term[0]),position);
> >>>            } else {
> >>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >>>            }
> >>>            multiTerms.clear();
> >>>          }
> >>>          position += t.getPositionIncrement();
> >>>          multiTerms.add(new Term(field, t.termText()));
> >>>        }
> >>>        if (enablePositionIncrements) {
> >>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
> >>>        } else {
> >>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >>>        }
> >>>        return mpq;
> >>>      }
> >>>    }
> >>>    else {
> >>>      PhraseQuery pq = new PhraseQuery();
> >>>      pq.setSlop(phraseSlop);
> >>>      int position = -1;
> >>>      for (int i = 0; i < v.size(); i++) {
> >>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >>>        if (enablePositionIncrements) {
> >>>          position += t.getPositionIncrement();
> >>>          pq.add(new Term(field, t.termText()),position);
> >>>        } else {
> >>>          pq.add(new Term(field, t.termText()));
> >>>        }
> >>>      }
> >>>      return pq;
> >>>    }
> >>>  }
> >>> }
> >>>
> >>>
> >>> Here is an example that works around the deprecated code:
> >>> public void testShingleAnalyzerWrapperPhraseQuery() throws  
> >>> Exception {
> >>>  Analyzer analyzer = new ShingleAnalyzerWrapper(new
> >>> WhitespaceAnalyzer(), 2);
> >>>  searcher = setUpSearcher(analyzer);
> >>>
> >>>  PhraseQuery q = new PhraseQuery();
> >>>
> >>>  TokenStream ts = analyzer.tokenStream("content",
> >>>                                        new StringReader("this
> >>> sentence"));
> >>>  Token token;
> >>>  int j = -1;
> >>>  while ((token = ts.next()) != null) {
> >>>    j += token.getPositionIncrement();
> >>>    String termText = new String(token.termBuffer(), 0,
> >>> token.termLength());
> >>>    q.add(new Term("content", termText), j);
> >>>  }
> >>>
> >>>  Hits hits = searcher.search(q);
> >>>  int[] ranks = new int[] { 0 };
> >>>  compareRanks(hits, ranks);
> >>> }
> >>>
> >>> -- DM
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by DM Smith <dm...@gmail.com>.
On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:

> Another suggestion from me:
> How about making token object as an singleton?

Would that work for a multi-threaded application?

>
>
>
>> Maybe we should un-deprecate the termText() method but add javadocs
>> explaining that for better performance you should use the char[]  
>> reuse
>> methods instead?
>>
>> Mike
>>
>> DM Smith wrote:
>>
>>> Michael McCandless wrote:
>>>>
>>>> DM Smith wrote:
>>>>
>>>>> Shouldn't Term have constructors that take a Token?
>>>>
>>>> I think that makes sense, though normally Token appears during
>>>> analysis and Term during searching (I think?) -- how often would
>>>> you need to make a Term from a Token?
>>>>
>>> The problem I'm addressing is that tokens are used in contexts that
>>> need String and not char[].
>>> The call to the deprecated
>>> String termText = token.termText();
>>> needs to be replaced with:
>>> String termText = new String(token.termBuffer(), 0,
>>> token.termLength());
>>>
>>> There are over 170 calls to token.termText(), each of these places
>>> have to be modified. In some, perhaps many, of these cases it may be
>>> possible to use char[] directly to get a performance gain.
>>>
>>> In the case of Term changing it to work with char[] buffer, int
>>> start, int length, does not seem quite right. I think the ripple
>>> would keep getting bigger. But logically, the Term's text is the
>>> text of a Token.
>>>
>>> To me it makes sense to have a method that returns the token as a
>>> String, but that method is deprecated and the suggested replacement
>>> is to directly use the buffer. So this leads to the above construct.
>>> Perhaps it would be good to add a new method and document that as
>>> one of two replacements.
>>> public String term() {
>>> return termText != null ? termText : new String(token.termBuffer(),
>>> 0, token.termLength());
>>> }
>>>
>>> Here is an example from QueryParser that has 5 instances, each
>>> calling the deprecated t.termText() method. In this example, there
>>> is the construction of a query from a token stream.
>>> Each of the problem lines are of the pattern:
>>> TermQuery currentQuery = new TermQuery(new Term(field,
>>> t.termText()));
>>>
>>> To remove the deprecated call to t.termText(), the Token's buffer
>>> needs to be marshalled with something like:
>>> String termText = new String(token.termBuffer(), 0,
>>> token.termLength());
>>> TermQuery currentQuery = new TermQuery(new Term(field, termText)));
>>>
>>> /**
>>> * @exception ParseException throw in overridden method to disallow
>>> */
>>> protected Query getFieldQuery(String field, String queryText)
>>> throws ParseException {
>>>  // Use the analyzer to get all the tokens, and then build a
>>> TermQuery,
>>>  // PhraseQuery, or nothing based on the term count
>>>
>>>  TokenStream source = analyzer.tokenStream(field, new
>>> StringReader(queryText));
>>>  Vector v = new Vector();
>>>  org.apache.lucene.analysis.Token t;
>>>  int positionCount = 0;
>>>  boolean severalTokensAtSamePosition = false;
>>>
>>>  while (true) {
>>>    try {
>>>      t = source.next();
>>>    }
>>>    catch (IOException e) {
>>>      t = null;
>>>    }
>>>    if (t == null)
>>>      break;
>>>    v.addElement(t);
>>>    if (t.getPositionIncrement() != 0)
>>>      positionCount += t.getPositionIncrement();
>>>    else
>>>      severalTokensAtSamePosition = true;
>>>  }
>>>  try {
>>>    source.close();
>>>  }
>>>  catch (IOException e) {
>>>    // ignore
>>>  }
>>>
>>>  if (v.size() == 0)
>>>    return null;
>>>  else if (v.size() == 1) {
>>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>>    return new TermQuery(new Term(field, t.termText()));
>>>  } else {
>>>    if (severalTokensAtSamePosition) {
>>>      if (positionCount == 1) {
>>>        // no phrase query:
>>>        BooleanQuery q = new BooleanQuery(true);
>>>        for (int i = 0; i < v.size(); i++) {
>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>          TermQuery currentQuery = new TermQuery(
>>>              new Term(field, t.termText()));
>>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>>        }
>>>        return q;
>>>      }
>>>      else {
>>>        // phrase query:
>>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
>>>        mpq.setSlop(phraseSlop);
>>>        List multiTerms = new ArrayList();
>>>        int position = -1;
>>>        for (int i = 0; i < v.size(); i++) {
>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>          if (t.getPositionIncrement() > 0 && multiTerms.size() >  
>>> 0) {
>>>            if (enablePositionIncrements) {
>>>              mpq.add((Term[])multiTerms.toArray(new
>>> Term[0]),position);
>>>            } else {
>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>            }
>>>            multiTerms.clear();
>>>          }
>>>          position += t.getPositionIncrement();
>>>          multiTerms.add(new Term(field, t.termText()));
>>>        }
>>>        if (enablePositionIncrements) {
>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>        } else {
>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>        }
>>>        return mpq;
>>>      }
>>>    }
>>>    else {
>>>      PhraseQuery pq = new PhraseQuery();
>>>      pq.setSlop(phraseSlop);
>>>      int position = -1;
>>>      for (int i = 0; i < v.size(); i++) {
>>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>        if (enablePositionIncrements) {
>>>          position += t.getPositionIncrement();
>>>          pq.add(new Term(field, t.termText()),position);
>>>        } else {
>>>          pq.add(new Term(field, t.termText()));
>>>        }
>>>      }
>>>      return pq;
>>>    }
>>>  }
>>> }
>>>
>>>
>>> Here is an example that works around the deprecated code:
>>> public void testShingleAnalyzerWrapperPhraseQuery() throws  
>>> Exception {
>>>  Analyzer analyzer = new ShingleAnalyzerWrapper(new
>>> WhitespaceAnalyzer(), 2);
>>>  searcher = setUpSearcher(analyzer);
>>>
>>>  PhraseQuery q = new PhraseQuery();
>>>
>>>  TokenStream ts = analyzer.tokenStream("content",
>>>                                        new StringReader("this
>>> sentence"));
>>>  Token token;
>>>  int j = -1;
>>>  while ((token = ts.next()) != null) {
>>>    j += token.getPositionIncrement();
>>>    String termText = new String(token.termBuffer(), 0,
>>> token.termLength());
>>>    q.add(new Term("content", termText), j);
>>>  }
>>>
>>>  Hits hits = searcher.search(q);
>>>  int[] ranks = new int[] { 0 };
>>>  compareRanks(hits, ranks);
>>> }
>>>
>>> -- DM
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by Hiroaki Kawai <ka...@apache.org>.
Another suggestion from me:
How about making token object as an singleton?


> Maybe we should un-deprecate the termText() method but add javadocs  
> explaining that for better performance you should use the char[] reuse  
> methods instead?
> 
> Mike
> 
> DM Smith wrote:
> 
> > Michael McCandless wrote:
> >>
> >> DM Smith wrote:
> >>
> >>> Shouldn't Term have constructors that take a Token?
> >>
> >> I think that makes sense, though normally Token appears during  
> >> analysis and Term during searching (I think?) -- how often would  
> >> you need to make a Term from a Token?
> >>
> > The problem I'm addressing is that tokens are used in contexts that  
> > need String and not char[].
> > The call to the deprecated
> >  String termText = token.termText();
> > needs to be replaced with:
> >  String termText = new String(token.termBuffer(), 0,  
> > token.termLength());
> >
> > There are over 170 calls to token.termText(), each of these places  
> > have to be modified. In some, perhaps many, of these cases it may be  
> > possible to use char[] directly to get a performance gain.
> >
> > In the case of Term changing it to work with char[] buffer, int  
> > start, int length, does not seem quite right. I think the ripple  
> > would keep getting bigger. But logically, the Term's text is the  
> > text of a Token.
> >
> > To me it makes sense to have a method that returns the token as a  
> > String, but that method is deprecated and the suggested replacement  
> > is to directly use the buffer. So this leads to the above construct.  
> > Perhaps it would be good to add a new method and document that as  
> > one of two replacements.
> > public String term() {
> > return termText != null ? termText : new String(token.termBuffer(),  
> > 0, token.termLength());
> > }
> >
> > Here is an example from QueryParser that has 5 instances, each  
> > calling the deprecated t.termText() method. In this example, there  
> > is the construction of a query from a token stream.
> > Each of the problem lines are of the pattern:
> >  TermQuery currentQuery = new TermQuery(new Term(field,  
> > t.termText()));
> >
> > To remove the deprecated call to t.termText(), the Token's buffer  
> > needs to be marshalled with something like:
> >  String termText = new String(token.termBuffer(), 0,  
> > token.termLength());
> >  TermQuery currentQuery = new TermQuery(new Term(field, termText)));
> >
> > /**
> >  * @exception ParseException throw in overridden method to disallow
> >  */
> > protected Query getFieldQuery(String field, String queryText)   
> > throws ParseException {
> >   // Use the analyzer to get all the tokens, and then build a  
> > TermQuery,
> >   // PhraseQuery, or nothing based on the term count
> >
> >   TokenStream source = analyzer.tokenStream(field, new  
> > StringReader(queryText));
> >   Vector v = new Vector();
> >   org.apache.lucene.analysis.Token t;
> >   int positionCount = 0;
> >   boolean severalTokensAtSamePosition = false;
> >
> >   while (true) {
> >     try {
> >       t = source.next();
> >     }
> >     catch (IOException e) {
> >       t = null;
> >     }
> >     if (t == null)
> >       break;
> >     v.addElement(t);
> >     if (t.getPositionIncrement() != 0)
> >       positionCount += t.getPositionIncrement();
> >     else
> >       severalTokensAtSamePosition = true;
> >   }
> >   try {
> >     source.close();
> >   }
> >   catch (IOException e) {
> >     // ignore
> >   }
> >
> >   if (v.size() == 0)
> >     return null;
> >   else if (v.size() == 1) {
> >     t = (org.apache.lucene.analysis.Token) v.elementAt(0);
> >     return new TermQuery(new Term(field, t.termText()));
> >   } else {
> >     if (severalTokensAtSamePosition) {
> >       if (positionCount == 1) {
> >         // no phrase query:
> >         BooleanQuery q = new BooleanQuery(true);
> >         for (int i = 0; i < v.size(); i++) {
> >           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >           TermQuery currentQuery = new TermQuery(
> >               new Term(field, t.termText()));
> >           q.add(currentQuery, BooleanClause.Occur.SHOULD);
> >         }
> >         return q;
> >       }
> >       else {
> >         // phrase query:
> >         MultiPhraseQuery mpq = new MultiPhraseQuery();
> >         mpq.setSlop(phraseSlop);
> >         List multiTerms = new ArrayList();
> >         int position = -1;
> >         for (int i = 0; i < v.size(); i++) {
> >           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >           if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
> >             if (enablePositionIncrements) {
> >               mpq.add((Term[])multiTerms.toArray(new  
> > Term[0]),position);
> >             } else {
> >               mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >             }
> >             multiTerms.clear();
> >           }
> >           position += t.getPositionIncrement();
> >           multiTerms.add(new Term(field, t.termText()));
> >         }
> >         if (enablePositionIncrements) {
> >           mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
> >         } else {
> >           mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >         }
> >         return mpq;
> >       }
> >     }
> >     else {
> >       PhraseQuery pq = new PhraseQuery();
> >       pq.setSlop(phraseSlop);
> >       int position = -1;
> >       for (int i = 0; i < v.size(); i++) {
> >         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >         if (enablePositionIncrements) {
> >           position += t.getPositionIncrement();
> >           pq.add(new Term(field, t.termText()),position);
> >         } else {
> >           pq.add(new Term(field, t.termText()));
> >         }
> >       }
> >       return pq;
> >     }
> >   }
> > }
> >
> >
> > Here is an example that works around the deprecated code:
> > public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
> >   Analyzer analyzer = new ShingleAnalyzerWrapper(new  
> > WhitespaceAnalyzer(), 2);
> >   searcher = setUpSearcher(analyzer);
> >
> >   PhraseQuery q = new PhraseQuery();
> >
> >   TokenStream ts = analyzer.tokenStream("content",
> >                                         new StringReader("this  
> > sentence"));
> >   Token token;
> >   int j = -1;
> >   while ((token = ts.next()) != null) {
> >     j += token.getPositionIncrement();
> >     String termText = new String(token.termBuffer(), 0,  
> > token.termLength());
> >     q.add(new Term("content", termText), j);
> >   }
> >
> >   Hits hits = searcher.search(q);
> >   int[] ranks = new int[] { 0 };
> >   compareRanks(hits, ranks);
> > }
> >
> > -- DM
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by Hiroaki Kawai <ka...@iij.ad.jp>.
Another suggestion from me:
How about making token object as an singleton?


> Maybe we should un-deprecate the termText() method but add javadocs  
> explaining that for better performance you should use the char[] reuse  
> methods instead?
> 
> Mike
> 
> DM Smith wrote:
> 
> > Michael McCandless wrote:
> >>
> >> DM Smith wrote:
> >>
> >>> Shouldn't Term have constructors that take a Token?
> >>
> >> I think that makes sense, though normally Token appears during  
> >> analysis and Term during searching (I think?) -- how often would  
> >> you need to make a Term from a Token?
> >>
> > The problem I'm addressing is that tokens are used in contexts that  
> > need String and not char[].
> > The call to the deprecated
> >  String termText = token.termText();
> > needs to be replaced with:
> >  String termText = new String(token.termBuffer(), 0,  
> > token.termLength());
> >
> > There are over 170 calls to token.termText(), each of these places  
> > have to be modified. In some, perhaps many, of these cases it may be  
> > possible to use char[] directly to get a performance gain.
> >
> > In the case of Term changing it to work with char[] buffer, int  
> > start, int length, does not seem quite right. I think the ripple  
> > would keep getting bigger. But logically, the Term's text is the  
> > text of a Token.
> >
> > To me it makes sense to have a method that returns the token as a  
> > String, but that method is deprecated and the suggested replacement  
> > is to directly use the buffer. So this leads to the above construct.  
> > Perhaps it would be good to add a new method and document that as  
> > one of two replacements.
> > public String term() {
> > return termText != null ? termText : new String(token.termBuffer(),  
> > 0, token.termLength());
> > }
> >
> > Here is an example from QueryParser that has 5 instances, each  
> > calling the deprecated t.termText() method. In this example, there  
> > is the construction of a query from a token stream.
> > Each of the problem lines are of the pattern:
> >  TermQuery currentQuery = new TermQuery(new Term(field,  
> > t.termText()));
> >
> > To remove the deprecated call to t.termText(), the Token's buffer  
> > needs to be marshalled with something like:
> >  String termText = new String(token.termBuffer(), 0,  
> > token.termLength());
> >  TermQuery currentQuery = new TermQuery(new Term(field, termText)));
> >
> > /**
> >  * @exception ParseException throw in overridden method to disallow
> >  */
> > protected Query getFieldQuery(String field, String queryText)   
> > throws ParseException {
> >   // Use the analyzer to get all the tokens, and then build a  
> > TermQuery,
> >   // PhraseQuery, or nothing based on the term count
> >
> >   TokenStream source = analyzer.tokenStream(field, new  
> > StringReader(queryText));
> >   Vector v = new Vector();
> >   org.apache.lucene.analysis.Token t;
> >   int positionCount = 0;
> >   boolean severalTokensAtSamePosition = false;
> >
> >   while (true) {
> >     try {
> >       t = source.next();
> >     }
> >     catch (IOException e) {
> >       t = null;
> >     }
> >     if (t == null)
> >       break;
> >     v.addElement(t);
> >     if (t.getPositionIncrement() != 0)
> >       positionCount += t.getPositionIncrement();
> >     else
> >       severalTokensAtSamePosition = true;
> >   }
> >   try {
> >     source.close();
> >   }
> >   catch (IOException e) {
> >     // ignore
> >   }
> >
> >   if (v.size() == 0)
> >     return null;
> >   else if (v.size() == 1) {
> >     t = (org.apache.lucene.analysis.Token) v.elementAt(0);
> >     return new TermQuery(new Term(field, t.termText()));
> >   } else {
> >     if (severalTokensAtSamePosition) {
> >       if (positionCount == 1) {
> >         // no phrase query:
> >         BooleanQuery q = new BooleanQuery(true);
> >         for (int i = 0; i < v.size(); i++) {
> >           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >           TermQuery currentQuery = new TermQuery(
> >               new Term(field, t.termText()));
> >           q.add(currentQuery, BooleanClause.Occur.SHOULD);
> >         }
> >         return q;
> >       }
> >       else {
> >         // phrase query:
> >         MultiPhraseQuery mpq = new MultiPhraseQuery();
> >         mpq.setSlop(phraseSlop);
> >         List multiTerms = new ArrayList();
> >         int position = -1;
> >         for (int i = 0; i < v.size(); i++) {
> >           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >           if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
> >             if (enablePositionIncrements) {
> >               mpq.add((Term[])multiTerms.toArray(new  
> > Term[0]),position);
> >             } else {
> >               mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >             }
> >             multiTerms.clear();
> >           }
> >           position += t.getPositionIncrement();
> >           multiTerms.add(new Term(field, t.termText()));
> >         }
> >         if (enablePositionIncrements) {
> >           mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
> >         } else {
> >           mpq.add((Term[])multiTerms.toArray(new Term[0]));
> >         }
> >         return mpq;
> >       }
> >     }
> >     else {
> >       PhraseQuery pq = new PhraseQuery();
> >       pq.setSlop(phraseSlop);
> >       int position = -1;
> >       for (int i = 0; i < v.size(); i++) {
> >         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
> >         if (enablePositionIncrements) {
> >           position += t.getPositionIncrement();
> >           pq.add(new Term(field, t.termText()),position);
> >         } else {
> >           pq.add(new Term(field, t.termText()));
> >         }
> >       }
> >       return pq;
> >     }
> >   }
> > }
> >
> >
> > Here is an example that works around the deprecated code:
> > public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
> >   Analyzer analyzer = new ShingleAnalyzerWrapper(new  
> > WhitespaceAnalyzer(), 2);
> >   searcher = setUpSearcher(analyzer);
> >
> >   PhraseQuery q = new PhraseQuery();
> >
> >   TokenStream ts = analyzer.tokenStream("content",
> >                                         new StringReader("this  
> > sentence"));
> >   Token token;
> >   int j = -1;
> >   while ((token = ts.next()) != null) {
> >     j += token.getPositionIncrement();
> >     String termText = new String(token.termBuffer(), 0,  
> > token.termLength());
> >     q.add(new Term("content", termText), j);
> >   }
> >
> >   Hits hits = searcher.search(q);
> >   int[] ranks = new int[] { 0 };
> >   compareRanks(hits, ranks);
> > }
> >
> > -- DM
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by DM Smith <dm...@gmail.com>.
This would be better, especially when we get to Java 5, with covariant 
returns. Then we could also rename termBuffer() to term() (or just leave 
it :).
DM

Michael McCandless wrote:
>
> Or we could leave termText() deprecated, add term() which does the 
> same thing sub-optimally (ie, always creates new String from the 
> byte[]), and in the javadocs for termText() state that you can migrate 
> either term() (if you really want a String and you understand the 
> performance cost of doing so) or to the re-use APIs?
>
> Mike
>
> DM Smith wrote:
>
>> Michael McCandless wrote:
>>>
>>> Maybe we should un-deprecate the termText() method but add javadocs 
>>> explaining that for better performance you should use the char[] 
>>> reuse methods instead?
>> I think so, too. Should we leave it as deprecated until 3.0? With the 
>> performance note and the encouragement to go for re-use, but also 
>> with a note that the current implementation is deprecated not the 
>> interface.
>>
>> That's not quite what deprecated means. My thought on this is that it 
>> will give everyone a heads up that the current implementation is 
>> going away and that the replacement is sub-optimal.
>>
>> (I use Eclipse and have it set to flag all deprecated uses. This 
>> helps me look for places to change.)
>>
>> I think that this will make migration to 3.0 be much easier.
>>
>> With this changing Term to add Term(String, Token) won't be necessary.
>>
>> -- DM
>>>
>>> Mike
>>>
>>> DM Smith wrote:
>>>
>>>> Michael McCandless wrote:
>>>>>
>>>>> DM Smith wrote:
>>>>>
>>>>>> Shouldn't Term have constructors that take a Token?
>>>>>
>>>>> I think that makes sense, though normally Token appears during 
>>>>> analysis and Term during searching (I think?) -- how often would 
>>>>> you need to make a Term from a Token?
>>>>>
>>>> The problem I'm addressing is that tokens are used in contexts that 
>>>> need String and not char[].
>>>> The call to the deprecated
>>>> String termText = token.termText();
>>>> needs to be replaced with:
>>>> String termText = new String(token.termBuffer(), 0, 
>>>> token.termLength());
>>>>
>>>> There are over 170 calls to token.termText(), each of these places 
>>>> have to be modified. In some, perhaps many, of these cases it may 
>>>> be possible to use char[] directly to get a performance gain.
>>>>
>>>> In the case of Term changing it to work with char[] buffer, int 
>>>> start, int length, does not seem quite right. I think the ripple 
>>>> would keep getting bigger. But logically, the Term's text is the 
>>>> text of a Token.
>>>>
>>>> To me it makes sense to have a method that returns the token as a 
>>>> String, but that method is deprecated and the suggested replacement 
>>>> is to directly use the buffer. So this leads to the above 
>>>> construct. Perhaps it would be good to add a new method and 
>>>> document that as one of two replacements.
>>>> public String term() {
>>>> return termText != null ? termText : new String(token.termBuffer(), 
>>>> 0, token.termLength());
>>>> }
>>>>
>>>> Here is an example from QueryParser that has 5 instances, each 
>>>> calling the deprecated t.termText() method. In this example, there 
>>>> is the construction of a query from a token stream.
>>>> Each of the problem lines are of the pattern:
>>>> TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
>>>>
>>>> To remove the deprecated call to t.termText(), the Token's buffer 
>>>> needs to be marshalled with something like:
>>>> String termText = new String(token.termBuffer(), 0, 
>>>> token.termLength());
>>>> TermQuery currentQuery = new TermQuery(new Term(field, termText)));
>>>>
>>>> /**
>>>> * @exception ParseException throw in overridden method to disallow
>>>> */
>>>> protected Query getFieldQuery(String field, String queryText)  
>>>> throws ParseException {
>>>>  // Use the analyzer to get all the tokens, and then build a 
>>>> TermQuery,
>>>>  // PhraseQuery, or nothing based on the term count
>>>>
>>>>  TokenStream source = analyzer.tokenStream(field, new 
>>>> StringReader(queryText));
>>>>  Vector v = new Vector();
>>>>  org.apache.lucene.analysis.Token t;
>>>>  int positionCount = 0;
>>>>  boolean severalTokensAtSamePosition = false;
>>>>
>>>>  while (true) {
>>>>    try {
>>>>      t = source.next();
>>>>    }
>>>>    catch (IOException e) {
>>>>      t = null;
>>>>    }
>>>>    if (t == null)
>>>>      break;
>>>>    v.addElement(t);
>>>>    if (t.getPositionIncrement() != 0)
>>>>      positionCount += t.getPositionIncrement();
>>>>    else
>>>>      severalTokensAtSamePosition = true;
>>>>  }
>>>>  try {
>>>>    source.close();
>>>>  }
>>>>  catch (IOException e) {
>>>>    // ignore
>>>>  }
>>>>
>>>>  if (v.size() == 0)
>>>>    return null;
>>>>  else if (v.size() == 1) {
>>>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>>>    return new TermQuery(new Term(field, t.termText()));
>>>>  } else {
>>>>    if (severalTokensAtSamePosition) {
>>>>      if (positionCount == 1) {
>>>>        // no phrase query:
>>>>        BooleanQuery q = new BooleanQuery(true);
>>>>        for (int i = 0; i < v.size(); i++) {
>>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>          TermQuery currentQuery = new TermQuery(
>>>>              new Term(field, t.termText()));
>>>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>>>        }
>>>>        return q;
>>>>      }
>>>>      else {
>>>>        // phrase query:
>>>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
>>>>        mpq.setSlop(phraseSlop);
>>>>        List multiTerms = new ArrayList();
>>>>        int position = -1;
>>>>        for (int i = 0; i < v.size(); i++) {
>>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>          if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
>>>>            if (enablePositionIncrements) {
>>>>              mpq.add((Term[])multiTerms.toArray(new 
>>>> Term[0]),position);
>>>>            } else {
>>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>>            }
>>>>            multiTerms.clear();
>>>>          }
>>>>          position += t.getPositionIncrement();
>>>>          multiTerms.add(new Term(field, t.termText()));
>>>>        }
>>>>        if (enablePositionIncrements) {
>>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>>        } else {
>>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>>        }
>>>>        return mpq;
>>>>      }
>>>>    }
>>>>    else {
>>>>      PhraseQuery pq = new PhraseQuery();
>>>>      pq.setSlop(phraseSlop);
>>>>      int position = -1;
>>>>      for (int i = 0; i < v.size(); i++) {
>>>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>        if (enablePositionIncrements) {
>>>>          position += t.getPositionIncrement();
>>>>          pq.add(new Term(field, t.termText()),position);
>>>>        } else {
>>>>          pq.add(new Term(field, t.termText()));
>>>>        }
>>>>      }
>>>>      return pq;
>>>>    }
>>>>  }
>>>> }
>>>>
>>>>
>>>> Here is an example that works around the deprecated code:
>>>> public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
>>>>  Analyzer analyzer = new ShingleAnalyzerWrapper(new 
>>>> WhitespaceAnalyzer(), 2);
>>>>  searcher = setUpSearcher(analyzer);
>>>>
>>>>  PhraseQuery q = new PhraseQuery();
>>>>
>>>>  TokenStream ts = analyzer.tokenStream("content",
>>>>                                        new StringReader("this 
>>>> sentence"));
>>>>  Token token;
>>>>  int j = -1;
>>>>  while ((token = ts.next()) != null) {
>>>>    j += token.getPositionIncrement();
>>>>    String termText = new String(token.termBuffer(), 0, 
>>>> token.termLength());
>>>>    q.add(new Term("content", termText), j);
>>>>  }
>>>>
>>>>  Hits hits = searcher.search(q);
>>>>  int[] ranks = new int[] { 0 };
>>>>  compareRanks(hits, ranks);
>>>> }
>>>>
>>>> -- DM


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by Michael McCandless <lu...@mikemccandless.com>.
Or we could leave termText() deprecated, add term() which does the  
same thing sub-optimally (ie, always creates new String from the  
byte[]), and in the javadocs for termText() state that you can migrate  
either term() (if you really want a String and you understand the  
performance cost of doing so) or to the re-use APIs?

Mike

DM Smith wrote:

> Michael McCandless wrote:
>>
>> Maybe we should un-deprecate the termText() method but add javadocs  
>> explaining that for better performance you should use the char[]  
>> reuse methods instead?
> I think so, too. Should we leave it as deprecated until 3.0? With  
> the performance note and the encouragement to go for re-use, but  
> also with a note that the current implementation is deprecated not  
> the interface.
>
> That's not quite what deprecated means. My thought on this is that  
> it will give everyone a heads up that the current implementation is  
> going away and that the replacement is sub-optimal.
>
> (I use Eclipse and have it set to flag all deprecated uses. This  
> helps me look for places to change.)
>
> I think that this will make migration to 3.0 be much easier.
>
> With this changing Term to add Term(String, Token) won't be necessary.
>
> -- DM
>>
>> Mike
>>
>> DM Smith wrote:
>>
>>> Michael McCandless wrote:
>>>>
>>>> DM Smith wrote:
>>>>
>>>>> Shouldn't Term have constructors that take a Token?
>>>>
>>>> I think that makes sense, though normally Token appears during  
>>>> analysis and Term during searching (I think?) -- how often would  
>>>> you need to make a Term from a Token?
>>>>
>>> The problem I'm addressing is that tokens are used in contexts  
>>> that need String and not char[].
>>> The call to the deprecated
>>> String termText = token.termText();
>>> needs to be replaced with:
>>> String termText = new String(token.termBuffer(), 0,  
>>> token.termLength());
>>>
>>> There are over 170 calls to token.termText(), each of these places  
>>> have to be modified. In some, perhaps many, of these cases it may  
>>> be possible to use char[] directly to get a performance gain.
>>>
>>> In the case of Term changing it to work with char[] buffer, int  
>>> start, int length, does not seem quite right. I think the ripple  
>>> would keep getting bigger. But logically, the Term's text is the  
>>> text of a Token.
>>>
>>> To me it makes sense to have a method that returns the token as a  
>>> String, but that method is deprecated and the suggested  
>>> replacement is to directly use the buffer. So this leads to the  
>>> above construct. Perhaps it would be good to add a new method and  
>>> document that as one of two replacements.
>>> public String term() {
>>> return termText != null ? termText : new  
>>> String(token.termBuffer(), 0, token.termLength());
>>> }
>>>
>>> Here is an example from QueryParser that has 5 instances, each  
>>> calling the deprecated t.termText() method. In this example, there  
>>> is the construction of a query from a token stream.
>>> Each of the problem lines are of the pattern:
>>> TermQuery currentQuery = new TermQuery(new Term(field,  
>>> t.termText()));
>>>
>>> To remove the deprecated call to t.termText(), the Token's buffer  
>>> needs to be marshalled with something like:
>>> String termText = new String(token.termBuffer(), 0,  
>>> token.termLength());
>>> TermQuery currentQuery = new TermQuery(new Term(field, termText)));
>>>
>>> /**
>>> * @exception ParseException throw in overridden method to disallow
>>> */
>>> protected Query getFieldQuery(String field, String queryText)   
>>> throws ParseException {
>>>  // Use the analyzer to get all the tokens, and then build a  
>>> TermQuery,
>>>  // PhraseQuery, or nothing based on the term count
>>>
>>>  TokenStream source = analyzer.tokenStream(field, new  
>>> StringReader(queryText));
>>>  Vector v = new Vector();
>>>  org.apache.lucene.analysis.Token t;
>>>  int positionCount = 0;
>>>  boolean severalTokensAtSamePosition = false;
>>>
>>>  while (true) {
>>>    try {
>>>      t = source.next();
>>>    }
>>>    catch (IOException e) {
>>>      t = null;
>>>    }
>>>    if (t == null)
>>>      break;
>>>    v.addElement(t);
>>>    if (t.getPositionIncrement() != 0)
>>>      positionCount += t.getPositionIncrement();
>>>    else
>>>      severalTokensAtSamePosition = true;
>>>  }
>>>  try {
>>>    source.close();
>>>  }
>>>  catch (IOException e) {
>>>    // ignore
>>>  }
>>>
>>>  if (v.size() == 0)
>>>    return null;
>>>  else if (v.size() == 1) {
>>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>>    return new TermQuery(new Term(field, t.termText()));
>>>  } else {
>>>    if (severalTokensAtSamePosition) {
>>>      if (positionCount == 1) {
>>>        // no phrase query:
>>>        BooleanQuery q = new BooleanQuery(true);
>>>        for (int i = 0; i < v.size(); i++) {
>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>          TermQuery currentQuery = new TermQuery(
>>>              new Term(field, t.termText()));
>>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>>        }
>>>        return q;
>>>      }
>>>      else {
>>>        // phrase query:
>>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
>>>        mpq.setSlop(phraseSlop);
>>>        List multiTerms = new ArrayList();
>>>        int position = -1;
>>>        for (int i = 0; i < v.size(); i++) {
>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>          if (t.getPositionIncrement() > 0 && multiTerms.size() >  
>>> 0) {
>>>            if (enablePositionIncrements) {
>>>              mpq.add((Term[])multiTerms.toArray(new  
>>> Term[0]),position);
>>>            } else {
>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>            }
>>>            multiTerms.clear();
>>>          }
>>>          position += t.getPositionIncrement();
>>>          multiTerms.add(new Term(field, t.termText()));
>>>        }
>>>        if (enablePositionIncrements) {
>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>        } else {
>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>        }
>>>        return mpq;
>>>      }
>>>    }
>>>    else {
>>>      PhraseQuery pq = new PhraseQuery();
>>>      pq.setSlop(phraseSlop);
>>>      int position = -1;
>>>      for (int i = 0; i < v.size(); i++) {
>>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>        if (enablePositionIncrements) {
>>>          position += t.getPositionIncrement();
>>>          pq.add(new Term(field, t.termText()),position);
>>>        } else {
>>>          pq.add(new Term(field, t.termText()));
>>>        }
>>>      }
>>>      return pq;
>>>    }
>>>  }
>>> }
>>>
>>>
>>> Here is an example that works around the deprecated code:
>>> public void testShingleAnalyzerWrapperPhraseQuery() throws  
>>> Exception {
>>>  Analyzer analyzer = new ShingleAnalyzerWrapper(new  
>>> WhitespaceAnalyzer(), 2);
>>>  searcher = setUpSearcher(analyzer);
>>>
>>>  PhraseQuery q = new PhraseQuery();
>>>
>>>  TokenStream ts = analyzer.tokenStream("content",
>>>                                        new StringReader("this  
>>> sentence"));
>>>  Token token;
>>>  int j = -1;
>>>  while ((token = ts.next()) != null) {
>>>    j += token.getPositionIncrement();
>>>    String termText = new String(token.termBuffer(), 0,  
>>> token.termLength());
>>>    q.add(new Term("content", termText), j);
>>>  }
>>>
>>>  Hits hits = searcher.search(q);
>>>  int[] ranks = new int[] { 0 };
>>>  compareRanks(hits, ranks);
>>> }
>>>
>>> -- DM
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by DM Smith <dm...@gmail.com>.
Michael McCandless wrote:
>
> Maybe we should un-deprecate the termText() method but add javadocs 
> explaining that for better performance you should use the char[] reuse 
> methods instead?
I think so, too. Should we leave it as deprecated until 3.0? With the 
performance note and the encouragement to go for re-use, but also with a 
note that the current implementation is deprecated not the interface.

That's not quite what deprecated means. My thought on this is that it 
will give everyone a heads up that the current implementation is going 
away and that the replacement is sub-optimal.

(I use Eclipse and have it set to flag all deprecated uses. This helps 
me look for places to change.)

I think that this will make migration to 3.0 be much easier.

With this changing Term to add Term(String, Token) won't be necessary.

-- DM
>
> Mike
>
> DM Smith wrote:
>
>> Michael McCandless wrote:
>>>
>>> DM Smith wrote:
>>>
>>>> Shouldn't Term have constructors that take a Token?
>>>
>>> I think that makes sense, though normally Token appears during 
>>> analysis and Term during searching (I think?) -- how often would you 
>>> need to make a Term from a Token?
>>>
>> The problem I'm addressing is that tokens are used in contexts that 
>> need String and not char[].
>> The call to the deprecated
>>  String termText = token.termText();
>> needs to be replaced with:
>>  String termText = new String(token.termBuffer(), 0, 
>> token.termLength());
>>
>> There are over 170 calls to token.termText(), each of these places 
>> have to be modified. In some, perhaps many, of these cases it may be 
>> possible to use char[] directly to get a performance gain.
>>
>> In the case of Term changing it to work with char[] buffer, int 
>> start, int length, does not seem quite right. I think the ripple 
>> would keep getting bigger. But logically, the Term's text is the text 
>> of a Token.
>>
>> To me it makes sense to have a method that returns the token as a 
>> String, but that method is deprecated and the suggested replacement 
>> is to directly use the buffer. So this leads to the above construct. 
>> Perhaps it would be good to add a new method and document that as one 
>> of two replacements.
>> public String term() {
>> return termText != null ? termText : new String(token.termBuffer(), 
>> 0, token.termLength());
>> }
>>
>> Here is an example from QueryParser that has 5 instances, each 
>> calling the deprecated t.termText() method. In this example, there is 
>> the construction of a query from a token stream.
>> Each of the problem lines are of the pattern:
>>  TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
>>
>> To remove the deprecated call to t.termText(), the Token's buffer 
>> needs to be marshalled with something like:
>>  String termText = new String(token.termBuffer(), 0, 
>> token.termLength());
>>  TermQuery currentQuery = new TermQuery(new Term(field, termText)));
>>
>> /**
>>  * @exception ParseException throw in overridden method to disallow
>>  */
>> protected Query getFieldQuery(String field, String queryText)  throws 
>> ParseException {
>>   // Use the analyzer to get all the tokens, and then build a TermQuery,
>>   // PhraseQuery, or nothing based on the term count
>>
>>   TokenStream source = analyzer.tokenStream(field, new 
>> StringReader(queryText));
>>   Vector v = new Vector();
>>   org.apache.lucene.analysis.Token t;
>>   int positionCount = 0;
>>   boolean severalTokensAtSamePosition = false;
>>
>>   while (true) {
>>     try {
>>       t = source.next();
>>     }
>>     catch (IOException e) {
>>       t = null;
>>     }
>>     if (t == null)
>>       break;
>>     v.addElement(t);
>>     if (t.getPositionIncrement() != 0)
>>       positionCount += t.getPositionIncrement();
>>     else
>>       severalTokensAtSamePosition = true;
>>   }
>>   try {
>>     source.close();
>>   }
>>   catch (IOException e) {
>>     // ignore
>>   }
>>
>>   if (v.size() == 0)
>>     return null;
>>   else if (v.size() == 1) {
>>     t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>     return new TermQuery(new Term(field, t.termText()));
>>   } else {
>>     if (severalTokensAtSamePosition) {
>>       if (positionCount == 1) {
>>         // no phrase query:
>>         BooleanQuery q = new BooleanQuery(true);
>>         for (int i = 0; i < v.size(); i++) {
>>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>           TermQuery currentQuery = new TermQuery(
>>               new Term(field, t.termText()));
>>           q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>         }
>>         return q;
>>       }
>>       else {
>>         // phrase query:
>>         MultiPhraseQuery mpq = new MultiPhraseQuery();
>>         mpq.setSlop(phraseSlop);
>>         List multiTerms = new ArrayList();
>>         int position = -1;
>>         for (int i = 0; i < v.size(); i++) {
>>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>           if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
>>             if (enablePositionIncrements) {
>>               mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>             } else {
>>               mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>             }
>>             multiTerms.clear();
>>           }
>>           position += t.getPositionIncrement();
>>           multiTerms.add(new Term(field, t.termText()));
>>         }
>>         if (enablePositionIncrements) {
>>           mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>         } else {
>>           mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>         }
>>         return mpq;
>>       }
>>     }
>>     else {
>>       PhraseQuery pq = new PhraseQuery();
>>       pq.setSlop(phraseSlop);
>>       int position = -1;
>>       for (int i = 0; i < v.size(); i++) {
>>         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>         if (enablePositionIncrements) {
>>           position += t.getPositionIncrement();
>>           pq.add(new Term(field, t.termText()),position);
>>         } else {
>>           pq.add(new Term(field, t.termText()));
>>         }
>>       }
>>       return pq;
>>     }
>>   }
>> }
>>
>>
>> Here is an example that works around the deprecated code:
>> public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
>>   Analyzer analyzer = new ShingleAnalyzerWrapper(new 
>> WhitespaceAnalyzer(), 2);
>>   searcher = setUpSearcher(analyzer);
>>
>>   PhraseQuery q = new PhraseQuery();
>>
>>   TokenStream ts = analyzer.tokenStream("content",
>>                                         new StringReader("this 
>> sentence"));
>>   Token token;
>>   int j = -1;
>>   while ((token = ts.next()) != null) {
>>     j += token.getPositionIncrement();
>>     String termText = new String(token.termBuffer(), 0, 
>> token.termLength());
>>     q.add(new Term("content", termText), j);
>>   }
>>
>>   Hits hits = searcher.search(q);
>>   int[] ranks = new int[] { 0 };
>>   compareRanks(hits, ranks);
>> }
>>
>> -- DM
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by Michael McCandless <lu...@mikemccandless.com>.
Maybe we should un-deprecate the termText() method but add javadocs  
explaining that for better performance you should use the char[] reuse  
methods instead?

Mike

DM Smith wrote:

> Michael McCandless wrote:
>>
>> DM Smith wrote:
>>
>>> Shouldn't Term have constructors that take a Token?
>>
>> I think that makes sense, though normally Token appears during  
>> analysis and Term during searching (I think?) -- how often would  
>> you need to make a Term from a Token?
>>
> The problem I'm addressing is that tokens are used in contexts that  
> need String and not char[].
> The call to the deprecated
>  String termText = token.termText();
> needs to be replaced with:
>  String termText = new String(token.termBuffer(), 0,  
> token.termLength());
>
> There are over 170 calls to token.termText(), each of these places  
> have to be modified. In some, perhaps many, of these cases it may be  
> possible to use char[] directly to get a performance gain.
>
> In the case of Term changing it to work with char[] buffer, int  
> start, int length, does not seem quite right. I think the ripple  
> would keep getting bigger. But logically, the Term's text is the  
> text of a Token.
>
> To me it makes sense to have a method that returns the token as a  
> String, but that method is deprecated and the suggested replacement  
> is to directly use the buffer. So this leads to the above construct.  
> Perhaps it would be good to add a new method and document that as  
> one of two replacements.
> public String term() {
> return termText != null ? termText : new String(token.termBuffer(),  
> 0, token.termLength());
> }
>
> Here is an example from QueryParser that has 5 instances, each  
> calling the deprecated t.termText() method. In this example, there  
> is the construction of a query from a token stream.
> Each of the problem lines are of the pattern:
>  TermQuery currentQuery = new TermQuery(new Term(field,  
> t.termText()));
>
> To remove the deprecated call to t.termText(), the Token's buffer  
> needs to be marshalled with something like:
>  String termText = new String(token.termBuffer(), 0,  
> token.termLength());
>  TermQuery currentQuery = new TermQuery(new Term(field, termText)));
>
> /**
>  * @exception ParseException throw in overridden method to disallow
>  */
> protected Query getFieldQuery(String field, String queryText)   
> throws ParseException {
>   // Use the analyzer to get all the tokens, and then build a  
> TermQuery,
>   // PhraseQuery, or nothing based on the term count
>
>   TokenStream source = analyzer.tokenStream(field, new  
> StringReader(queryText));
>   Vector v = new Vector();
>   org.apache.lucene.analysis.Token t;
>   int positionCount = 0;
>   boolean severalTokensAtSamePosition = false;
>
>   while (true) {
>     try {
>       t = source.next();
>     }
>     catch (IOException e) {
>       t = null;
>     }
>     if (t == null)
>       break;
>     v.addElement(t);
>     if (t.getPositionIncrement() != 0)
>       positionCount += t.getPositionIncrement();
>     else
>       severalTokensAtSamePosition = true;
>   }
>   try {
>     source.close();
>   }
>   catch (IOException e) {
>     // ignore
>   }
>
>   if (v.size() == 0)
>     return null;
>   else if (v.size() == 1) {
>     t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>     return new TermQuery(new Term(field, t.termText()));
>   } else {
>     if (severalTokensAtSamePosition) {
>       if (positionCount == 1) {
>         // no phrase query:
>         BooleanQuery q = new BooleanQuery(true);
>         for (int i = 0; i < v.size(); i++) {
>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>           TermQuery currentQuery = new TermQuery(
>               new Term(field, t.termText()));
>           q.add(currentQuery, BooleanClause.Occur.SHOULD);
>         }
>         return q;
>       }
>       else {
>         // phrase query:
>         MultiPhraseQuery mpq = new MultiPhraseQuery();
>         mpq.setSlop(phraseSlop);
>         List multiTerms = new ArrayList();
>         int position = -1;
>         for (int i = 0; i < v.size(); i++) {
>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>           if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
>             if (enablePositionIncrements) {
>               mpq.add((Term[])multiTerms.toArray(new  
> Term[0]),position);
>             } else {
>               mpq.add((Term[])multiTerms.toArray(new Term[0]));
>             }
>             multiTerms.clear();
>           }
>           position += t.getPositionIncrement();
>           multiTerms.add(new Term(field, t.termText()));
>         }
>         if (enablePositionIncrements) {
>           mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>         } else {
>           mpq.add((Term[])multiTerms.toArray(new Term[0]));
>         }
>         return mpq;
>       }
>     }
>     else {
>       PhraseQuery pq = new PhraseQuery();
>       pq.setSlop(phraseSlop);
>       int position = -1;
>       for (int i = 0; i < v.size(); i++) {
>         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>         if (enablePositionIncrements) {
>           position += t.getPositionIncrement();
>           pq.add(new Term(field, t.termText()),position);
>         } else {
>           pq.add(new Term(field, t.termText()));
>         }
>       }
>       return pq;
>     }
>   }
> }
>
>
> Here is an example that works around the deprecated code:
> public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
>   Analyzer analyzer = new ShingleAnalyzerWrapper(new  
> WhitespaceAnalyzer(), 2);
>   searcher = setUpSearcher(analyzer);
>
>   PhraseQuery q = new PhraseQuery();
>
>   TokenStream ts = analyzer.tokenStream("content",
>                                         new StringReader("this  
> sentence"));
>   Token token;
>   int j = -1;
>   while ((token = ts.next()) != null) {
>     j += token.getPositionIncrement();
>     String termText = new String(token.termBuffer(), 0,  
> token.termLength());
>     q.add(new Term("content", termText), j);
>   }
>
>   Hits hits = searcher.search(q);
>   int[] ranks = new int[] { 0 };
>   compareRanks(hits, ranks);
> }
>
> -- DM
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by DM Smith <dm...@gmail.com>.
Michael McCandless wrote:
>
> DM Smith wrote:
>
>>  Shouldn't Term have constructors that take a Token?
>
> I think that makes sense, though normally Token appears during 
> analysis and Term during searching (I think?) -- how often would you 
> need to make a Term from a Token?
>
The problem I'm addressing is that tokens are used in contexts that need 
String and not char[].
The call to the deprecated
   String termText = token.termText();
needs to be replaced with:
   String termText = new String(token.termBuffer(), 0, token.termLength());

There are over 170 calls to token.termText(), each of these places have 
to be modified. In some, perhaps many, of these cases it may be possible 
to use char[] directly to get a performance gain.

In the case of Term changing it to work with char[] buffer, int start, 
int length, does not seem quite right. I think the ripple would keep 
getting bigger. But logically, the Term's text is the text of a Token.

To me it makes sense to have a method that returns the token as a 
String, but that method is deprecated and the suggested replacement is 
to directly use the buffer. So this leads to the above construct. 
Perhaps it would be good to add a new method and document that as one of 
two replacements.
public String term() {
  return termText != null ? termText : new String(token.termBuffer(), 0, 
token.termLength());
}

Here is an example from QueryParser that has 5 instances, each calling 
the deprecated t.termText() method. In this example, there is the 
construction of a query from a token stream.
Each of the problem lines are of the pattern:
   TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));

To remove the deprecated call to t.termText(), the Token's buffer needs 
to be marshalled with something like:
   String termText = new String(token.termBuffer(), 0, token.termLength());
   TermQuery currentQuery = new TermQuery(new Term(field, termText)));

  /**
   * @exception ParseException throw in overridden method to disallow
   */
  protected Query getFieldQuery(String field, String queryText)  throws 
ParseException {
    // Use the analyzer to get all the tokens, and then build a TermQuery,
    // PhraseQuery, or nothing based on the term count

    TokenStream source = analyzer.tokenStream(field, new 
StringReader(queryText));
    Vector v = new Vector();
    org.apache.lucene.analysis.Token t;
    int positionCount = 0;
    boolean severalTokensAtSamePosition = false;

    while (true) {
      try {
        t = source.next();
      }
      catch (IOException e) {
        t = null;
      }
      if (t == null)
        break;
      v.addElement(t);
      if (t.getPositionIncrement() != 0)
        positionCount += t.getPositionIncrement();
      else
        severalTokensAtSamePosition = true;
    }
    try {
      source.close();
    }
    catch (IOException e) {
      // ignore
    }

    if (v.size() == 0)
      return null;
    else if (v.size() == 1) {
      t = (org.apache.lucene.analysis.Token) v.elementAt(0);
      return new TermQuery(new Term(field, t.termText()));
    } else {
      if (severalTokensAtSamePosition) {
        if (positionCount == 1) {
          // no phrase query:
          BooleanQuery q = new BooleanQuery(true);
          for (int i = 0; i < v.size(); i++) {
            t = (org.apache.lucene.analysis.Token) v.elementAt(i);
            TermQuery currentQuery = new TermQuery(
                new Term(field, t.termText()));
            q.add(currentQuery, BooleanClause.Occur.SHOULD);
          }
          return q;
        }
        else {
          // phrase query:
          MultiPhraseQuery mpq = new MultiPhraseQuery();
          mpq.setSlop(phraseSlop);
          List multiTerms = new ArrayList();
          int position = -1;
          for (int i = 0; i < v.size(); i++) {
            t = (org.apache.lucene.analysis.Token) v.elementAt(i);
            if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
              if (enablePositionIncrements) {
                mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
              } else {
                mpq.add((Term[])multiTerms.toArray(new Term[0]));
              }
              multiTerms.clear();
            }
            position += t.getPositionIncrement();
            multiTerms.add(new Term(field, t.termText()));
          }
          if (enablePositionIncrements) {
            mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
          } else {
            mpq.add((Term[])multiTerms.toArray(new Term[0]));
          }
          return mpq;
        }
      }
      else {
        PhraseQuery pq = new PhraseQuery();
        pq.setSlop(phraseSlop);
        int position = -1;
        for (int i = 0; i < v.size(); i++) {
          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
          if (enablePositionIncrements) {
            position += t.getPositionIncrement();
            pq.add(new Term(field, t.termText()),position);
          } else {
            pq.add(new Term(field, t.termText()));
          }
        }
        return pq;
      }
    }
  }


Here is an example that works around the deprecated code:
  public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
    Analyzer analyzer = new ShingleAnalyzerWrapper(new 
WhitespaceAnalyzer(), 2);
    searcher = setUpSearcher(analyzer);

    PhraseQuery q = new PhraseQuery();

    TokenStream ts = analyzer.tokenStream("content",
                                          new StringReader("this 
sentence"));
    Token token;
    int j = -1;
    while ((token = ts.next()) != null) {
      j += token.getPositionIncrement();
      String termText = new String(token.termBuffer(), 0, 
token.termLength());
      q.add(new Term("content", termText), j);
    }

    Hits hits = searcher.search(q);
    int[] ranks = new int[] { 0 };
    compareRanks(hits, ranks);
  }

-- DM

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Token implementation

Posted by Michael McCandless <lu...@mikemccandless.com>.
DM Smith wrote:

>  Shouldn't Term have constructors that take a Token?

I think that makes sense, though normally Token appears during  
analysis and Term during searching (I think?) -- how often would you  
need to make a Term from a Token?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org