Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2007/12/08 23:44:20 UTC

Getting the actual token from Token's term buffer

Hi,

It's been a while since I've written a custom TokenFilter, and I'm not having any luck getting the actual terms out of the TokenStream with 2.3-dev.
I keep getting the default term buffer of size 10 with the following:

    public final Token next(Token result) throws IOException {
        result = input.next(result);
        if (result != null) {
            final int len = result.termLength();       // gives me the actual term length, not the buffer of length 10
            result.setTermLength(len);
            System.out.println("LEN 1: " + len);     // prints the actual length
            final char[] buffer = result.termBuffer(); // this still gives me the buffer of length 10
            System.out.println("LEN 2: " + buffer.length); // and this prints 10
        }
        return result;
    }


Is the idea to:
1) get the char[] buffer from Token
2) get its real length via termLength()
3) manually fill a new char[] with the content of the buffer, minus the extra buffering?

I'm looking at Token to see how to get the *actual* term, but don't see anything, so it looks like a Filter writer has to do one of these for each term buffer:

    public final Token next(Token result) throws IOException {

        result = input.next(result);

        if (result != null) {
            final int len = result.termLength();
            final char[] buffer = result.termBuffer();
            final char[] token = new char[len];
            System.arraycopy(buffer, 0, token, 0, len);
            // ... work with token ...
        }
        return result;
    }

Am I missing a Token method I could use instead, or is this the new way to go?

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: help required ... ~ operator

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Dec 10, 2007, at 4:48 AM, Shakti_Sareen wrote:
>      I am using StandardAnalyzer() to index the data. I am getting false
> hits in ~ operator query.
>
> Actual data is: "signals by magnets of different strength"
> and when I am parsing a query: "signals strength"~2  , I am getting a
> hit which is a false result.
>
> I am using QueryParser.
>
> Please help on this issue.

Chances are that you've got a stop word remover in the mix, and "by"
and "of" are being removed, thus making the remaining words close
enough for a match.  The built-in stop filter does not leave gaps for
removed words.  So you could either use a custom stop filter that does
leave gaps, or remove the stop filter altogether to keep those words there.

	Erik
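
The effect described above can be sketched without Lucene at all. The following is only an illustration under stated assumptions: the class StopGapDemo, its two-word stop set, and the simplified in-order slop rule (number of positions between the two terms must be <= slop) are all hypothetical stand-ins, not Lucene's actual StopFilter or PhraseQuery code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: models token positions by hand, does not use Lucene.
class StopGapDemo {
    // Hypothetical stop set: just the two words relevant to the example.
    static final Set<String> STOP = new HashSet<String>(Arrays.asList("by", "of"));

    // Position of `term` after stop-word removal, or -1 if it is absent.
    // keepGaps=true: each removed stop word still consumes a position.
    // keepGaps=false: removed stop words leave no gap behind.
    static int position(String text, String term, boolean keepGaps) {
        int pos = 0;
        for (String w : text.toLowerCase().split("\\s+")) {
            if (STOP.contains(w)) {
                if (keepGaps) {
                    pos++;
                }
                continue;
            }
            if (w.equals(term)) {
                return pos;
            }
            pos++;
        }
        return -1;
    }

    // Simplified in-order sloppy match for a two-term phrase:
    // the number of positions between the terms must not exceed the slop.
    static boolean matches(String text, String first, String second,
                           int slop, boolean keepGaps) {
        int a = position(text, first, keepGaps);
        int b = position(text, second, keepGaps);
        return a >= 0 && b > a && (b - a - 1) <= slop;
    }

    public static void main(String[] args) {
        String text = "signals by magnets of different strength";
        // Without gaps, "strength" lands at position 3, so slop 2 is enough.
        System.out.println("no gaps:   " + matches(text, "signals", "strength", 2, false));
        // With gaps, "strength" stays at position 5, so slop 2 is not enough.
        System.out.println("with gaps: " + matches(text, "signals", "strength", 2, true));
    }
}
```

With the stop words dropped and no gaps kept, "signals" and "strength" end up only two positions apart, which is why "signals strength"~2 reports a hit.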


help required ... ~ operator

Posted by Shakti_Sareen <Sh...@satyam.com>.

Hi all,

     I am using StandardAnalyzer() to index the data. I am getting false
hits in ~ operator query.

Actual data is: "signals by magnets of different strength"
and when I am parsing a query: "signals strength"~2  , I am getting a
hit which is a false result.

I am using QueryParser.

Please help on this issue.

Thanks
Shakti Sareen






Re: Getting the actual token from Token's term buffer

Posted by Michael McCandless <lu...@mikemccandless.com>.

Otis Gospodnetic wrote:

 > Is the idea to:
 >   1) get the char[] buffer from Token
 >   2) get its real length via termLength()

Yes.  And, on getting the char[] buffer, if you need more space than
its current length, call resizeTermBuffer(int newSize), which returns
a buffer of size >= newSize.

 > 3) manually fill a new char[] with the content of the buffer,
 >    minus the extra buffering?

Or, better, directly alter the char[] buffer you just got, in place.

If you really need/want a new buffer, then you can call
Token.setTermBuffer and it will do the copy (into its buffer) for
you.

Mike
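
The two approaches discussed in this thread can be sketched in plain Java. This is a minimal illustration, not Lucene code: the class TermBufferDemo and its methods are hypothetical, and a plain char[10] stands in for what termBuffer() returns, with a separate int standing in for termLength(). Only the first `length` chars of the buffer are valid; the rest is spare capacity.

```java
// Stand-in for a Token's oversized term buffer; no Lucene dependency.
class TermBufferDemo {
    // Read the actual term: build a String from the valid prefix only.
    static String term(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    // Alter the term in place (here: lowercase it), touching only the
    // valid prefix and leaving the spare capacity alone.
    static void lowerCaseInPlace(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
    }

    public static void main(String[] args) {
        char[] buffer = new char[10];        // plays the role of termBuffer()
        "Lucene".getChars(0, 6, buffer, 0);  // term occupies the first 6 chars
        int len = 6;                         // plays the role of termLength()

        lowerCaseInPlace(buffer, len);
        System.out.println(term(buffer, len)); // prints "lucene"
        System.out.println(buffer.length);     // prints 10, the capacity
    }
}
```

Inside a real filter the same loop would run over result.termBuffer() up to result.termLength(), so no extra char[] needs to be allocated per token unless a String or a trimmed copy is actually required.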
