Posted to java-user@lucene.apache.org by Peter Karich <pe...@yahoo.de> on 2011/12/19 17:03:12 UTC

Lucene 4.0 questions, was: shift bug in possibly invalid use of NumericTokenStream

Hi Uwe,

thanks for the talk suggestion(s)*.

I was using it for faster term lookups of a long 'id'. How would this be
done with 4.0? Before I did it via Term:

new Term(fieldName, NumericUtils.longToPrefixCoded(longValue));

How should I generally do "term lookup" in 4.0, as you said in the video
that 'Term' gets removed at some point :)? What is the most recommended way
and what is the fastest? Or where can I find "most recent" code in the
Lucene tests to be used as an example?

I also heard the suggestion to use the pulsing codec for id retrieval**.
Is this the correct way nowadays to achieve this:

indexWriterCfg.setCodec(new Lucene40Codec() {
   @Override public PostingsFormat getPostingsFormatForField(String field) {
       if("_id".equals(field)) return new Pulsing40PostingsFormat();
       else ?
   }});

Regards,
Peter.

*
http://vimeo.com/32065505

**
http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html


> Hi,
>
> NumericUtils is an internal implementation class, you should not use it.
> What do you want to do? There is no need to call any of its methods during
> indexing or searching. Everything else is advanced. In the latter case you
> should RTFM of BytesRef and related classes (possibly watch the flexible
> indexing talk done by me in Berlin, Barcelona or San Francisco). Lucene
> moved to binary terms in 4.0 and no longer uses character based terms, so
> the code is different. BytesRef is just a wrapper around a byte[].
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: inspecting chinese index using luke

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

Please look at:
http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention. It makes following discussions in the mailing list archives
particularly difficult.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Peyman Faratin [mailto:peyman@robustlinks.com]
> Sent: Monday, December 19, 2011 6:11 PM
> To: java-user@lucene.apache.org
> Subject: inspecting chinese index using luke
> 
> hi
> 
> We are indexing some Chinese text (using the following OutputStreamWriter
> with UTF-8 encoding).
> 
> OutputStreamWriter outputFileWriter  = new OutputStreamWriter(new
> FileOutputStream(outputFile), "utf8");
> 
> We are trying to inspect the index in Luke 3.4.0 (have chosen the UTF-8
> option in Luke), but it seems to be garbled. Any advice would be appreciated
> 
> thank you
> 
> Peyman



inspecting chinese index using luke

Posted by Peyman Faratin <pe...@robustlinks.com>.

hi

We are indexing some Chinese text (using the following OutputStreamWriter with UTF-8 encoding).

OutputStreamWriter outputFileWriter  = new OutputStreamWriter(new FileOutputStream(outputFile), "utf8");

We are trying to inspect the index in Luke 3.4.0 (have chosen the UTF-8 option in Luke), but it seems to be garbled. Any advice would be appreciated.

thank you

Peyman
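
A minimal sketch, assuming it helps to first rule out the writing step as the source of the garbling: it writes a short Chinese string through an OutputStreamWriter with an explicit UTF-8 charset, reads the raw bytes back, and compares. The class and method names (Utf8RoundTrip, roundTrips) are hypothetical, and the code touches neither Lucene nor Luke.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Lucene or Luke.
final class Utf8RoundTrip {

    // Writes the text to a temp file as UTF-8, reads it back as UTF-8,
    // and reports whether the text survived unchanged.
    static boolean roundTrips(String text) {
        try {
            Path file = Files.createTempFile("utf8-check", ".txt");
            try {
                try (Writer writer = new OutputStreamWriter(
                        new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
                    writer.write(text);
                }
                String readBack =
                        new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                return text.equals(readBack);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        System.out.println(roundTrips("\u4e2d\u6587\u7d22\u5f15"));
    }
}
```

If the round trip succeeds, the writer is probably not where the text gets garbled, and the analyzer, the font, or the viewer settings would be the next things to check.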

RE: Lucene 4.0 questions, was: shift bug in possibly invalid use of NumericTokenStream

Posted by Uwe Schindler <uw...@thetaphi.de>.

> Hi Uwe,
> 
> thanks for the talk suggestion(s)*.
> 
> I was using it for faster term lookups of a long 'id'. How would this be
> done with 4.0? Before I did it via Term:
> 
> new Term(fieldName, NumericUtils.longToPrefixCoded(longValue));

If you want to query on a single numeric term value, use
NumericRangeQuery.newLongRange(field, ..., value, value, true, true); this
rewrites to a simple TermQuery.

Otherwise you have to create a BytesRef() object:

final BytesRef bytes = new BytesRef(); // for reuse!
NumericUtils.longToPrefixCoded(longValue, 0, bytes); // 0 is the shift value
final Term term = new Term(fieldName, bytes);
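
To illustrate why a long can live in a binary term at all, here is a stdlib-only sketch. It is not the prefix-coded layout that NumericUtils produces; it only shows the general idea of flipping the sign bit and writing the bits big-endian, so that unsigned byte-wise comparison agrees with numeric order. The class SortableLongBytes and its methods are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical helper, not Lucene's NumericUtils and not its on-disk format.
final class SortableLongBytes {

    // Flip the sign bit so negative values sort before positive ones,
    // then write the 64 bits in big-endian order.
    static byte[] encode(long value) {
        long sortable = value ^ 0x8000000000000000L;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) sortable;
            sortable >>>= 8;
        }
        return out;
    }

    // Inverse of encode: rebuild the 64 bits, then flip the sign bit back.
    static long decode(byte[] bytes) {
        long sortable = 0L;
        for (int i = 0; i < 8; i++) {
            sortable = (sortable << 8) | (bytes[i] & 0xFFL);
        }
        return sortable ^ 0x8000000000000000L;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        long[] ids = {-5L, 0L, 42L, Long.MAX_VALUE};
        for (long id : ids) {
            System.out.println(id + " -> " + Arrays.toString(encode(id)));
        }
    }
}
```

With an encoding of this kind, an exact lookup of an id becomes an ordinary byte-wise term comparison, which is the property the BytesRef-based Term above relies on.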

> How should I generally do "term lookup" in 4.0, as you said in the video
> that 'Term' gets removed at some point :)? What is the most recommended way
> and what is the fastest? Or where can I find "most recent" code in the
> Lucene tests to be used as an example?

Term lookup can be done by field and BytesRef: get a TermsEnum for the field
and seek to the BytesRef. For strings you can create a UTF-8 encoded
BytesRef using new BytesRef(CharSequence). If you need docFreq, ask the
IndexReader with the field name and BytesRef. And so on, it's always the same
:-)
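
The lookup pattern described above can be sketched without Lucene. The following stand-in keeps the terms of a single field as UTF-8 byte arrays in unsigned byte order and supports an exact docFreq lookup plus a "seek to the first term at or after the target". All names (TinyTermDictionary, utf8, addDocument, docFreq, seekCeil) are hypothetical, and this is only an assumption about the shape of the idea, not Lucene's TermsEnum or IndexReader API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical stand-in for the term dictionary of one field; not a Lucene class.
final class TinyTermDictionary {

    // Terms are ordered by unsigned byte value, shorter prefixes first.
    private static final Comparator<byte[]> UNSIGNED_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    private final TreeMap<byte[], Integer> docFreqByTerm = new TreeMap<>(UNSIGNED_ORDER);

    // String terms become binary terms by UTF-8 encoding them.
    static byte[] utf8(CharSequence text) {
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Records that one more document contains the given term.
    void addDocument(byte[] term) {
        docFreqByTerm.merge(term, 1, Integer::sum);
    }

    // Exact lookup: number of recorded documents for the term, 0 if absent.
    int docFreq(byte[] term) {
        Integer freq = docFreqByTerm.get(term);
        return freq == null ? 0 : freq;
    }

    // Seek: the first term that is >= target in byte order, or null if none.
    byte[] seekCeil(byte[] target) {
        return docFreqByTerm.ceilingKey(target);
    }
}
```

The point of the sketch is that string and numeric terms are handled identically once both are plain bytes; only the encoding step differs.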


Re: Lucene 4.0 questions, was: shift bug in possibly invalid use of NumericTokenStream

Posted by Simon Willnauer <si...@googlemail.com>.

On Mon, Dec 19, 2011 at 9:04 PM, Simon Willnauer
<si...@googlemail.com> wrote:
> On Mon, Dec 19, 2011 at 5:03 PM, Peter Karich <pe...@yahoo.de> wrote:
>> Hi Uwe,
>>
>> thanks for the talk suggestion(s)*.
>>
>> I was using it for faster term lookups of a long 'id'. How would this be
>> done with 4.0? Before I did it via Term:
>>
>> new Term(fieldName, NumericUtils.longToPrefixCoded(longValue));
>>
>> How should I generally do "term lookup" in 4.0, as you said in the video
>> that 'Term' gets removed at some point :)? What is the most recommended way
>> and what is the fastest? Or where can I find "most recent" code in
>> lucene tests to be used as an example?
>>
>> I also heard the suggestion to use the pulsing codec for id retrieval**.
>> Is this the correct way nowadays to achieve this:
>>
>> indexWriterCfg.setCodec(new Lucene40Codec() {
>>   @Override public PostingsFormat getPostingsFormatForField(String field) {
>>       if("_id".equals(field)) return new Pulsing40PostingsFormat();
>>       else ?
>>   }});
>
> do something like this:
>
>  public static final class CustomPerFieldCodec extends Lucene40Codec {
>    private final PostingsFormat pulsing = PostingsFormat.forName("Pulsing40");
>    private final PostingsFormat defaultFormat =
>        PostingsFormat.forName("Lucene40");
>
>    @Override
>    public PostingsFormat getPostingsFormatForField(String field) {
>      if (field.equals("id")) {
>        return pulsing;
>      } else {
>        return defaultFormat;
>      }
>    }
>  }
>
> simon

Actually, if you are looking for fast ID lookups you could consider using
the Memory PostingsFormat. This keeps everything in memory and should be
the fastest alternative, but it is costly in terms of RAM.

private final PostingsFormat memory = PostingsFormat.forName("Memory");

simon



Re: Lucene 4.0 questions, was: shift bug in possibly invalid use of NumericTokenStream

Posted by Simon Willnauer <si...@googlemail.com>.

On Mon, Dec 19, 2011 at 5:03 PM, Peter Karich <pe...@yahoo.de> wrote:
> Hi Uwe,
>
> thanks for the talk suggestion(s)*.
>
> I was using it for faster term lookups of a long 'id'. How would this be
> done with 4.0? Before I did it via Term:
>
> new Term(fieldName, NumericUtils.longToPrefixCoded(longValue));
>
> How should I generally do "term lookup" in 4.0, as you said in the video
> that 'Term' gets removed at some point :)? What is the most recommended way
> and what is the fastest? Or where can I find "most recent" code in
> lucene tests to be used as an example?
>
> I also heard the suggestion to use the pulsing codec for id retrieval**.
> Is this the correct way nowadays to achieve this:
>
> indexWriterCfg.setCodec(new Lucene40Codec() {
>   @Override public PostingsFormat getPostingsFormatForField(String field) {
>       if("_id".equals(field)) return new Pulsing40PostingsFormat();
>       else ?
>   }});

do something like this:

  public static final class CustomPerFieldCodec extends Lucene40Codec {
    private final PostingsFormat pulsing = PostingsFormat.forName("Pulsing40");
    private final PostingsFormat defaultFormat =
        PostingsFormat.forName("Lucene40");

    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if (field.equals("id")) {
        return pulsing;
      } else {
        return defaultFormat;
      }
    }
  }

simon
