Posted to dev@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2010/04/17 17:43:19 UTC

Re: Fix to contrib/misc/HighFreqTerms.java

Ahh you're right!

Though, really, we should not be converting to String (flex terms in
general are an arbitrary byte[], not necessarily utf8).  We should
just use a BytesRef directly in the key.

Can you open an issue for this Tom?  Thanks!

Mike
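
To illustrate why converting term bytes to String is lossy, here is a minimal JDK-only sketch. It is not Lucene code: a plain byte[] stands in for BytesRef so the example runs without Lucene on the classpath, and the class and method names (LossyTermKeyDemo, distinctStringKeys, distinctByteKeys) are hypothetical. Two different terms that are not valid UTF-8 collapse into the same String key, while keying on the raw bytes keeps them apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LossyTermKeyDemo {

    // Lenient decode: malformed input is replaced with U+FFFD.
    static String lenientString(byte[] termBytes) {
        return new String(termBytes, StandardCharsets.UTF_8);
    }

    // Number of distinct keys when terms are keyed by their decoded String.
    static int distinctStringKeys(byte[][] terms) {
        Map<String, Integer> seen = new HashMap<>();
        for (byte[] t : terms) {
            seen.merge(lenientString(t), 1, Integer::sum);
        }
        return seen.size();
    }

    // Number of distinct keys when terms are keyed by their raw bytes.
    // ByteBuffer.wrap provides content-based equals/hashCode over the array.
    static int distinctByteKeys(byte[][] terms) {
        Map<ByteBuffer, Integer> seen = new HashMap<>();
        for (byte[] t : terms) {
            seen.merge(ByteBuffer.wrap(t), 1, Integer::sum);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        byte[][] terms = {
            {(byte) 0xFF},  // not valid UTF-8
            {(byte) 0xFE},  // also not valid UTF-8, but a different term
        };
        System.out.println("String keys: " + distinctStringKeys(terms));
        System.out.println("byte keys:   " + distinctByteKeys(terms));
    }
}
```

Run as written, this reports one String key and two byte keys: both invalid sequences decode to the same replacement character, so the String-keyed map cannot tell the terms apart.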

On Fri, Apr 16, 2010 at 2:41 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hi Mike,
>
> Thanks for making the fix and changing the display from bytes to utf8.  It needs a very minor change:
> The latest fix converts to utf8 if you give a field argument on the command line but still shows bytes if you don't.
>
> Line 89 should parallel line 70 and use term.utf8ToString() instead of term.toString():
>
> 70       tiq.insertWithOverflow(new TermInfo(new Term(field, term.utf8ToString()), termsEnum.docFreq()));
> 89       tiq.insertWithOverflow(new TermInfo(new Term(field, term.toString()), terms.docFreq()));
>
> Tom
>
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Wednesday, April 14, 2010 3:50 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Bug in contrib/misc/HighFreqTerms.java?
>
> OK I committed the fix.  I ran it on a flex wikipedia index I had...
> it produces output like this:
>
> body:[3c 21 2d 2d] 509050
> body:[73 68 6f 75 6c 64] 515495
> body:[74 68 65 6e] 525176
> body:[74 69 74 6c 65] 525361
> body:[5b 5b 55 6e 69 74 65 64] 532586
> body:[6b 6e 6f 77 6e] 533558
> body:[75 6e 64 65 72] 536480
> body:[55 6e 69 74 65 64] 543746
>
> Which is not very readable, but, it does this because flex terms are
> arbitrary byte[], not necessarily utf8... maybe we should fix it to
> print both hex and String if we assume bytes are utf8?
>
> Mike
>
> On Wed, Apr 14, 2010 at 3:25 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> Ugh, I'll fix this.
>>
>> With the new flex API, you can't ask a composite (Multi/DirReader) for
>> its postings -- you have to go through the static methods on
>> MultiFields.  I'm trying to put some distance between IndexReader and
>> composite readers... because I'd like to eventually deprecate them.
>> I.e., the composite readers should "hold" an ordered collection of
>> sub-readers, but should not themselves implement IndexReader's API, I
>> think.
>>
>> Thanks for raising this Tom,
>>
>> Mike
>>
>> On Wed, Apr 14, 2010 at 2:14 PM, Burton-West, Tom <tb...@umich.edu> wrote:
>>> When I try to run HighFreqTerms.java in Lucene Revision: 933722  I get the
>>> exception appended below.  I believe the line of code involved is a
>>> result of the flex indexing merge. Should I post this as a comment to
>>> LUCENE-2370 (Reintegrate flex branch into trunk)?
>>>
>>> Or is there simply something wrong with my configuration?
>>>
>>> Exception in thread "main" java.lang.UnsupportedOperationException: please
>>> use MultiFields.getFields if you really need a top level Fields (NOTE that
>>> it's usually better to work per segment instead)
>>>         at
>>> org.apache.lucene.index.DirectoryReader.fields(DirectoryReader.java:762)
>>>         at org.apache.lucene.misc.HighFreqTerms.main(HighFreqTerms.java:71)
>>>
>>> Tom Burton-West
>>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
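
The suggestion quoted above, to print both hex and a String when the bytes happen to be UTF-8, could be sketched roughly as follows. This is a JDK-only illustration, not the committed fix: the class TermDisplayDemo and its methods are hypothetical, a byte[] stands in for the term's BytesRef, and the docFreq of 1 for the invalid term is made up for the example. The strict decoder reports malformed input instead of silently substituting a replacement character, so the String is printed only when the bytes really are valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class TermDisplayDemo {

    // Render bytes as "[74 68 65 6e]", the same shape as the output quoted above.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.append(']').toString();
    }

    // Strict decode: returns null when the bytes are not valid UTF-8.
    static String utf8OrNull(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Always show the hex form; append the text form only when it decodes cleanly.
    static String describe(String field, byte[] term, int docFreq) {
        String text = utf8OrNull(term);
        String shown = (text == null) ? toHex(term) : toHex(term) + " \"" + text + "\"";
        return field + ":" + shown + " " + docFreq;
    }

    public static void main(String[] args) {
        byte[] then = {0x74, 0x68, 0x65, 0x6e};
        System.out.println(describe("body", then, 525176));
        // Hypothetical non-UTF-8 term with a made-up docFreq.
        System.out.println(describe("body", new byte[] {(byte) 0xFF}, 1));
    }
}
```

For the first term this prints body:[74 68 65 6e] "then" 525176; for the invalid term only the hex form is shown.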

Re: Fix to contrib/misc/HighFreqTerms.java

Posted by Michael McCandless <lu...@mikemccandless.com>.
Thanks Tom!

On Mon, Apr 19, 2010 at 4:27 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Ok opened LUCENE-2403.
>
> I could make the change to make the two lines consistent, but to use a BytesRef directly, wouldn't Term.java need to use BytesRef instead of String? Or is there a new flex "Term" class that takes a BytesRef?

There is no Term class that takes a BytesRef... other things need
this, too (eg TermQuery needs to accept a BytesRef).  But we are
considering deprecating Term entirely (it's not used in that many
other places).

> Otherwise, TermInfo could change to use the name of the field and a BytesRef instead of a term.

+1 -- I think we should take this approach?

Mike
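
As a rough sketch of that approach (TermInfo holding the field name and the term bytes rather than a Term), the following JDK-only example may help. Everything here is an assumption for illustration: the TermInfo shape, the TopTerms class, and the use of java.util.PriorityQueue as a stand-in for the queue behind tiq.insertWithOverflow. A byte[] again stands in for BytesRef, and the bytes are copied on construction on the assumption that whatever supplies them may reuse its buffer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopTermsDemo {

    // Hypothetical TermInfo: field name plus raw term bytes, no Term and no String.
    static final class TermInfo {
        final String field;
        final byte[] termBytes;  // stand-in for a copied BytesRef
        final int docFreq;

        TermInfo(String field, byte[] termBytes, int docFreq) {
            this.field = field;
            this.termBytes = termBytes.clone();  // defensive copy of the caller's buffer
            this.docFreq = docFreq;
        }
    }

    // Bounded min-heap on docFreq: keeps only the maxSize most frequent terms.
    static final class TopTerms {
        private final int maxSize;
        private final PriorityQueue<TermInfo> heap =
                new PriorityQueue<>(Comparator.comparingInt((TermInfo t) -> t.docFreq));

        TopTerms(int maxSize) {
            this.maxSize = maxSize;
        }

        void insertWithOverflow(TermInfo info) {
            heap.add(info);
            if (heap.size() > maxSize) {
                heap.poll();  // drop the least frequent entry
            }
        }

        List<TermInfo> mostFrequentFirst() {
            List<TermInfo> out = new ArrayList<>(heap);
            out.sort(Comparator.comparingInt((TermInfo t) -> t.docFreq).reversed());
            return out;
        }
    }

    public static void main(String[] args) {
        TopTerms top = new TopTerms(2);
        // Terms and counts taken from the sample output earlier in the thread.
        top.insertWithOverflow(new TermInfo("body", "then".getBytes(StandardCharsets.UTF_8), 525176));
        top.insertWithOverflow(new TermInfo("body", "title".getBytes(StandardCharsets.UTF_8), 525361));
        top.insertWithOverflow(new TermInfo("body", "known".getBytes(StandardCharsets.UTF_8), 533558));
        top.insertWithOverflow(new TermInfo("body", "under".getBytes(StandardCharsets.UTF_8), 536480));
        for (TermInfo info : top.mostFrequentFirst()) {
            System.out.println(info.field + " " + info.termBytes.length + " bytes " + info.docFreq);
        }
    }
}
```

With this shape the String conversion happens only at display time, if at all, so terms that are not valid UTF-8 are never mangled on their way into the queue.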

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: Fix to contrib/misc/HighFreqTerms.java

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Ok opened LUCENE-2403.

I could make the change to make the two lines consistent, but to use a BytesRef directly, wouldn't Term.java need to use BytesRef instead of String? Or is there a new flex "Term" class that takes a BytesRef?

Otherwise, TermInfo could change to use the name of the field and a BytesRef instead of a term.

Tom
