Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2009/06/22 15:04:00 UTC

Adding IndexOutput.writeByte(byte b, int length)

I'm testing the performance of some indexing code and noticed that
NormsWriter.flush() calls IndexOutput.writeByte(defaultNorm) in a loop,
writing the same norm every time (lines: 139-140, 157-158, 162-163).

In the run where I spotted it, this happens a few thousand times (I mean a few
thousand writeByte calls).

I was thinking that if we had writeByte(byte b, int length) in IndexOutput,
we could call it once and handle it efficiently where possible. For
back-compat, the default impl would just loop and call writeByte(b), but
others, like BufferedIndexOutput, could fill the buffer array with b,
length times. We wouldn't get to use System.arraycopy, which is faster, but
we wouldn't call writeByte thousands of times either.
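
A minimal sketch of the idea above, in Java. SketchOutput and
BufferedSketchOutput are hypothetical stand-ins, not Lucene's IndexOutput or
BufferedIndexOutput, and the in-memory sink exists only to keep the example
self-contained. The base class loops for back-compat, while the buffered
variant fills its buffer in chunks with Arrays.fill:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical stand-in for IndexOutput (not the real Lucene class).
abstract class SketchOutput {
    abstract void writeByte(byte b);

    // Default implementation: plain loop, so existing subclasses keep working.
    void writeByte(byte b, int length) {
        for (int i = 0; i < length; i++) {
            writeByte(b);
        }
    }
}

// Hypothetical stand-in for BufferedIndexOutput.
class BufferedSketchOutput extends SketchOutput {
    private final byte[] buffer;
    private int position = 0;
    // Assumption: an in-memory sink replaces the real file for this sketch.
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    BufferedSketchOutput(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        buffer = new byte[bufferSize];
    }

    @Override
    void writeByte(byte b) {
        if (position == buffer.length) {
            flush();
        }
        buffer[position++] = b;
    }

    // Fills the buffer one chunk at a time instead of one byte per call.
    @Override
    void writeByte(byte b, int length) {
        while (length > 0) {
            if (position == buffer.length) {
                flush();
            }
            int chunk = Math.min(length, buffer.length - position);
            Arrays.fill(buffer, position, position + chunk, b);
            position += chunk;
            length -= chunk;
        }
    }

    // Moves the buffered bytes to the sink and resets the buffer.
    void flush() {
        sink.write(buffer, 0, position);
        position = 0;
    }

    // Flushes and returns everything written so far.
    byte[] toByteArray() {
        flush();
        return sink.toByteArray();
    }
}
```

With an 8-byte buffer, new BufferedSketchOutput(8).writeByte((byte) 7, 20)
reaches the buffer in three Arrays.fill calls instead of 20 single-byte writes.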

What do you think?

Shai

Re: Adding IndexOutput.writeByte(byte b, int length)

Posted by Shai Erera <se...@gmail.com>.

I indexed 200K docs with fields indexed as ANALYZED (which includes norms),
but the fields were sparse. The "holes" I saw were in the thousands (sometimes
even 80K). Now that I understand this better, I realize that this particular
indexing code is incorrect, and I should have disabled norms. After I did
that, performance really improved.

So, given that what I measured came from buggy indexing code, this fix is not
needed. And I guess large "holes" really represent a bug rather than a common
scenario. So I take this proposal back :).

The code I've used is from benchmark, TrecContentSource, which takes all the
<meta> tags from the HTML files and puts them as properties on DocData, and
DocMaker later on adds them to the Document. That's what created the
sparseness. I think I'm going to add two things to benchmark:
1. Add a doc.tokenized.norms property; if it is set to false, fields will be
indexed with Index.ANALYZED_NO_NORMS or Index.NOT_ANALYZED_NO_NORMS.
2. Add a keep.properties attribute to TrecContentSource which, if set to
false, will set DocData.props to null. I think for TREC, it really doesn't
make sense to index all the <meta> tags.
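
The two proposed settings could be read roughly as sketched below. This is
only an illustration under assumptions: IndexMode is a local stand-in for
Lucene's Field.Index constants, the class and method names are hypothetical,
and treating both properties as true by default is a guess, since the message
does not state defaults.

```java
import java.util.Properties;

// Local stand-in for the Lucene Field.Index constants discussed here.
enum IndexMode { ANALYZED, NOT_ANALYZED, ANALYZED_NO_NORMS, NOT_ANALYZED_NO_NORMS }

// Hypothetical helper; not part of the benchmark package.
class BenchmarkPropsSketch {

    // Picks the index mode for a field from the proposed doc.tokenized.norms
    // property. Assumption: norms stay enabled when the property is absent.
    static IndexMode chooseIndexMode(Properties config, boolean tokenized) {
        boolean norms =
            Boolean.parseBoolean(config.getProperty("doc.tokenized.norms", "true"));
        if (tokenized) {
            return norms ? IndexMode.ANALYZED : IndexMode.ANALYZED_NO_NORMS;
        }
        return norms ? IndexMode.NOT_ANALYZED : IndexMode.NOT_ANALYZED_NO_NORMS;
    }

    // Applies the proposed keep.properties attribute: returns null when the
    // collected <meta> properties should be dropped.
    // Assumption: properties are kept when the attribute is absent.
    static Properties filterProps(Properties config, Properties docProps) {
        boolean keep =
            Boolean.parseBoolean(config.getProperty("keep.properties", "true"));
        return keep ? docProps : null;
    }
}
```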

Shai

On Mon, Jun 22, 2009 at 5:10 PM, Michael McCandless <lucene@mikemccandless.com> wrote:

> This code isn't invoked that often, I believe.  It only happens when
> there are "holes" in the norms between docs, i.e., you have a field that
> has norms enabled (at least one Document had this Field w/ norms
> enabled in the past), but then you had a series of Docs that had
> disabled norms for the field and so you must fill the hole (since
> norms aren't sparse).
>
> So I think in practice it won't help much?  (And, writing long series
> of the same byte is something in general we shouldn't "try" to do ;)
> So I'm not sure I want a public API "inviting" it).
>
> Mike
>

Re: Adding IndexOutput.writeByte(byte b, int length)

Posted by Michael McCandless <lu...@mikemccandless.com>.

This code isn't invoked that often, I believe.  It only happens when
there are "holes" in the norms between docs, i.e., you have a field that
has norms enabled (at least one Document had this Field w/ norms
enabled in the past), but then you had a series of Docs that had
disabled norms for the field and so you must fill the hole (since
norms aren't sparse).
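
To illustrate why sparse fields produce long runs of the same byte, here is a
small sketch with hypothetical names (NormHolesSketch, denseNorms,
longestHole); it is not the NormsWriter code. It assumes one norm byte per
document up to maxDoc, with every document that lacks a norm for the field
receiving the default byte:

```java
import java.util.Arrays;

// Hypothetical illustration of dense norms; names are not from Lucene.
class NormHolesSketch {

    // Builds the dense per-document array: documents listed in docIds get
    // their own norm, all other documents get defaultNorm.
    static byte[] denseNorms(int maxDoc, int[] docIds, byte[] norms, byte defaultNorm) {
        byte[] dense = new byte[maxDoc];
        Arrays.fill(dense, defaultNorm);
        for (int i = 0; i < docIds.length; i++) {
            dense[docIds[i]] = norms[i];
        }
        return dense;
    }

    // Returns the longest run of consecutive documents without a norm.
    // sortedDocIds must be ascending, unique, and within [0, maxDoc).
    static int longestHole(int maxDoc, int[] sortedDocIds) {
        int longest = 0;
        int next = 0; // first document not yet accounted for
        for (int docId : sortedDocIds) {
            longest = Math.max(longest, docId - next);
            next = docId + 1;
        }
        return Math.max(longest, maxDoc - next);
    }
}
```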

So I think in practice it won't help much?  (And, writing long series
of the same byte is something in general we shouldn't "try" to do ;)
So I'm not sure I want a public API "inviting" it).

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org