You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Nadav Har'El (JIRA)" <ji...@apache.org> on 2009/09/08 15:32:57 UTC

[jira] Created: (LUCENE-1899) Inefficient growth of OpenBitSet

Inefficient growth of OpenBitSet
--------------------------------

                 Key: LUCENE-1899
                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
             Project: Lucene - Java
          Issue Type: Bug
          Components: Store
    Affects Versions: 2.9
            Reporter: Nadav Har'El
            Priority: Minor


Hi, I found a potentially serious efficiency problem with OpenBitSet.

One typical (I think) way to build a bit set is to set() the bits one by one -
e.g., have a HitCollector set() the bit for each matching document.
The underlying array of longs needs to grow as more as more bits are set, of
course.

But looking at the code, it appears to me that the array grows very
ineefficiently - in the worst case (when doc ids are sorted, as they would
normally be in the HitCollector case for example), copying the array again
and again for every added bit... The relevant code in OpenBitSet.java is:

  public void set(long index) {
    int wordNum = expandingWordNum(index);
    ...
  }

  protected int expandingWordNum(long index) {
    int wordNum = (int)(index >> 6);
    if (wordNum>=wlen) {
      ensureCapacity(index+1);
    ...
  }
  public void ensureCapacityWords(int numWords) {
    if (bits.length < numWords) {
      long[] newBits = new long[numWords];
      System.arraycopy(bits,0,newBits,0,wlen);
      bits = newBits;
    }
  }

As you can see, if the bits array is not long enough, a new one is
allocated at exactly the right size - and in the worst case it can grow
just one word every time...

Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
of index+1 and twice the existing size?

Alternatively, if the growth is so inefficient, this should be documented,
and it should be recommended to use the variant of the constructor with the
correct initial size (e.g., in the HitCollector case, the number of documents
in the index). and the fastSet() method instead of set().

Thanks,
Nadav.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752994#action_12752994 ] 

Michael McCandless commented on LUCENE-1899:
--------------------------------------------

bq. This formula ensures that at worst case, just 6% of the array space is wasted

You mean 12.5% right?

bq. I just wonder what is the rationale behind the specific formula in this function

It's just a standard time/space tradeoff, that favors not wasting too much space.  This code was "borrowed" from Python's "listobject.c" sources, ie, it governs how Python over-allocates the storage for its list type.

We could explore different constants though I'd be nervous about making this value much higher.  Often the consumer of this API will see rapid growth initially, and then the collection stops growing and is re-used for a long time, in which case the long-term wasted RAM is (I think) more important than the one-time short-term CPU cost of finding the "natural" size.

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Priority: Minor
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1899:
---------------------------------------

    Attachment: LUCENE-1899.patch

Attached trivial patch.  I plan to commit soon...

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1899.patch
>
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752532#action_12752532 ] 

Michael McCandless commented on LUCENE-1899:
--------------------------------------------

bq. There's ArrayUtil.getNextSize (a Lucene class) which seems to grow arrays in a mild fashion. the method is well documented, and I think it should be used by ensureCapacityWords.

+1

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Priority: Minor
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Nadav Har'El (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752931#action_12752931 ] 

Nadav Har'El commented on LUCENE-1899:
--------------------------------------

Hi Shai, I guess you're right that if there's such a utility function, we should probably use it.
I just wonder what is the rationale behind the specific formula in this function - basically newsize = oldsize * 1.125 + 6. This formula ensures that at worst case, just 6% of the array space is wasted (instead of 50% in the doubling approach), but the number of reallocations and copies is 8 times higher, and performance is proportionally slower (although obviously, both are linear in amortized time - which the current code isn't). Was there any thought given to why the factor 0.125 is better than 0.25, 0.5, 0.01 or 1.0? I'm not saying that 1.0 (doubling) is best, just that I don't know why 0.125 is.

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Priority: Minor
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Assigned: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1899:
------------------------------------------

    Assignee: Michael McCandless

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Assignee: Michael McCandless
>            Priority: Minor
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Nadav Har'El (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753019#action_12753019 ] 

Nadav Har'El commented on LUCENE-1899:
--------------------------------------

Yes, you're right, 12.5%. Or actually, 11%=0.125/(1+0.125) of the space after an elargment is wasted. I don't know where I got this 6% from ;-)

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1899.patch
>
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753048#action_12753048 ] 

Uwe Schindler commented on LUCENE-1899:
---------------------------------------

I wanted to add just one comment:
For nomal Lucene usage, the auto-grow of the array is not needed. All internal Lucene code (collectors, filters) use IndexReader.maxDoc() as initial size param to the ctor. If you write own HitCollectors/Collectors(2.9), use the maxDoc() of the current IndexReader. With the new 2.9 Collectors, you would initialize the OpenBitSet in Collector.setNextReader().

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1899.patch
>
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752518#action_12752518 ] 

Shai Erera commented on LUCENE-1899:
------------------------------------

I don't know about doubling the array size every time. There's ArrayUtil.getNextSize (a Lucene class) which seems to grow arrays in a mild fashion. the method is well documented, and I think it should be used by ensureCapacityWords.

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Priority: Minor
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1899:
---------------------------------------

    Fix Version/s: 2.9

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-1899) Inefficient growth of OpenBitSet

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1899.
----------------------------------------

    Resolution: Fixed

Thanks Nadav!

> Inefficient growth of OpenBitSet
> --------------------------------
>
>                 Key: LUCENE-1899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1899
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.9
>            Reporter: Nadav Har'El
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1899.patch
>
>
> Hi, I found a potentially serious efficiency problem with OpenBitSet.
> One typical (I think) way to build a bit set is to set() the bits one by one -
> e.g., have a HitCollector set() the bit for each matching document.
> The underlying array of longs needs to grow as more as more bits are set, of
> course.
> But looking at the code, it appears to me that the array grows very
> ineefficiently - in the worst case (when doc ids are sorted, as they would
> normally be in the HitCollector case for example), copying the array again
> and again for every added bit... The relevant code in OpenBitSet.java is:
>   public void set(long index) {
>     int wordNum = expandingWordNum(index);
>     ...
>   }
>   protected int expandingWordNum(long index) {
>     int wordNum = (int)(index >> 6);
>     if (wordNum>=wlen) {
>       ensureCapacity(index+1);
>     ...
>   }
>   public void ensureCapacityWords(int numWords) {
>     if (bits.length < numWords) {
>       long[] newBits = new long[numWords];
>       System.arraycopy(bits,0,newBits,0,wlen);
>       bits = newBits;
>     }
>   }
> As you can see, if the bits array is not long enough, a new one is
> allocated at exactly the right size - and in the worst case it can grow
> just one word every time...
> Shouldn't the growth be more exponential in nature, e.g., grow to the maximum
> of index+1 and twice the existing size?
> Alternatively, if the growth is so inefficient, this should be documented,
> and it should be recommended to use the variant of the constructor with the
> correct initial size (e.g., in the HitCollector case, the number of documents
> in the index). and the fastSet() method instead of set().
> Thanks,
> Nadav.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org