You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@harmony.apache.org by Catherine Hope <ca...@googlemail.com> on 2010/08/04 14:07:50 UTC

Re: [classlib][luni] Collections classes - code reviews and optimisations

I did some analysis on what API methods access others between the RI and
Harmony (by subclassing ArrayList and adding some trace) to answer a couple
of the review points.  The differences I found were:
 - RI contains(Object) calls indexOf(Object), so we could also do this to
reduce the code duplication
 - RI add(Object), add(int, Object), addAll(Collection), addAll(int,
Collection) use ensureCapacity(int)
 - RI remove(Object) doesn't reference indexOf(Object) and remove(int) -
though I don't think this would save anything if we changed it

Cath

On Tue, Aug 3, 2010 at 11:45 PM, Mark Hindess
<ma...@googlemail.com>wrote:

>
> In message <4C...@googlemail.com>, Oliver Deakin writes:
> >
> >   Hi all,
> >
> > A little while ago Mark suggested (not sure if it was in person or
> > on the list) that it would be a good idea to take a look at our
> > Collections classes and do a kind of code review, thinking about
> > improvements and possible optimisations.
> >
> > To that end Mark Hindess, Cath Hope and myself sat down and took a
> > close look at ArrayList earlier today. We got up to the growAtEnd()
> > method, annotating the code with questions and comments as we went.
> > I've opened HARMONY-6607 [1] and attached the diff with our comments
> > in so everyone can see what we've come up with so far.
> >
> > If you have any further comments or ideas for the code (including
> > patches for improvements) please add to the discussion here - it would
> > be great to have some extra input!
>
> We had another chat today and got a bit further.  I've updated the
> review patch on JIRA.
>
> I think there was general consensus that we should move all the
> exceptions to the beginning of methods.  (And in the case of removeRange
> fix the exception to be meaningful.)  I'll probably look at doing this
> shortly unless someone beats me too it as it was really annoying when
> reading the code.
>
> We should also consider tracking firstIndex and size rather than firstIndex
> and lastIndex because:
>
>  * Nearly every method uses size - most calculate it but many of them
>   "forget" to use it in all cases.
>
>  * Fewer methods use lastIndex.
>
>  * size() is the most commonly called method making it simpler would be
>   a useful side-effect
>
>  * the resulting serialization code might be simpler
>
> I'm still puzzled as to whether the fact that the call to growForInsert
> in addAll always grows by the size of the entire collection (rather than
> size - free space) is intentional or a bug.  The increment calculation
> should mean we always have sufficient free space even if we only ask for
> the smaller (strictly necessary) amount.  So I suspect it is a bug?
>
> Regards,
>  Mark.
>
> > [1] https://issues.apache.org/jira/browse/HARMONY-6607
>

Re: [classlib][luni] Collections classes - code reviews and optimisations

Posted by Catherine Hope <ca...@googlemail.com>.
I've done some analysis of the size of the backing arrays that are
constructed by the RI, by serializing ArrayLists on the RI and then reading
them back in on Harmony (the serialization spec saves the length of the
backing array).
The RI's grow heuristics differ from Harmony in that there's not a minimum
grow size - adding an element to a zero capacity ArrayList on the RI grows
it by just one element, whereas on Harmony the ArrayList will be grown by a
minimum of 12 elements.  So this may have an impact if lots of ArrayLists
are constructed that only add a few elements.  From some experimentation the
growth rate on the RI in the general case is to grow by the larger of the
size of Collection or 0.5*capacity (and then add 1 if it's the
0.5*capacity!), so:
>create an ArrayList and call addAll with 1000 element Collection
  size is 1000, capacity is 1000
>then call addAll with 1000 element Collection
  size is 2000, capacity is 2000
>then call addAll with 50 element Collection
  size is 2050, capacity is 3001
>then call addAll with 1000 element Collection
  size is 3050, capacity is 4502

On Harmony the growth rate is the larger of the size of the Collection or
0.5*size, so:
>create an ArrayList and call addAll with 1000 element Collection
  size is 1000, capacity is 1000
>then call addAll with 1000 element Collection
  size is 2000, capacity is 2000
>then call addAll with 50 element Collection
  size is 2050, capacity is 3000
>then call addAll with 1000 element Collection
  size is 3050, capacity is 3075

Therefore the arrays will grow at a slower rate on Harmony compared to the
RI.

I notice a difference between the Java 5 and Java 6 spec for ArrayList in
that the requirement that an ArrayList created from a Collection should have
capacity 110% of the Collection size has been removed, so maybe we should
also remove this from the Java 6 branch of ArrayList.  This seems to be the
only change between the specs.

I also noticed going through the spec that the doc for the add and addAll
methods that take a location state that all subsequent elements should be
shifted to the right and therefore space should only be added onto the end
of the array, and similarly for remove with a location.  Currently we try
and use both ends of the array, so this is against the spec, but it isn't
something that can be tested for (since the backing array isn't accessible
and even when it's serialized only the length of the array is stored and not
the index of the first element).

Cath

Re: [classlib][luni] Collections classes - code reviews and optimisations

Posted by Prashanth K S <pr...@gmail.com>.
I felt the same with the growth factor. We have a field called increment in
the grow methods which is set to 12 in most of the cases, attributing to the
excessive growth. Not sure how/why 12 was chosen.

-- 
Regards
Prashanth


On Wed, Aug 4, 2010 at 9:25 PM, Oliver Deakin
<ol...@googlemail.com>wrote:

>  On 04/08/2010 15:56, Mark Hindess wrote:
>
>> I've now refactored the exception code to the start of the methods and
>> shuffled some of the if else cases around a little.
>>
>> Next I plan to look at:
>>
>> 1) remove the following from addAll(...):
>>
>>          if (this == collection) {
>>              collection = (ArrayList)clone();
>>          }
>>
>>    since it is immediately followed by a call to collection.toArray() so
>>    it should be unnecessary.
>>
>> 2) Change the fields to firstIndex and size.
>>
>> In message<AA...@mail.gmail.com>
>> >,
>> Catherine Hope writes:
>>
>>> I did some analysis on what API methods access others between the RI
>>> and Harmony (by subclassing ArrayList and adding some trace) to answer
>>> a couple of the review points.
>>>
>> Thanks Cath.  I wonder if we can use serialization/deserialization
>> to figure out how the RI grows the capacity of the ArrayList.  Our
>> implementation seems to grow it quite quickly.
>>
>>  The differences I found were:
>>>  - RI contains(Object) calls indexOf(Object), so we could also do this
>>> to reduce the code duplication
>>>
>> Sounds good.
>>
>
> +1
>
>
>   - RI add(Object), add(int, Object), addAll(Collection), addAll(int,
>>> Collection) use ensureCapacity(int)
>>>
>> Interesting.  Makes our three grow* methods seems a little excessive.
>>
>
> I was just thinking the same thing :)
>
> It occurs to me that the grow methods for the start and end of the list are
> just special cases of growForInsert(). I think we could probably roll them
> all into one neater method covering all three cases. This should remove the
> necessity to check if we are at the start/end of the array in the add
> methods too - when we return from growForInsert() we will know that there
> will be the required number of empty slots at location and that
> first/lastIndex have been updated, so we can then just write our data
> straight into that position.
>
>
>
>   - RI remove(Object) doesn't reference indexOf(Object) and remove(int) -
>>> though I don't think this would save anything if we changed it
>>>
>> Agreed.
>>
> +1
>
> Regards,
> Oliver
>
>  -Mark.
>>
>>
>>
>>
> --
> Oliver Deakin
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
>

Re: [classlib][luni] Collections classes - code reviews and optimisations

Posted by Oliver Deakin <ol...@googlemail.com>.
  On 04/08/2010 15:56, Mark Hindess wrote:
> I've now refactored the exception code to the start of the methods and
> shuffled some of the if else cases around a little.
>
> Next I plan to look at:
>
> 1) remove the following from addAll(...):
>
>           if (this == collection) {
>               collection = (ArrayList)clone();
>           }
>
>     since it is immediately followed by a call to collection.toArray() so
>     it should be unnecessary.
>
> 2) Change the fields to firstIndex and size.
>
> In message<AA...@mail.gmail.com>,
> Catherine Hope writes:
>> I did some analysis on what API methods access others between the RI
>> and Harmony (by subclassing ArrayList and adding some trace) to answer
>> a couple of the review points.
> Thanks Cath.  I wonder if we can use serialization/deserialization
> to figure out how the RI grows the capacity of the ArrayList.  Our
> implementation seems to grow it quite quickly.
>
>> The differences I found were:
>>   - RI contains(Object) calls indexOf(Object), so we could also do this
>> to reduce the code duplication
> Sounds good.

+1

>>   - RI add(Object), add(int, Object), addAll(Collection), addAll(int,
>> Collection) use ensureCapacity(int)
> Interesting.  Makes our three grow* methods seems a little excessive.

I was just thinking the same thing :)

It occurs to me that the grow methods for the start and end of the list 
are just special cases of growForInsert(). I think we could probably 
roll them all into one neater method covering all three cases. This 
should remove the necessity to check if we are at the start/end of the 
array in the add methods too - when we return from growForInsert() we 
will know that there will be the required number of empty slots at 
location and that first/lastIndex have been updated, so we can then just 
write our data straight into that position.


>>   - RI remove(Object) doesn't reference indexOf(Object) and remove(int) -
>> though I don't think this would save anything if we changed it
> Agreed.
+1

Regards,
Oliver

> -Mark.
>
>
>

-- 
Oliver Deakin
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: [classlib][luni] Collections classes - code reviews and optimisations

Posted by Mark Hindess <ma...@googlemail.com>.
I was going to add another ArrayList review comment in the code but it
was getting too long and so I'm going to write it here and reference
this message in the code.

I was looking at the growAtEnd method - fixing the things I wrote about
in r982777.  It says:

    private void growAtEnd(int required) {
        int lastIndex = firstIndex + size;
        if (required < array.length - size) {
            if (size > 0) {
[*]             System.arraycopy(array, firstIndex, array, 0, size);
                int start = size < firstIndex ? firstIndex : size;
                Arrays.fill(array, start, array.length, null);
            }
            firstIndex = 0;
        } else {
            // create big array and copy everything to it
        }
    }

At [*], we know that:

1) there is space at the front (firstIndex of space to be exact),

2) From (1) we can infer that there must have been an insertion at the
front at some point since that is the only code path to growAtFront(). [0]

In order to make space at the end (even a single entry) we use *all* the
space at the beginning.  This doesn't seem optimal because:

A) We know 2) so it is not unlikely that there will be another insertion
at the front in future.  Consider the pathological case of alternately
adding at front and end. growAtFront has the same logic so we end up
copying everything back and forth moving the free space from end to end.

B) The more we move the more we need to null out.  Consider the case
where we have 100 of free space at the front and none at the end and
we are appending one element to the end.  Currently we move all the
elements by 100 places to create 100 free space at the end.  We then
null out the old entries that we didn't overwrite during the move.  If
we'd only moved 50 places leaving a more balanced amount of free space
we'd only have to null out at most 50 old entries.

Also, we actually null out too much because:

a) We fill to array.length but we should fill to lastIndex (or
firstIndex-size) [this is already in the review comments]

b) We should assume that we are creating the required space for a
reason - that it is about to be used - so we needn't null the entries
we are about to write too anyway.

I think the excessive resetting to null fixes are a safe call so I am
going to make those soon.

I'm not exactly sure what to do about the other issue.

Comments welcome.

Regards,
 Mark.

[0] Actually just before sending this I realised that there is another
way to get free space at the front... be removing from location 0 -
i.e. treating the ArrayList as a kind of queue.  But I'm not sure this
entirely invalidates the argument it just isn't so clear cut.



Re: ArrayList usage (was: [classlib][luni]Collections classes - code reviews and optimisations)

Posted by Mark Hindess <ma...@googlemail.com>.
There are surprisingly many examples of code which should probably
use better default ArrayList sizes to avoid inevitable copying and
code which could use simpler methods...

In java/io/File.java, there are two examples of default sized arraylists
when we have a suitable upper bound that we could/should use as the initial
size.

java/net/URLConnection.java the code in addRequestProperty:

    valuesList = new ArrayList<String>();
    valuesList.add(0, newValue);

could be:

    valuesList = new ArrayList<String>();
    valuesList.add(newValue);

to avoid the more expensive add at location method when adding to the
just-created empty list.

In java/net/InetAddress.java createHostNameFromIPAddress we know that
in the common cases the

            ArrayList<String> hexStrings = new ArrayList<String>();
            ArrayList<String> decStrings = new ArrayList<String>();

sizes will be 16 and 4 respectively.

Ditto for the (identical?) code in:

  org/apache/harmony/luni/util/Inet6Util.java

In org/apache/harmony/niochar/CharsetProviderImpl.java, ArrayListr
should default to charsets.length() in:

    public Iterator<Charset> charsets() {
        ArrayList<Charset> list = new ArrayList<Charset>();


And that was just a quick scan of luni code.

Regards,
 Mark.

In message <20...@d12av03.megacenter.de.ibm.com>,
Mark Hindess writes:
> 
> The discussion about ArrayList made me wonder about ArrayList usage.
> Perhaps we should review some of our own usage of ArrayList.
> 
> consider drlvm/vm/vmcore/src/kernel_classes/javasrc/java/lang/Class.java:
> 
>     public Class[] getClasses() {
>         checkMemberAccess(Member.PUBLIC);
>         Class<?> clss = this;
> !       ArrayList<Class<?>> classes = null;
>         while (clss != null) {
>             Class<?>[] declared = VMClassRegistry.getDeclaredClasses(clss);
>             if (declared.length != 0) {
> -               if (classes == null) {
> -                   classes = new ArrayList<Class<?>>();
> -               }
>                 for (Class<?> c : declared) {
>                     if (Modifier.isPublic(c.getModifiers())) {
>                         classes.add(c);
>                     }
>                 }
>             }
>             clss = clss.getSuperclass();
>         }
>         if (classes == null) {
>             return new Class[0];
>         } else {
> !           return classes.toArray(new Class[classes.size()]);
>         }
>     }
> 
> I have to wonder whether it might not be better to write:
> 
>     public Class[] getClasses() {
>         checkMemberAccess(Member.PUBLIC);
>         Class<?> clss = this;
> !       ArrayList<Class<?>> classes = new ArrayList<Class<?>>(0);
>         while (clss != null) {
>             Class<?>[] declared = VMClassRegistry.getDeclaredClasses(clss);
>             if (declared.length != 0) {
>                 for (Class<?> c : declared) {
>                     if (Modifier.isPublic(c.getModifiers())) {
>                         classes.add(c);
>                     }
>                 }
>             }
>             clss = clss.getSuperclass();
>         }
>         if (classes == null) {
>             return new Class[0];
>         } else {
> !           return (Class[]) classes.toArray();
>         }
>     }
> 
> That is:
> 
> 1. create an empty 0 size list rather than use null or perhaps create it
>    lazily as it is originally but create it with size declared.length to
>    avoid a possible copy if the number of public classes is > 10.
> 
> 2. avoid the complexity of the toArray with generics method (which might be
>    more beneficial if the contents array was being reused) and just use
>    a cast
> 
> As another more obvious example, consider
> drlvm/vm/vmcore/src/kernel_classes/javasrc/java/lang/ClassLoader.java:
> 
>         private static void initResourceFinder() {
>             synchronized (bootstrapPath) {
>                 if (resourceFinder != null) {
>                     return;
>                 }                
>                 // -Xbootclasspath:"" should be interpreted as nothing define
> d,
>                 // like we do below:
>                 String st[] = fracture(bootstrapPath, File.pathSeparator);
>                 int l = st.length;
> !               ArrayList<URL> urlList = new ArrayList<URL>();
>                 for (int i = 0; i < l; i++) {
>                     try {
>                         urlList.add(new File(st[i]).toURI().toURL());
>                     } catch (MalformedURLException e) {
>                     }
>                 }
>                 ...
>             }
>         }
> 
> We know the maximum length of the bootclasspath is st.length and
> our default bootclasspath is longer than the default ArrayList size
> (~48 compared to 10) so we should create the ArrayList with: new
> ArrayList<URL>(st.length); to avoid the inevitable copy (copies
> actually!) when it gets too big.  Even if a couple of the URLs are
> malformed the cost of a few extra empty ArrayList entries is much
> cheaper than the cost of the grow/copy steps.
> 
> These aren't particularly good examples but are intended to facilitate
> discussion about usage.
> 
> Regards,
>  Mark.
> 



ArrayList usage (was: [classlib][luni]Collections classes - code reviews and optimisations)

Posted by Mark Hindess <ma...@googlemail.com>.
The discussion about ArrayList made me wonder about ArrayList usage.
Perhaps we should review some of our own usage of ArrayList.

consider drlvm/vm/vmcore/src/kernel_classes/javasrc/java/lang/Class.java:

    public Class[] getClasses() {
        checkMemberAccess(Member.PUBLIC);
        Class<?> clss = this;
!       ArrayList<Class<?>> classes = null;
        while (clss != null) {
            Class<?>[] declared = VMClassRegistry.getDeclaredClasses(clss);
            if (declared.length != 0) {
-               if (classes == null) {
-                   classes = new ArrayList<Class<?>>();
-               }
                for (Class<?> c : declared) {
                    if (Modifier.isPublic(c.getModifiers())) {
                        classes.add(c);
                    }
                }
            }
            clss = clss.getSuperclass();
        }
        if (classes == null) {
            return new Class[0];
        } else {
!           return classes.toArray(new Class[classes.size()]);
        }
    }

I have to wonder whether it might not be better to write:

    public Class[] getClasses() {
        checkMemberAccess(Member.PUBLIC);
        Class<?> clss = this;
!       ArrayList<Class<?>> classes = new ArrayList<Class<?>>(0);
        while (clss != null) {
            Class<?>[] declared = VMClassRegistry.getDeclaredClasses(clss);
            if (declared.length != 0) {
                for (Class<?> c : declared) {
                    if (Modifier.isPublic(c.getModifiers())) {
                        classes.add(c);
                    }
                }
            }
            clss = clss.getSuperclass();
        }
        if (classes == null) {
            return new Class[0];
        } else {
!           return (Class[]) classes.toArray();
        }
    }

That is:

1. create an empty 0 size list rather than use null or perhaps create it
   lazily as it is originally but create it with size declared.length to
   avoid a possible copy if the number of public classes is > 10.

2. avoid the complexity of the toArray with generics method (which might be
   more beneficial if the contents array was being reused) and just use
   a cast

As another more obvious example, consider
drlvm/vm/vmcore/src/kernel_classes/javasrc/java/lang/ClassLoader.java:

        private static void initResourceFinder() {
            synchronized (bootstrapPath) {
                if (resourceFinder != null) {
                    return;
                }                
                // -Xbootclasspath:"" should be interpreted as nothing defined,
                // like we do below:
                String st[] = fracture(bootstrapPath, File.pathSeparator);
                int l = st.length;
!               ArrayList<URL> urlList = new ArrayList<URL>();
                for (int i = 0; i < l; i++) {
                    try {
                        urlList.add(new File(st[i]).toURI().toURL());
                    } catch (MalformedURLException e) {
                    }
                }
                ...
            }
        }

We know the maximum length of the bootclasspath is st.length and
our default bootclasspath is longer than the default ArrayList size
(~48 compared to 10) so we should create the ArrayList with: new
ArrayList<URL>(st.length); to avoid the inevitable copy (copies
actually!) when it gets too big.  Even if a couple of the URLs are
malformed the cost of a few extra empty ArrayList entries is much
cheaper than the cost of the grow/copy steps.

These aren't particularly good examples but are intended to facilitate
discussion about usage.

Regards,
 Mark.



Re: [classlib][luni] Collections classes - code reviews and optimisations

Posted by Mark Hindess <ma...@googlemail.com>.
In message <20...@d12av03.megacenter.de.ibm.com>,
Mark Hindess writes:
> 
> I've now refactored the exception code to the start of the methods and
> shuffled some of the if else cases around a little.
> 
> Next I plan to look at:
> 
> 1) remove the following from addAll(...):
> 
>          if (this == collection) {
>              collection = (ArrayList)clone();
>          }
> 
>    since it is immediately followed by a call to collection.toArray() so
>    it should be unnecessary.

Done in r982377.  I also checked in the review comments we made to make it
easier to track them.

> 2) Change the fields to firstIndex and size.

I've now done this and checked it in at r982498.

I also made a minor change (below) to ensureCapacity that Oliver
observed does (accidentally) fix a bug that caused us to call a grow
method with 0 growSize and do a pointless array copy shuffling the
entire list.

     public void ensureCapacity(int minimumCapacity) {
-        if (array.length < minimumCapacity) {
+        int required = minimumCapacity - array.length;
+        if (required > 0) {
             // REVIEW: Why do we check the firstIndex first? Growing
             //         the end makes more sense
             if (firstIndex > 0) {
-                growAtFront(minimumCapacity - array.length);
+                growAtFront(required);
             } else {
-                growAtEnd(minimumCapacity - array.length);
+                growAtEnd(required);
             }
         }
     }

My initial (dacapo) benchmark runs on the original ArrayList, the
version with exception refactoring, the version without the spurious
clone and the version with the size changes seem to show that the
exception handling change was a bad idea but that we get back the lost
performance with the size change.  I'm just re-running the benchmarks
on a quiet(er) machine to confirm.  Will post details when they've
finished.

Regards,
 Mark.



Re: [classlib][luni] Collections classes - code reviews and optimisations

Posted by Mark Hindess <ma...@googlemail.com>.
I've now refactored the exception code to the start of the methods and
shuffled some of the if else cases around a little.

Next I plan to look at:

1) remove the following from addAll(...):

         if (this == collection) {
             collection = (ArrayList)clone();
         }

   since it is immediately followed by a call to collection.toArray() so
   it should be unnecessary.

2) Change the fields to firstIndex and size.

In message <AA...@mail.gmail.com>,
Catherine Hope writes:
> 
> I did some analysis on what API methods access others between the RI
> and Harmony (by subclassing ArrayList and adding some trace) to answer
> a couple of the review points.

Thanks Cath.  I wonder if we can use serialization/deserialization
to figure out how the RI grows the capacity of the ArrayList.  Our
implementation seems to grow it quite quickly.

> The differences I found were:
>  - RI contains(Object) calls indexOf(Object), so we could also do this
> to reduce the code duplication

Sounds good.

>  - RI add(Object), add(int, Object), addAll(Collection), addAll(int,
> Collection) use ensureCapacity(int)

Interesting.  Makes our three grow* methods seems a little excessive.

>  - RI remove(Object) doesn't reference indexOf(Object) and remove(int) -
> though I don't think this would save anything if we changed it

Agreed.

-Mark.