You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Dawid Weiss <da...@gmail.com> on 2010/01/12 13:19:55 UTC

Collections of primitives.

Hi guys,

I see Benson working really hard on converting Colt primitive
collections to Mahout -- this is great effort, really, since no such
library currently exists with an Apache or BSD license.

I wanted to ask you if compatibility with Java Collections is
something you consider crucial for a set of collection classes. There
are pros and cons of this compatibility.

1) compatibility with the Java Collections API gives you tons of other
libraries (Google Collections, for example) which you can use out of
the box with primitives,

2) compatibility with the Java Collections API means boxing/unboxing
conversions on standard API calls and awkward other methods, should
you wish to avoid such conversions,

3) collection classes have many strong contracts, including
fast-failing iterators, etc. These are largerly unnecessary for
computational code.

4) you may be fond of certain idioms you might have grown accustomed
to (subList().clear()).

5) resigning from certain API elements can yield much faster code
(resigning from bounds check, exposing the internal implementation of
a given type for custom processing).

I'm asking because we at Carrot Search have been working on a similar
library for managing basic collection types for primitives (and
generics), namely:

- hash maps (open addressing),
- sets (open addressing),
- efficient bit set and bit operations (we imported stuff from Lucene),
- stacks, dequeues, arrays.

Our line of thinking eventually led us to create a library that is
MOSTLY API-compatible, but NOT interface compatible. That is, for
example, you get put(byte, int) methods on a hash map specialized for
byte keys and int values, but this hash map does not implement
Map<Byte, Integer>. It is therefore not a general purpose Java
Collections replacement, but for computational code we found our
implementation very efficient and straightforward.

I have the code ready, tested and we'd be willing to contribute this
entirely to the Apache foundation. The upside is that it's
royalty-free (white box reimplementation of basic data structures).
Some of the code was borrowed from Lucene (BitSet) and the method of
open addressing is quadratic rather than double-hashing, which was
inspired by the work done on Google sparse hashes.

 I hope Benson won't feel offended -- he's done a great job working on
Colt's code, but if you think the above assumptions are fine (the
primary one being breaking the compatibility with Java Collections),
then perhaps we should apply for a commons sub-project (we currently
call this library "high performance primitive collections") and join
the efforts under a single umbrella for everyone's benefit?

Dawid

Re: Collections of primitives.

Posted by Jake Mannix <ja...@gmail.com>.
On Jan 13, 2010 7:42 AM, "Grant Ingersoll" <gs...@apache.org> wrote:

For better or worse, and as has been evidenced by the state of our use (or
trial) of several different matrix/collections libs at this point already,
I'm kind of liking controlling or own destiny when it comes to something so
core to our performance (not that I couldn't be talked out of it, I don't
believe in "Not Invented Here" for the most part).

That being said, Dawid, maybe you'd consider reverting your Emeritus status
and working here to build out the library that serves both of our needs?
 Then there is no need for the extra layer of going somewhere else and being
dependent on yet another third party w/ diff. release cycles etc.  The only
change for C2 would then be a dependency on Mahout Math (why does the
thought of Elephants counting always come to mind when I say that?)  Of
course, maybe that isn't ideal for C2.

I don't feel strongly about this, though.  We have plenty of other deps. as
it is and it seems there is a fair need for this between both projects.

Food for thought,
Grant

On Jan 12, 2010, at 9:56 AM, Benson Margulies wrote: > Dawid, > > Like I
said, I'm not sure we're...
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search

Re: Collections of primitives.

Posted by Dawid Weiss <da...@gmail.com>.
Ha! In that case fanfares to all of you.

D.

On Thu, Jan 14, 2010 at 1:09 PM, Benson Margulies <bi...@gmail.com> wrote:
> Hey, credit where credit is due. I did \not/ do the initial
> license-sorting port. I've been cleaning up and filling in after that
> (aside from one little licensing problem that was not related to
> LGPL). I think that Jake and Sean get the credit for the heavy
> lifting.
>
> On Thu, Jan 14, 2010 at 6:52 AM, Dawid Weiss <da...@gmail.com> wrote:
>> Oh, as a side note to Benson -- your effort on porting these COLT
>> collections is appreciated from more than one angle, regardless of the
>> HPPC discussion. We have been using COLT's math/ matrix packages in C2
>> and had a long-open  issue of getting rid of the LGPL parts. Solved
>> now, thanks!
>>
>> http://issues.carrot2.org/browse/CARROT-614
>>
>> Dawid
>>
>> On Thu, Jan 14, 2010 at 9:07 AM, Dawid Weiss <da...@gmail.com> wrote:
>>> Let's do this, guys: I have finished the implementation of basic data
>>> structures. I will try to merge this code with Carrot2, replacing PCJ;
>>> this should give me an additional level of confidency that everything
>>> is working fine. I plan to have this step done by Friday.
>>>
>>> Then, I will make this code available as a patch (ZIP) file on Mahout
>>> JIRA for you to have a look at. I am more than fine in maintaining it
>>> as part of Mahout (and reverting my active committer status), but we
>>> would need to make sure it's a separate deliverable (separate JAR;
>>> mahout-hppc; high performance primitive collections... or something).
>>>
>>> Hosting this project in Mahout will have the additional bonus of
>>> polishing the API for computational code. If folks decide it can have
>>> the life of its own, we can ask for a separate sub-space in the
>>> commons project.
>>>
>>> Sounds good? I'll get back to you on Friday/Saturday.
>>> Dawid
>>>
>>> On Wed, Jan 13, 2010 at 8:23 PM, Jake Mannix <ja...@gmail.com> wrote:
>>>> On Wed, Jan 13, 2010 at 10:43 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> It sure feels like it can and should be a module, which depends on no
>>>>> other Mahout modules. The only barrier is someone willing to do it --
>>>>> and looks like we have that.
>>>>>
>>>>
>>>> Right now, mahout-math depends on no other modules, and as of last night,
>>>> also does not depend on hadoop (yay!), so this should be getting easier to
>>>> do.
>>>>
>>>> The barrier to entry as an eventual commons project is that it still has way
>>>> more dependencies than typical commons projects, it looks like at first
>>>> glance.
>>>>
>>>>  -jake
>>>>
>>>
>>
>

Re: Collections of primitives.

Posted by Benson Margulies <bi...@gmail.com>.
Hey, credit where credit is due. I did \not/ do the initial
license-sorting port. I've been cleaning up and filling in after that
(aside from one little licensing problem that was not related to
LGPL). I think that Jake and Sean get the credit for the heavy
lifting.

On Thu, Jan 14, 2010 at 6:52 AM, Dawid Weiss <da...@gmail.com> wrote:
> Oh, as a side note to Benson -- your effort on porting these COLT
> collections is appreciated from more than one angle, regardless of the
> HPPC discussion. We have been using COLT's math/ matrix packages in C2
> and had a long-open  issue of getting rid of the LGPL parts. Solved
> now, thanks!
>
> http://issues.carrot2.org/browse/CARROT-614
>
> Dawid
>
> On Thu, Jan 14, 2010 at 9:07 AM, Dawid Weiss <da...@gmail.com> wrote:
>> Let's do this, guys: I have finished the implementation of basic data
>> structures. I will try to merge this code with Carrot2, replacing PCJ;
>> this should give me an additional level of confidency that everything
>> is working fine. I plan to have this step done by Friday.
>>
>> Then, I will make this code available as a patch (ZIP) file on Mahout
>> JIRA for you to have a look at. I am more than fine in maintaining it
>> as part of Mahout (and reverting my active committer status), but we
>> would need to make sure it's a separate deliverable (separate JAR;
>> mahout-hppc; high performance primitive collections... or something).
>>
>> Hosting this project in Mahout will have the additional bonus of
>> polishing the API for computational code. If folks decide it can have
>> the life of its own, we can ask for a separate sub-space in the
>> commons project.
>>
>> Sounds good? I'll get back to you on Friday/Saturday.
>> Dawid
>>
>> On Wed, Jan 13, 2010 at 8:23 PM, Jake Mannix <ja...@gmail.com> wrote:
>>> On Wed, Jan 13, 2010 at 10:43 AM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> It sure feels like it can and should be a module, which depends on no
>>>> other Mahout modules. The only barrier is someone willing to do it --
>>>> and looks like we have that.
>>>>
>>>
>>> Right now, mahout-math depends on no other modules, and as of last night,
>>> also does not depend on hadoop (yay!), so this should be getting easier to
>>> do.
>>>
>>> The barrier to entry as an eventual commons project is that it still has way
>>> more dependencies than typical commons projects, it looks like at first
>>> glance.
>>>
>>>  -jake
>>>
>>
>

Re: Collections of primitives.

Posted by Dawid Weiss <da...@gmail.com>.
Oh, as a side note to Benson -- your effort on porting these COLT
collections is appreciated from more than one angle, regardless of the
HPPC discussion. We have been using COLT's math/ matrix packages in C2
and had a long-open  issue of getting rid of the LGPL parts. Solved
now, thanks!

http://issues.carrot2.org/browse/CARROT-614

Dawid

On Thu, Jan 14, 2010 at 9:07 AM, Dawid Weiss <da...@gmail.com> wrote:
> Let's do this, guys: I have finished the implementation of basic data
> structures. I will try to merge this code with Carrot2, replacing PCJ;
> this should give me an additional level of confidency that everything
> is working fine. I plan to have this step done by Friday.
>
> Then, I will make this code available as a patch (ZIP) file on Mahout
> JIRA for you to have a look at. I am more than fine in maintaining it
> as part of Mahout (and reverting my active committer status), but we
> would need to make sure it's a separate deliverable (separate JAR;
> mahout-hppc; high performance primitive collections... or something).
>
> Hosting this project in Mahout will have the additional bonus of
> polishing the API for computational code. If folks decide it can have
> the life of its own, we can ask for a separate sub-space in the
> commons project.
>
> Sounds good? I'll get back to you on Friday/Saturday.
> Dawid
>
> On Wed, Jan 13, 2010 at 8:23 PM, Jake Mannix <ja...@gmail.com> wrote:
>> On Wed, Jan 13, 2010 at 10:43 AM, Sean Owen <sr...@gmail.com> wrote:
>>
>>> It sure feels like it can and should be a module, which depends on no
>>> other Mahout modules. The only barrier is someone willing to do it --
>>> and looks like we have that.
>>>
>>
>> Right now, mahout-math depends on no other modules, and as of last night,
>> also does not depend on hadoop (yay!), so this should be getting easier to
>> do.
>>
>> The barrier to entry as an eventual commons project is that it still has way
>> more dependencies than typical commons projects, it looks like at first
>> glance.
>>
>>  -jake
>>
>

Re: Collections of primitives.

Posted by Dawid Weiss <da...@gmail.com>.
Let's do this, guys: I have finished the implementation of basic data
structures. I will try to merge this code with Carrot2, replacing PCJ;
this should give me an additional level of confidency that everything
is working fine. I plan to have this step done by Friday.

Then, I will make this code available as a patch (ZIP) file on Mahout
JIRA for you to have a look at. I am more than fine in maintaining it
as part of Mahout (and reverting my active committer status), but we
would need to make sure it's a separate deliverable (separate JAR;
mahout-hppc; high performance primitive collections... or something).

Hosting this project in Mahout will have the additional bonus of
polishing the API for computational code. If folks decide it can have
the life of its own, we can ask for a separate sub-space in the
commons project.

Sounds good? I'll get back to you on Friday/Saturday.
Dawid

On Wed, Jan 13, 2010 at 8:23 PM, Jake Mannix <ja...@gmail.com> wrote:
> On Wed, Jan 13, 2010 at 10:43 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> It sure feels like it can and should be a module, which depends on no
>> other Mahout modules. The only barrier is someone willing to do it --
>> and looks like we have that.
>>
>
> Right now, mahout-math depends on no other modules, and as of last night,
> also does not depend on hadoop (yay!), so this should be getting easier to
> do.
>
> The barrier to entry as an eventual commons project is that it still has way
> more dependencies than typical commons projects, it looks like at first
> glance.
>
>  -jake
>

Re: Collections of primitives.

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Jan 13, 2010 at 10:43 AM, Sean Owen <sr...@gmail.com> wrote:

> It sure feels like it can and should be a module, which depends on no
> other Mahout modules. The only barrier is someone willing to do it --
> and looks like we have that.
>

Right now, mahout-math depends on no other modules, and as of last night,
also does not depend on hadoop (yay!), so this should be getting easier to
do.

The barrier to entry as an eventual commons project is that it still has way
more dependencies than typical commons projects, it looks like at first
glance.

  -jake

Re: Collections of primitives.

Posted by Benson Margulies <bi...@gmail.com>.
I was going to propose the necessary module multiplication and surgery
to do this to the colt-derived collections eventually, but I won't
bother until I see how this all comes out.



On Wed, Jan 13, 2010 at 1:43 PM, Sean Owen <sr...@gmail.com> wrote:
> It sure feels like it can and should be a module, which depends on no
> other Mahout modules. The only barrier is someone willing to do it --
> and looks like we have that.
>
> It's just good design, and indeed facilitates reuse. +1
>
> On Wed, Jan 13, 2010 at 6:41 PM, Benson Margulies <bi...@gmail.com> wrote:
>> How about this: How about Dawid & I set out to make this collections
>> its own module. We can then try to sell it to commons if and when we
>> feel like it?
>

Re: Collections of primitives.

Posted by Sean Owen <sr...@gmail.com>.
It sure feels like it can and should be a module, which depends on no
other Mahout modules. The only barrier is someone willing to do it --
and looks like we have that.

It's just good design, and indeed facilitates reuse. +1

On Wed, Jan 13, 2010 at 6:41 PM, Benson Margulies <bi...@gmail.com> wrote:
> How about this: How about Dawid & I set out to make this collections
> its own module. We can then try to sell it to commons if and when we
> feel like it?

Re: Collections of primitives.

Posted by Benson Margulies <bi...@gmail.com>.
How about this: How about Dawid & I set out to make this collections
its own module. We can then try to sell it to commons if and when we
feel like it?

On Wed, Jan 13, 2010 at 12:57 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Jan 13, 2010, at 11:23 AM, Benson Margulies wrote:
>
>> Here's my view.
>>
>> If the Carrot-derived collections can be negotiated into commons in a
>> nice state, Mahout can consider pitching the Colt collection in their
>> favor. In the mean time, they could hide inside the rest of carrot of
>> carrot is coming into Mahout.
>
> Understood.  It's just everyone involved is already a committer here (w/ the exception of Staszek, Dawid's partner at C2) and so bringing in that code to Mahout is next to trivial.  Starting up a new commons project, dealing w/ commit rights, etc. is a lot more involved.  And, since it is so core to what we do, I'd want a good flow between the two.  Like I said, though, I don't feel that strongly about it.  I just know I don't have the bandwidth right now to go start another project.  However, if you guys do, let's do it.

Re: Collections of primitives.

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 13, 2010, at 11:23 AM, Benson Margulies wrote:

> Here's my view.
> 
> If the Carrot-derived collections can be negotiated into commons in a
> nice state, Mahout can consider pitching the Colt collection in their
> favor. In the mean time, they could hide inside the rest of carrot of
> carrot is coming into Mahout.

Understood.  It's just everyone involved is already a committer here (w/ the exception of Staszek, Dawid's partner at C2) and so bringing in that code to Mahout is next to trivial.  Starting up a new commons project, dealing w/ commit rights, etc. is a lot more involved.  And, since it is so core to what we do, I'd want a good flow between the two.  Like I said, though, I don't feel that strongly about it.  I just know I don't have the bandwidth right now to go start another project.  However, if you guys do, let's do it.

Re: Collections of primitives.

Posted by Benson Margulies <bi...@gmail.com>.
Here's my view.

If the Carrot-derived collections can be negotiated into commons in a
nice state, Mahout can consider pitching the Colt collection in their
favor. In the mean time, they could hide inside the rest of carrot of
carrot is coming into Mahout.

--benson


On Wed, Jan 13, 2010 at 10:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
> For better or worse, and as has been evidenced by the state of our use (or trial) of several different matrix/collections libs at this point already, I'm kind of liking controlling or own destiny when it comes to something so core to our performance (not that I couldn't be talked out of it, I don't believe in "Not Invented Here" for the most part).
>
> That being said, Dawid, maybe you'd consider reverting your Emeritus status and working here to build out the library that serves both of our needs?  Then there is no need for the extra layer of going somewhere else and being dependent on yet another third party w/ diff. release cycles etc.  The only change for C2 would then be a dependency on Mahout Math (why does the thought of Elephants counting always come to mind when I say that?)  Of course, maybe that isn't ideal for C2.
>
> I don't feel strongly about this, though.  We have plenty of other deps. as it is and it seems there is a fair need for this between both projects.
>
> Food for thought,
> Grant
>
> On Jan 12, 2010, at 9:56 AM, Benson Margulies wrote:
>
>> Dawid,
>>
>> Like I said, I'm not sure we're disagreeing. My focal goal is
>> primitive collections, and I'm prepared to take my lumps with
>> compatibility. Sun has made such a mess of the collections API that we
>> seem forced to choose.
>>
>> --benson
>>
>>
>> On Tue, Jan 12, 2010 at 9:28 AM, Dawid Weiss <da...@gmail.com> wrote:
>>> Thanks for the clarification and understanding of my motives, Benson.
>>>
>>> I know Trove and I know other libraries of this type -- PCJ has been
>>> our favorite so far, but it's LGPL and our persistent attempts to ask
>>> Soren Bak to distribute that code under a different license have
>>> failed.
>>>
>>> Adapters are a way to work around the compatibility problems (they are
>>> used in PCJ, for example), but they're in many ways defeat the purpose
>>> of having a collections library based on primitives. Other than lower
>>> memory consumption in the static case, you gain very little
>>> (autoboxing will kill performance).
>>>
>>> Let me do this. I'll send you a ZIP file with the code as it is right
>>> now in our repository. Take a look at what we've done and compare it
>>> to your efforts (especially in terms of templates-code generation I
>>> think you'll find it interesting). I should be done with the
>>> implementation of Deque today or tomorrow and then, knowing what other
>>> people think, we can decide how to move forward.
>>>
>>> Dawid
>>>
>>> On Tue, Jan 12, 2010 at 1:34 PM, Benson Margulies <bi...@gmail.com> wrote:
>>>> Dawid,
>>>>
>>>> I find that I didn't quite answer all of your questions, and then
>>>> again maybe I'm not in a position to.
>>>>
>>>> I started this by looking for some way to get the functionality of
>>>> Trove without the GPL. When I discovered that Mahout had already
>>>> absorbed Colt, I decided that the shortest path was to start from
>>>> there.
>>>>
>>>> I looked into planting this in commons-collections, but it was not so
>>>> easy. Colt collections depend on some scalar math code (random
>>>> numbers, etc). The commons-math people have a very high bar for
>>>> contributions, and the process of factoring out the math substrate,
>>>> comparing it to commons math, identifying the delta, contributing it
>>>> to math, etc, etc, was beyond my personal resources.
>>>>
>>>> If you've got a foundation member at hand, collections will make you a
>>>> sandbox to facilitate the contribution. Much as I'm enjoying myself at
>>>> this, I certainly won't be offended if Mahout throws it all out in
>>>> favor of what you've got there.
>>>>
>>>> One issue for Mahout is that it is using some above-collections
>>>> functionality from Colt that will take very careful modification to
>>>> use your (or any other) alternative.
>>>>
>>>> --benson
>>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
>
>

Re: Collections of primitives.

Posted by Grant Ingersoll <gs...@apache.org>.
For better or worse, and as has been evidenced by the state of our use (or trial) of several different matrix/collections libs at this point already, I'm kind of liking controlling or own destiny when it comes to something so core to our performance (not that I couldn't be talked out of it, I don't believe in "Not Invented Here" for the most part).   

That being said, Dawid, maybe you'd consider reverting your Emeritus status and working here to build out the library that serves both of our needs?  Then there is no need for the extra layer of going somewhere else and being dependent on yet another third party w/ diff. release cycles etc.  The only change for C2 would then be a dependency on Mahout Math (why does the thought of Elephants counting always come to mind when I say that?)  Of course, maybe that isn't ideal for C2.

I don't feel strongly about this, though.  We have plenty of other deps. as it is and it seems there is a fair need for this between both projects.

Food for thought,
Grant

On Jan 12, 2010, at 9:56 AM, Benson Margulies wrote:

> Dawid,
> 
> Like I said, I'm not sure we're disagreeing. My focal goal is
> primitive collections, and I'm prepared to take my lumps with
> compatibility. Sun has made such a mess of the collections API that we
> seem forced to choose.
> 
> --benson
> 
> 
> On Tue, Jan 12, 2010 at 9:28 AM, Dawid Weiss <da...@gmail.com> wrote:
>> Thanks for the clarification and understanding of my motives, Benson.
>> 
>> I know Trove and I know other libraries of this type -- PCJ has been
>> our favorite so far, but it's LGPL and our persistent attempts to ask
>> Soren Bak to distribute that code under a different license have
>> failed.
>> 
>> Adapters are a way to work around the compatibility problems (they are
>> used in PCJ, for example), but they're in many ways defeat the purpose
>> of having a collections library based on primitives. Other than lower
>> memory consumption in the static case, you gain very little
>> (autoboxing will kill performance).
>> 
>> Let me do this. I'll send you a ZIP file with the code as it is right
>> now in our repository. Take a look at what we've done and compare it
>> to your efforts (especially in terms of templates-code generation I
>> think you'll find it interesting). I should be done with the
>> implementation of Deque today or tomorrow and then, knowing what other
>> people think, we can decide how to move forward.
>> 
>> Dawid
>> 
>> On Tue, Jan 12, 2010 at 1:34 PM, Benson Margulies <bi...@gmail.com> wrote:
>>> Dawid,
>>> 
>>> I find that I didn't quite answer all of your questions, and then
>>> again maybe I'm not in a position to.
>>> 
>>> I started this by looking for some way to get the functionality of
>>> Trove without the GPL. When I discovered that Mahout had already
>>> absorbed Colt, I decided that the shortest path was to start from
>>> there.
>>> 
>>> I looked into planting this in commons-collections, but it was not so
>>> easy. Colt collections depend on some scalar math code (random
>>> numbers, etc). The commons-math people have a very high bar for
>>> contributions, and the process of factoring out the math substrate,
>>> comparing it to commons math, identifying the delta, contributing it
>>> to math, etc, etc, was beyond my personal resources.
>>> 
>>> If you've got a foundation member at hand, collections will make you a
>>> sandbox to facilitate the contribution. Much as I'm enjoying myself at
>>> this, I certainly won't be offended if Mahout throws it all out in
>>> favor of what you've got there.
>>> 
>>> One issue for Mahout is that it is using some above-collections
>>> functionality from Colt that will take very careful modification to
>>> use your (or any other) alternative.
>>> 
>>> --benson
>>> 
>> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Collections of primitives.

Posted by Benson Margulies <bi...@gmail.com>.
Dawid,

Like I said, I'm not sure we're disagreeing. My focal goal is
primitive collections, and I'm prepared to take my lumps with
compatibility. Sun has made such a mess of the collections API that we
seem forced to choose.

--benson


On Tue, Jan 12, 2010 at 9:28 AM, Dawid Weiss <da...@gmail.com> wrote:
> Thanks for the clarification and understanding of my motives, Benson.
>
> I know Trove and I know other libraries of this type -- PCJ has been
> our favorite so far, but it's LGPL and our persistent attempts to ask
> Soren Bak to distribute that code under a different license have
> failed.
>
> Adapters are a way to work around the compatibility problems (they are
> used in PCJ, for example), but they're in many ways defeat the purpose
> of having a collections library based on primitives. Other than lower
> memory consumption in the static case, you gain very little
> (autoboxing will kill performance).
>
> Let me do this. I'll send you a ZIP file with the code as it is right
> now in our repository. Take a look at what we've done and compare it
> to your efforts (especially in terms of templates-code generation I
> think you'll find it interesting). I should be done with the
> implementation of Deque today or tomorrow and then, knowing what other
> people think, we can decide how to move forward.
>
> Dawid
>
> On Tue, Jan 12, 2010 at 1:34 PM, Benson Margulies <bi...@gmail.com> wrote:
>> Dawid,
>>
>> I find that I didn't quite answer all of your questions, and then
>> again maybe I'm not in a position to.
>>
>> I started this by looking for some way to get the functionality of
>> Trove without the GPL. When I discovered that Mahout had already
>> absorbed Colt, I decided that the shortest path was to start from
>> there.
>>
>> I looked into planting this in commons-collections, but it was not so
>> easy. Colt collections depend on some scalar math code (random
>> numbers, etc). The commons-math people have a very high bar for
>> contributions, and the process of factoring out the math substrate,
>> comparing it to commons math, identifying the delta, contributing it
>> to math, etc, etc, was beyond my personal resources.
>>
>> If you've got a foundation member at hand, collections will make you a
>> sandbox to facilitate the contribution. Much as I'm enjoying myself at
>> this, I certainly won't be offended if Mahout throws it all out in
>> favor of what you've got there.
>>
>> One issue for Mahout is that it is using some above-collections
>> functionality from Colt that will take very careful modification to
>> use your (or any other) alternative.
>>
>> --benson
>>
>

Re: Collections of primitives.

Posted by Dawid Weiss <da...@gmail.com>.
Thanks for the clarification and understanding of my motives, Benson.

I know Trove and I know other libraries of this type -- PCJ has been
our favorite so far, but it's LGPL and our persistent attempts to ask
Soren Bak to distribute that code under a different license have
failed.

Adapters are a way to work around the compatibility problems (they are
used in PCJ, for example), but they're in many ways defeat the purpose
of having a collections library based on primitives. Other than lower
memory consumption in the static case, you gain very little
(autoboxing will kill performance).

Let me do this. I'll send you a ZIP file with the code as it is right
now in our repository. Take a look at what we've done and compare it
to your efforts (especially in terms of templates-code generation I
think you'll find it interesting). I should be done with the
implementation of Deque today or tomorrow and then, knowing what other
people think, we can decide how to move forward.

Dawid

On Tue, Jan 12, 2010 at 1:34 PM, Benson Margulies <bi...@gmail.com> wrote:
> Dawid,
>
> I find that I didn't quite answer all of your questions, and then
> again maybe I'm not in a position to.
>
> I started this by looking for some way to get the functionality of
> Trove without the GPL. When I discovered that Mahout had already
> absorbed Colt, I decided that the shortest path was to start from
> there.
>
> I looked into planting this in commons-collections, but it was not so
> easy. Colt collections depend on some scalar math code (random
> numbers, etc). The commons-math people have a very high bar for
> contributions, and the process of factoring out the math substrate,
> comparing it to commons math, identifying the delta, contributing it
> to math, etc, etc, was beyond my personal resources.
>
> If you've got a foundation member at hand, collections will make you a
> sandbox to facilitate the contribution. Much as I'm enjoying myself at
> this, I certainly won't be offended if Mahout throws it all out in
> favor of what you've got there.
>
> One issue for Mahout is that it is using some above-collections
> functionality from Colt that will take very careful modification to
> use your (or any other) alternative.
>
> --benson
>

Re: Collections of primitives.

Posted by Benson Margulies <bi...@gmail.com>.
Dawid,

I find that I didn't quite answer all of your questions, and then
again maybe I'm not in a position to.

I started this by looking for some way to get the functionality of
Trove without the GPL. When I discovered that Mahout had already
absorbed Colt, I decided that the shortest path was to start from
there.

I looked into planting this in commons-collections, but it was not so
easy. Colt collections depend on some scalar math code (random
numbers, etc). The commons-math people have a very high bar for
contributions, and the process of factoring out the math substrate,
comparing it to commons math, identifying the delta, contributing it
to math, etc, etc, was beyond my personal resources.

If you've got a foundation member at hand, collections will make you a
sandbox to facilitate the contribution. Much as I'm enjoying myself at
this, I certainly won't be offended if Mahout throws it all out in
favor of what you've got there.

One issue for Mahout is that it is using some above-collections
functionality from Colt that will take very careful modification to
use your (or any other) alternative.

--benson

Re: Collections of primitives.

Posted by Benson Margulies <bi...@gmail.com>.
Dawid,

There is a model compromise out there: the Trove 'decorator' approach.
I'm perfectly happy to follow that model to give people whatever value
you can get from Java collection compatibility. I confess that I've
been considering using it as an excuse to learn the CGM library and
generate the code by magic, but I suspect that there will be a giant
retching noise at that suggestion.

Anyway, my initial goal is just to clean up and fill out the tight
APIs, and that's why you see such a paucity of compatibility up to
this point.

--benson



On Tue, Jan 12, 2010 at 7:19 AM, Dawid Weiss <da...@gmail.com> wrote:
> Hi guys,
>
> I see Benson working really hard on converting Colt primitive
> collections to Mahout -- this is great effort, really, since no such
> library currently exists with an Apache or BSD license.
>
> I wanted to ask you if compatibility with Java Collections is
> something you consider crucial for a set of collection classes. There
> are pros and cons of this compatibility.
>
> 1) compatibility with the Java Collections API gives you tons of other
> libraries (Google Collections, for example) which you can use out of
> the box with primitives,
>
> 2) compatibility with the Java Collections API means boxing/unboxing
> conversions on standard API calls and awkward other methods, should
> you wish to avoid such conversions,
>
> 3) collection classes have many strong contracts, including
> fast-failing iterators, etc. These are largerly unnecessary for
> computational code.
>
> 4) you may be fond of certain idioms you might have grown accustomed
> to (subList().clear()).
>
> 5) resigning from certain API elements can yield much faster code
> (resigning from bounds check, exposing the internal implementation of
> a given type for custom processing).
>
> I'm asking because we at Carrot Search have been working on a similar
> library for managing basic collection types for primitives (and
> generics), namely:
>
> - hash maps (open addressing),
> - sets (open addressing),
> - efficient bit set and bit operations (we imported stuff from Lucene),
> - stacks, dequeues, arrays.
>
> Our line of thinking eventually led us to create a library that is
> MOSTLY API-compatible, but NOT interface compatible. That is, for
> example, you get put(byte, int) methods on a hash map specialized for
> byte keys and int values, but this hash map does not implement
> Map<Byte, Integer>. It is therefore not a general purpose Java
> Collections replacement, but for computational code we found our
> implementation very efficient and straightforward.
>
> I have the code ready, tested and we'd be willing to contribute this
> entirely to the Apache foundation. The upside is that it's
> royalty-free (white box reimplementation of basic data structures).
> Some of the code was borrowed from Lucene (BitSet) and the method of
> open addressing is quadratic rather than double-hashing, which was
> inspired by the work done on Google sparse hashes.
>
>  I hope Benson won't feel offended -- he's done a great job working on
> Colt's code, but if you think the above assumptions are fine (the
> primary one being breaking the compatibility with Java Collections),
> then perhaps we should apply for a commons sub-project (we currently
> call this library "high performance primitive collections") and join
> the efforts under a single umbrella for everyone's benefit?
>
> Dawid
>