Posted to dev@mahout.apache.org by Dawid Weiss <da...@cs.put.poznan.pl> on 2011/03/25 15:36:27 UTC

Flattened arrays of simple structures (valuetype-like classes).

Hi guys,

This is not directly related to Mahout, but since most of you deal with
computations, I think it is relevant and I seek feedback/ improvement ideas.
If you've ever had to create a large array (or worse: multidimensional
array) of a relatively simple structure-like data holder class then you
probably know the pain of initializing sub-arrays and the memory overhead
that jagged arrays incur. The idea to generate stub code to handle such
cases has been in my head for a long time, and I finally managed to find
some time to implement it. I really like the results so far, especially in
the multidimensional case, where the code is so much nicer. Even if you have a
relatively simple array of byte[][] you can do this:

@Struct(dimensions = 2)
public final class Byte {
  public byte value;
}

This will generate a stub class ByteArray2D (if javac has access to
apt-processor in hppc-struct, that is; or if your maven project is
configured properly, see hppc-examples for how to do this) with a single
byte[] field and flattened Byte objects, including accessors to individual
fields or valuetype-copying methods for handling entire structures. More is
here:

http://issues.carrot2.org/browse/HPPC-54

And a trivial sample here:

https://github.com/carrotsearch/hppc/blob/master/hppc-examples/src/main/java/com/carrotsearch/hppc/examples/StructExample.java
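
For readers who want a picture of the idea before following the links, here
is a hand-written sketch of a flattened two-dimensional byte array. It is
NOT the code hppc-struct generates; the class name FlatByteArray2D and its
methods are assumptions made up for illustration.

```java
// Hypothetical hand-written equivalent; not the output of hppc-struct.
final class FlatByteArray2D {
  final int rows;
  final int cols;
  final byte[] data; // single backing array, stored row by row

  FlatByteArray2D(int rows, int cols) {
    this.rows = rows;
    this.cols = cols;
    this.data = new byte[rows * cols];
  }

  // Maps (row, col) to a position in the flat array.
  // The asserts only run when the JVM is started with -ea.
  private int offset(int row, int col) {
    assert row >= 0 && row < rows : "row out of range: " + row;
    assert col >= 0 && col < cols : "col out of range: " + col;
    return row * cols + col;
  }

  byte get(int row, int col) {
    return data[offset(row, col)];
  }

  void set(int row, int col, byte value) {
    data[offset(row, col)] = value;
  }
}
```

Compared to new byte[rows][cols], this needs one allocation instead of
rows + 1 and has no per-row array objects to initialize.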

Again, if you have any ideas/ improvements, they are most welcome (use JIRA
above for comments or fork the code on github).

Dawid

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> I like the technique.  Compilers are now fast enough that this provides a
> very nice way to do all kinds of code generation tricks.

The speed is pretty much as if you had one more source code file to
compile -- it is nearly instant.

> If I am not mistaken, it even leaves a moderately debuggable program behind
> as well.  I don't suppose that the source is available at debug time, but
> most other things would make sense.

You can write an annotation processor that generates .class files
directly, but I went with the simpler (and more intuitive) way of
generating Java source that the compiler picks and compiles in the
next round. The source code can be left on disk, which means it is
available for debugging, inspection, whatever.
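
A minimal sketch of that source-generating approach, using only the standard
javax.annotation.processing API. This is not the hppc-struct processor: the
annotation name "example.Struct", the class StructSketchProcessor and the
shape of the generated class are all assumptions, and the sketch ignores
packages and fields entirely.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Set;
import javax.annotation.processing.AbstractProcessor;
import javax.annotation.processing.RoundEnvironment;
import javax.annotation.processing.SupportedAnnotationTypes;
import javax.lang.model.SourceVersion;
import javax.lang.model.element.Element;
import javax.lang.model.element.TypeElement;
import javax.tools.Diagnostic;
import javax.tools.JavaFileObject;

// Hypothetical processor; "example.Struct" is an assumed annotation name.
@SupportedAnnotationTypes("example.Struct")
class StructSketchProcessor extends AbstractProcessor {

  @Override
  public SourceVersion getSupportedSourceVersion() {
    return SourceVersion.latestSupported();
  }

  // Builds the text of the generated class (default package, for brevity).
  static String generateSource(String simpleName) {
    String generated = simpleName + "Array1D";
    return "final class " + generated + " {\n"
        + "  final byte[] data;\n"
        + "  " + generated + "(int length) { data = new byte[length]; }\n"
        + "}\n";
  }

  @Override
  public boolean process(Set<? extends TypeElement> annotations,
                         RoundEnvironment roundEnv) {
    for (TypeElement annotation : annotations) {
      for (Element e : roundEnv.getElementsAnnotatedWith(annotation)) {
        String name = e.getSimpleName().toString();
        try {
          // The compiler picks this file up and compiles it in the next round.
          JavaFileObject file =
              processingEnv.getFiler().createSourceFile(name + "Array1D", e);
          try (Writer w = file.openWriter()) {
            w.write(generateSource(name));
          }
        } catch (IOException ex) {
          processingEnv.getMessager()
              .printMessage(Diagnostic.Kind.ERROR, ex.toString(), e);
        }
      }
    }
    return true; // the annotation is claimed by this processor
  }
}
```

Because the processor writes ordinary .java text through the Filer, the
generated file can be kept on disk and opened in a debugger like any other
source file.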

I think making it a separate project from HPPC is a good idea, it is
fairly self-contained.

Dawid

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Ted Dunning <te...@gmail.com>.

Very much.  Thank you.

I like the technique.  Compilers are now fast enough that this provides a
very nice way to do all kinds of code generation tricks.

If I am not mistaken, it even leaves a moderately debuggable program behind
as well.  I don't suppose that the source is available at debug time, but
most other things would make sense.

On Sat, Mar 26, 2011 at 1:49 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:

> If you use Maven, hppc-examples project on HPPC's github has
> a working configuration at the moment. Does this help, at least a
> little bit, Ted?
>

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Ted Dunning <te...@gmail.com>.

You weren't rude.  I wasn't being sarcastic but was honestly complimenting
you on your generosity in what often devolves to a religious discussion.

On Sat, Mar 26, 2011 at 1:49 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:

> >> I guess both viewpoints have pros and cons, so convincing anybody does
> >> not make much sense, but it's an interesting discussion nonetheless.
> >
> > Thank you for putting this so generously.
>
> I didn't mean to be rude and I hope this wasn't understood this way. I
> just think this is a matter where different opinions can be equally
> well justified and since so it will be hard to find a clear "winning"
> side.

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> I must be a total pessimist.  I don't believe in the existence of "correctly written programs" any more
> than I believe in Santa Claus.

I don't believe in correct programs either, but I believe that you can
be sure enough nothing will break to run without assertions.
Inevitably something WILL break sooner or later (due to software, data
or hardware issue), but enabling assertions or condition checking
won't help you much anyway. But then -- I really don't have a strong
opinion on this and you will be able to easily convert me to your
side, especially if you buy me a German beer in Berlin ;)

>> I guess both viewpoints have pros and cons, so convincing anybody does
>> not make much sense, but it's an interesting discussion nonetheless.
>
> Thank you for putting this so generously.

I didn't mean to be rude and I hope it wasn't understood that way. I
just think this is a matter where different opinions can be equally
well justified, so it will be hard to find a clear "winning"
side.

> Commonly, I find the risky calls save nothing, especially with modern JVM's.
>  So I put back in the safe calls.

It is so hard to tell, really. Modern JVMs (and modern CPUs) are so
unpredictable in terms of performance... I agree in most cases tiny
optimizations don't make much sense in the global scenario (especially
if micro-benchmarked in isolation). I am just used to thinking in
terms of possible performance losses and this habit (or addiction) is
hard to get rid of.

> setRiskyDontUseThisWithoutAdultSupervision.

This is a good one. I would use it immediately :)

> Did you generate this code using a separate code generator step?  Or using a
> class loader magic thing?
> Can you point us at the heart of the code generator in either case?

I'm sorry, I should have made it clear -- this "generator" is actually
an annotation processor that is automatically discovered by javac (or
eclipse's compiler) if found on the classpath. In other words it generates
those array classes and immediately compiles them at the same time you
are compiling your sources (and only those classes that are annotated
with @Struct of course). So there is nothing special you need to do
other than make hppc-struct (and its dependencies) available in
compilation classpath or otherwise to javac. Annotations themselves are
covered by JSR175, here:

http://www.jcp.org/en/jsr/detail?id=175

The annotation processing API (JSR269) has been integrated in Java 1.6,
so every 1.6 compatible compiler
should be able to make use of it. For javac, the documentation is
here:

http://download.oracle.com/javase/6/docs/technotes/tools/windows/javac.html#processing

I will create some documentation and a decent example that uses ANT
and Maven, but I wanted to get some feedback first (it is still early
stages). If you use Maven, hppc-examples project on HPPC's github has
a working configuration at the moment. Does this help, at least a
little bit, Ted?

Dawid

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Ted Dunning <te...@gmail.com>.

On Sat, Mar 26, 2011 at 5:04 AM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:

> > They are there to assert things that should never ever be
> > false, if the program is working, no matter what bad input or network
> faults
> > occur. Turning them off should have zero effect in a working program.
>
> This is an interesting point of view and I think I like it even better
> than my own, only I would change "zero effect in a working program" to
> "zero effect in a correctly written program". That is adhering to API
> contracts, etc. This was actually my understanding when I was writing
> HPPC -- that once your code passes a million unit/ integration tests
> you can trust it enough to run it without assertions in production...
> when/if something breaks, you will know anyway because of malformed
> output or other exceptions and then you can rerun with assertions on
> or add more tests.
>

I must be a total pessimist.

I don't believe in the existence of "correctly written programs" any more
than I believe in Santa Claus.  There may be tiny moments in time when a
program is correct, but once it becomes more than small or more than a short
period passes between the writing and the rewriting or more than one person
dips their pen, it becomes a dream rather than a reality.

> I guess both viewpoints have pros and cons, so convincing anybody does
> not make much sense, but it's an interesting discussion nonetheless.
>

Thank you for putting this so generously.

> > I don't like the "checked" and "unchecked" getter idiom. We have such a
> > thing in Vector -- set() and setQuick(). From reading the API, well,
> who's
> > not going to choose the quick operation over the "slow" one?
>
> Exactly. I always found it a bit odd when I was presented two versions
> of essentially the same method... and usually went with the "riskier"
> one.
>

Wow.  I go the opposite way.  I always pick the safer method.  Then if I
have time in my development cycle or the profiler shows that code to be
slow, I go in and use loop invariants to prove the risky calls are safe.

Commonly, I find the risky calls save nothing, especially with modern JVMs.
So I put back in the safe calls.

Occasionally, there is a >10% difference.  Then I put in comments and
argument testing outside the loop to try to ensure that nobody else breaks
the code.

When reviewing code, I always view the use of the unsafe versions as a bug
unless there is documentation backing up the benefit and safeguards around
the risky version.

I also try to use the terminology "safe" and "risky" rather than "slow" and
"fast".  The safe/risky terminology is always correct while the slow/fast
terminology is only occasionally correct. The setQuick nomenclature is
something we inherited from Colt and would have a hard time changing even
though it would probably be better called
setRiskyDontUseThisWithoutAdultSupervision.

> > unchecked version is I think tiny. Bad input will just result in an
> > ArrayIndexOutOfBoundsException or NullPointerException quickly anyway.
>
> It may or it may not, if your storage is larger than your input. I
> guess most of the problems are off-by-one errors and not
> off-by-million (or negative index) errors.

Views of arrays also make this kind of error even harder to see without
explicit checks.
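
A small sketch of why views hide this kind of error. The names set() and
setQuick() echo the Vector methods discussed in this thread, but the class
VectorViewSketch below is hypothetical and is not Mahout's implementation.

```java
// Hypothetical view over a shared backing array; not Mahout's Vector.
final class VectorViewSketch {
  private final double[] storage; // may be shared with other views
  private final int offset;       // where this view starts in storage
  private final int length;       // number of elements visible in this view

  VectorViewSketch(double[] storage, int offset, int length) {
    this.storage = storage;
    this.offset = offset;
    this.length = length;
  }

  // Safe: validates the index against the view, not the storage.
  void set(int index, double value) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException(
          "index=" + index + ", length=" + length);
    }
    storage[offset + index] = value;
  }

  // Risky: only the JVM's check against the whole storage array applies.
  void setQuick(int index, double value) {
    storage[offset + index] = value;
  }
}
```

With a ten-element storage array split into two views of five elements each,
calling setQuick(5, 1.0) on the first view silently overwrites element 0 of
the second view, while set(5, 1.0) throws.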

> I think a nice addition would be to allow specifying if you want to
> have ifs or assertions (it's generated code anyway) so that people can
> pick what they are comfortable with. I'll add it to the TODO list --
> thanks for inspiration, Sean.
>

Dawid,

Did you generate this code using a separate code generator step?  Or using a
class loader magic thing?

Can you point us at the heart of the code generator in either case?

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> They are there to assert things that should never ever be
> false, if the program is working, no matter what bad input or network faults
> occur. Turning them off should have zero effect in a working program.

This is an interesting point of view and I think I like it even better
than my own, only I would change "zero effect in a working program" to
"zero effect in a correctly written program". That is adhering to API
contracts, etc. This was actually my understanding when I was writing
HPPC -- that once your code passes a million unit/ integration tests
you can trust it enough to run it without assertions in production...
when/if something breaks, you will know anyway because of malformed
output or other exceptions and then you can rerun with assertions on
or add more tests.

I guess both viewpoints have pros and cons, so convincing anybody does
not make much sense, but it's an interesting discussion nonetheless.

> I don't like the "checked" and "unchecked" getter idiom. We have such a
> thing in Vector -- set() and setQuick(). From reading the API, well, who's
> not going to choose the quick operation over the "slow" one?

Exactly. I always found it a bit odd when I was presented two versions
of essentially the same method... and usually went with the "riskier"
one.

> unchecked version is I think tiny. Bad input will just result in an
> ArrayIndexOutOfBoundsException or NullPointerException quickly anyway.

It may or it may not, if your storage is larger than your input. I
guess most of the problems are off-by-one errors and not
off-by-million (or negative index) errors.

> So in this new annotation (which sounds great), I wonder if one can get away
> with leaning on the bounds checking and such that the JVM will do anyway.
> Everyone's happy then. It's fast and correct.

Not really, because it's a flat array underneath, so offsets are computed
from the indexes and the dimension sizes and you can get a wrong/different result
while still being in-bounds of the storage array.
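
A tiny sketch of that failure mode, assuming a row-major layout with 3 rows
and 4 columns. The class and method names are made up for illustration and
are not taken from the generated code.

```java
// Hypothetical demo of an out-of-range index that stays inside the storage.
final class FlatOffsetDemo {
  static final int ROWS = 3;
  static final int COLS = 4;

  // Row-major offset with no per-dimension checks.
  static int offsetUnchecked(int row, int col) {
    return row * COLS + col;
  }

  // Same offset, but each index is validated against its own dimension.
  static int offsetChecked(int row, int col) {
    if (row < 0 || row >= ROWS || col < 0 || col >= COLS) {
      throw new IndexOutOfBoundsException("row=" + row + ", col=" + col);
    }
    return row * COLS + col;
  }

  // Writes through a bad index and reads the value back from another cell.
  static byte demo() {
    byte[] data = new byte[ROWS * COLS];
    // Column 5 does not exist, yet offset 5 is a valid index into data,
    // so the JVM's own bounds check raises nothing.
    data[offsetUnchecked(0, 5)] = 42;
    // The write landed on element (1, 1).
    return data[offsetUnchecked(1, 1)];
  }
}
```

Here demo() returns 42: the JVM's array bounds check cannot catch the bad
column, whereas offsetChecked(0, 5) throws.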

I think a nice addition would be to allow specifying if you want to
have ifs or assertions (it's generated code anyway) so that people can
pick what they are comfortable with. I'll add it to the TODO list --
thanks for inspiration, Sean.

Dawid

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Ted Dunning <te...@gmail.com>.

My agreement about this is so strong that I had to be reminded recently that
asserts had to be enabled in order to work.  That is such a bad design (in
my opinion) that I had reconstructed my view of the world and forgotten that
assert is not a synonym for if/throw.

Lately, I view assert as almost as strong a signal of a latent bug as catch
(Exception).

On Sat, Mar 26, 2011 at 4:01 AM, Sean Owen <sr...@gmail.com> wrote:

> The tragic thing about asserts is that they are a perfectly useful
> construct, but because they are rarely used 100% correctly, must be left on,
> and then both defeat their own purpose (might as well be an "if") and worse,
> harm the program (overhead of checks that the programmer thought would be
> off). This is essentially why we don't use it in Mahout. At least it's the
> reason in my head.

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Sean Owen <sr...@gmail.com>.

This is a useful tangent.

Assertions are definitely designed to be off in production, or else they
serve no purpose. They are there to assert things that should never ever be
false, if the program is working, no matter what bad input or network faults
occur. Turning them off should have zero effect in a working program.

So they can't be for argument checking for instance or anything that could
be true in an ugly world. If that's what they're being used for here I'd
disagree with it -- it changes behavior. Those have to be "if" statements
checking args.
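
To illustrate the distinction, a minimal sketch with a made-up class
(ByteStackSketch is hypothetical and not taken from Mahout or HPPC): caller
mistakes are caught with "if" and a throw, while the assert only guards an
internal invariant.

```java
// Hypothetical fixed-capacity stack showing both kinds of checks.
final class ByteStackSketch {
  private final byte[] data = new byte[16];
  private int size;

  void push(byte b) {
    // Caller error can happen in an ugly world: always checked.
    if (size == data.length) {
      throw new IllegalStateException("stack full");
    }
    data[size++] = b;
    // Internal invariant: false only if this class has a bug.
    // Skipped entirely unless the JVM runs with -ea.
    assert size > 0 && size <= data.length : "size out of range: " + size;
  }

  byte pop() {
    if (size == 0) {
      throw new IllegalStateException("stack empty");
    }
    byte b = data[--size];
    assert size >= 0 : "size went negative";
    return b;
  }

  // Reports whether assertions are enabled in the running JVM.
  static boolean assertionsEnabled() {
    boolean enabled = false;
    assert enabled = true; // the assignment only executes under -ea
    return enabled;
  }
}
```

Removing the asserts changes nothing for a correct ByteStackSketch, while
removing the "if" checks changes what callers observe on bad input.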

This is theory. In practice, I've always encountered the same general unease
about running production differently than test. And it's justified. Who
knows whether asserts were used correctly, per above?

The tragic thing about asserts is that they are a perfectly useful
construct, but because they are rarely used 100% correctly, must be left on,
and then both defeat their own purpose (might as well be an "if") and worse,
harm the program (overhead of checks that the programmer thought would be
off). This is essentially why we don't use it in Mahout. At least it's the
reason in my head.

I don't like the "checked" and "unchecked" getter idiom. We have such a
thing in Vector -- set() and setQuick(). From reading the API, well, who's
not going to choose the quick operation over the "slow" one? If the caller
has bothered to think about issues of bounds checking, why would the caller
knowingly send bad input into the method that needs to be checked? So we end
up with everyone using the unchecked method. (So what's the point of set()
then?)

In Vector, and maybe this new code too, the consequence of mis-using the
unchecked version is I think tiny. Bad input will just result in an
ArrayIndexOutOfBoundsException or NullPointerException quickly anyway.
That's fine. It's less clean than explicitly checking and throwing a very
proper exception, but, we don't really care about dealing with these
situations cleanly, as long as the program correctness isn't compromised.

So we probably get away with it just fine in Vector with setQuick(). But
then... if this method has the speed and correctness properties everyone
wants, why is there a separate set() method? On this line of reasoning, its
only feature is handling a very rare programmer error a little more cleanly.
So it should just be the one set() method.

So in this new annotation (which sounds great), I wonder if one can get away
with leaning on the bounds checking and such that the JVM will do anyway.
Everyone's happy then. It's fast and correct. However there may be times
where not explicitly checking input will result in a real problem. You just
can't avoid argument checking then.

In cases where that vital argument checking is becoming a performance
problem, I'd strongly suspect the better answer is to construct a method
tailored to that use case. For example, often the answer is some kind of
bulk-setter method.
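
A sketch of what such a bulk setter might look like. BulkSetSketch and
setRange() are hypothetical names, not part of Mahout's Vector API; the point
is only that the arguments are validated once, outside the loop.

```java
// Hypothetical container with a range setter that validates once.
final class BulkSetSketch {
  private final double[] values;

  BulkSetSketch(int size) {
    values = new double[size];
  }

  // Sets values[from..to) to the given value.
  // The range is checked once; the loop has no explicit per-element checks.
  void setRange(int from, int to, double value) {
    if (from < 0 || to > values.length || from > to) {
      throw new IndexOutOfBoundsException(
          "from=" + from + ", to=" + to + ", size=" + values.length);
    }
    for (int i = from; i < to; i++) {
      values[i] = value;
    }
  }

  // Checked single-element read, used here to inspect the result.
  double get(int index) {
    if (index < 0 || index >= values.length) {
      throw new IndexOutOfBoundsException("index=" + index);
    }
    return values[index];
  }
}
```

A bad range is rejected before any element is written, so the caller gets
one clean exception and the per-element work stays cheap.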

On Fri, Mar 25, 2011 at 8:29 PM, Ted Dunning <te...@gmail.com> wrote:

> That is why I prefer the default, even in production, to be to check.  Not
> checking should be the (very) special case where there are obscure reasons
> to know that the access is correct that
> the optimizer can't see.
>
> On Fri, Mar 25, 2011 at 12:17 PM, Dawid Weiss
> <da...@cs.put.poznan.pl> wrote:
>
> > The difference in speed may be marginal since newer hardware/hotspot will
> > predict those branches almost never take place and probably discard them.
> >
>

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

I'll add a task to check what kind of overhead this actually is on
modern hardware, and will report if I find out anything interesting.

Dawid

On Fri, Mar 25, 2011 at 9:49 PM, Ted Dunning <te...@gmail.com> wrote:
> Since assertions are almost always DISabled in production, I prefer to see
> explicit if's and throw's
> in the code.
>
> On Fri, Mar 25, 2011 at 1:44 PM, Dawid Weiss <da...@cs.put.poznan.pl>
> wrote:
>>
>> Do you mean running with assertions enabled in production or having
>> hardcoded, explicit ifs/throws? Just curious.
>>
>> Dawid
>>
>> On Fri, Mar 25, 2011 at 9:29 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > That is why I prefer the default, even in production, to be to check.
>> >  Not
>> > checking should be the (very) special case where there are obscure
>> > reasons
>> > to know that the access is correct that
>> > the optimizer can't see.
>> >
>> > On Fri, Mar 25, 2011 at 12:17 PM, Dawid Weiss
>> > <da...@cs.put.poznan.pl>
>> > wrote:
>> >>
>> >> The difference in speed may be marginal since newer hardware/hotspot
>> >> will
>> >> predict those branches almost never take place and probably discard
>> >> them.
>> >
>
>

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Ted Dunning <te...@gmail.com>.

Since assertions are almost always DISabled in production, I prefer to see
explicit ifs and throws in the code.

On Fri, Mar 25, 2011 at 1:44 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:

> Do you mean running with assertions enabled in production or having
> hardcoded, explicit ifs/throws? Just curious.
>
> Dawid
>
> On Fri, Mar 25, 2011 at 9:29 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > That is why I prefer the default, even in production, to be to check.
>  Not
> > checking should be the (very) special case where there are obscure
> reasons
> > to know that the access is correct that
> > the optimizer can't see.
> >
> > On Fri, Mar 25, 2011 at 12:17 PM, Dawid Weiss <
> dawid.weiss@cs.put.poznan.pl>
> > wrote:
> >>
> >> The difference in speed may be marginal since newer hardware/hotspot
> will
> >> predict those branches almost never take place and probably discard
> them.
> >
>

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Do you mean running with assertions enabled in production or having
hardcoded, explicit ifs/throws? Just curious.

Dawid

On Fri, Mar 25, 2011 at 9:29 PM, Ted Dunning <te...@gmail.com> wrote:
> That is why I prefer the default, even in production, to be to check.  Not
> checking should be the (very) special case where there are obscure reasons
> to know that the access is correct that
> the optimizer can't see.
>
> On Fri, Mar 25, 2011 at 12:17 PM, Dawid Weiss <da...@cs.put.poznan.pl>
> wrote:
>>
>> The difference in speed may be marginal since newer hardware/hotspot will
>> predict those branches almost never take place and probably discard them.
>

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Ted Dunning <te...@gmail.com>.

That is why I prefer the default, even in production, to be to check.  Not
checking should be the (very) special case where there are obscure reasons
to know that the access is correct that the optimizer can't see.

On Fri, Mar 25, 2011 at 12:17 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:

> The difference in speed may be marginal since newer hardware/hotspot will
> predict those branches almost never take place and probably discard them.
>

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Hi Ted,

I used assertions to be consistent with the rest of HPPC (there are no
explicit parameter validation checks unless you ask for them using -ea). I
think tests should always run with -ea and having two versions of accessor
methods misses the point (either you want those validation checks and run
with -ea or you don't want them and skip assertions entirely). I consider
HPPC a low-level library, so I assume whoever decides to use it knows how to
run the JVM with assertions enabled... but I also agree that for a larger
audience explicit ifs/throws may be a better choice to capture problems
earlier in the development process.

The difference in speed may be marginal since newer hardware/hotspot will
predict that those branches are almost never taken and probably discard them.

Dawid


On Fri, Mar 25, 2011 at 5:47 PM, Ted Dunning <te...@gmail.com> wrote:

> Dawid,
>
> This is impressive and there have definitely been times that this would
> have helped.
>
> Can you say more about why you used assert's for checking bounds instead of
> having quick accessors without checks and normal
> accessors with checks (that hopefully will get inlined and lifted)?
>
> The problem I have with assert is that it doesn't normally have any effect.

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Ted Dunning <te...@gmail.com>.

Dawid,

This is impressive and there have definitely been times that this would have
helped.

Can you say more about why you used asserts for checking bounds instead of
having quick accessors without checks and normal accessors with checks
(that hopefully will get inlined and lifted)?

The problem I have with assert is that it doesn't normally have any effect.

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Ted Dunning <te...@gmail.com>.

This is worth a separate artifact release.  It would really be nice to be
able to use this without Mahout or HPPC or Carrot.  Lucene is a great
example.  They don't need any of our stuff, but Mike's reaction indicates
that they might be thrilled to use this annotation.

On Sat, Mar 26, 2011 at 2:18 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:

> We do this in Carrot2 as well, with numerous stuff... I'll try to wrap
> the existing prototype, add some docs and examples and ship it with
> the next major release of HPPC (which should be out in a week or two).
>
> Dawid
>
> On Sat, Mar 26, 2011 at 9:51 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
> > This is a fabulous idea!
> >
> > We do this in Lucene, "manually", in the indexer, where we need a
> > simple struct to hold details for each unique term we've seen.  We
> > maintain our own (1D) parallel arrays for this...
> >
> > Mike
> >
> > http://blog.mikemccandless.com

Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

We do this in Carrot2 as well, in numerous places... I'll try to wrap
the existing prototype, add some docs and examples and ship it with
the next major release of HPPC (which should be out in a week or two).

Dawid

On Sat, Mar 26, 2011 at 9:51 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> This is a fabulous idea!
>
> We do this in Lucene, "manually", in the indexer, where we need a
> simple struct to hold details for each unique term we've seen.  We
> maintain our own (1D) parallel arrays for this...
>
> Mike
>
> http://blog.mikemccandless.com
Re: Flattened arrays of simple structures (valuetype-like classes).

Posted by Michael McCandless <lu...@mikemccandless.com>.

This is a fabulous idea!

We do this in Lucene, "manually", in the indexer, where we need a
simple struct to hold details for each unique term we've seen.  We
maintain our own (1D) parallel arrays for this...

Mike

http://blog.mikemccandless.com
