You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Gilles <gi...@harfang.homelinux.org> on 2016/09/17 02:24:17 UTC

Re: [RNG] About RNG-6

Hi.

On Sat, 17 Sep 2016 01:38:16 +0200, Emmanuel Bourg wrote:
> Le 15/09/2016 � 16:37, Gilles a �crit :
>
>> I don't understand "motivations for the API".
>
> I just mean a short introduction to the component, something 
> explaining
> why it exists, its usefulness (performance, randomness quality), the
> implementations supported, etc.

I took care of this by expanding the "Overview" page.

I/you/anyone can add links to interesting references about the
subject.

>> I won't do that.
>> It was already a pain to create those tables as they are (see the
>> doc source).
>
> I understand you don't want to do it, editing apt files isn't fun. 
> But
> are you opposed to the idea?

Personally, I prefer separate tables.

Something useful would be to have a script for creating them.

[I have one for the benchmark table (takes the output of the
JMH benchmark include in "src/test").  Parsing the output of
"TestU01" is slightly more challenging...]

>>> - The internal packages should be hidden from the Javadoc,
>>
>> I do not know how do that.
>
> Packages can be excluded from the Javadoc by configuring the
> maven-javadoc-plugin with the <excludePackageNames> element:
>
> 
> https://maven.apache.org/plugins/maven-javadoc-plugin/examples/exclude-package-names.html
>
>
>> And I don't see the point, since from the "binary" POV, they
>> are "public".
>
> That's purely from the documentation point of view here. If we don't
> want users to use these packages, let's not mention them in the 
> Javadoc.
> That's an extra security with the "internal" package name.

(false) Security through obscurity...

Users who don't read the docs deserve what they get. ;-)
More probably such users will only copy/paste from the examples.
And be safe.

>> Users can select a generator without knowing the gory details.
>> If they want to know more, they go to the internals.
>> It they really want to know, they go to the references.
>>
>> Internal should not be hidden: contributors who'd fancy to provide
>> additional generators do not have to go through any hoops.
>>
>> Or do you suggest that this component should supply two JARs:
>>   * commons-rng-api
>>      Contents of "o.a.c.rng"
>>   * commons-rng-core
>>      Contents of "o.a.c.rng.internal"
>> ?
>>
>> And two sets of Javadoc?
>
> No thanks, too complicated. Contributors will just checkout the code 
> and
> quickly figure out how it's organized and how it's meant to be 
> extended,
> I don't worry about that. But I worry about mere users getting 
> confused
> about the API.

I worry about the reverse.

The lack of regular contributors is rather more actual than the
occasional "lost newbie".

>>> - Why isn't the UniformRandomProvider interface simply named
>>> RandomProvider?
>>
>> Maybe because we want to be transparent on what the code does (?).
>>
>> With this name, there is no ambiguity: no more, no less than what
>> the name implies.
>
> I don't feel there is an ambiguity here, probably because I'm not an
> expert in random number generators.

Then, please consider that there may be an ambiguity.

> I just see an excessively long class
> name that could be shortened. CM4/Hipparchus just use 
> 'RandomGenerator'
> for this interface, that's in line with the name of the component and 
> a
> better choice in my opinion.

No. It's ambiguous.

Shortening is also not a concern because you don't often instantiate
it.  Usually, it would be done _once_ per program.

>>> - UniformRandomProvider mimics the java.util.Random methods which 
>>> is
>>> nice. Is there any value in also implementing the nextGaussian() 
>>> method
>>> such that our API is a perfect drop-in replacement for the JDK 
>>> class?
>>
>> No. [IMHO.]
>> This was discussed on this list (and cf. above point).
>
> That would be convenient for migrating code using java.util.Random to
> commons-rng though. CM has it and the implementation doesn't look so
> complicated, I'm not sure to understand why you think it should go
> elsewhere. Even if more elaborated implementations could be latter
> imagined there is no harm providing a default one as a convenience 
> for
> migrating legacy code.

There is harm in that it creates confusion.

Please look beyond "java.util.Random" for reference about what it
means to generate random numbers.

People who design generators mostly deal with how to make them
good at generating uniform sequences (see the "Quality" section
of the userguide).

Generating non-uniform sequences is another business.

 From a programming POV, there is nothing for singling out
the Gaussian distribution.

In a library for scientific use, the Normal distribution will
have a higher usage count thna other distributions, but all
of them may have their uses, and thus should be provided.
This is not the business of this low level component.

>>> - For code relying on the java.util.Random type, it might be useful 
>>> to
>>> provide an adapter turning an UniformRandomProvider into a
>>> java.util.Random implementation (Hipparchus has one).
>>
>> This was discussed on this list.
>>
>> And see Commons Math's classes "o.a.c.math4.random.RandomUtils" and
>> "o.a.c.math4.random.RngAdaptor" classes.
>>
>> In the latter (being deprecated as of its creation), I indicated why
>> it is a bad idea.
>>
>> Again, if we really want to support the use-case you proposed, it's
>> a utility that should go in another component.
>
> The component already converts a java.util.Random into our
> RandomProvider,

It is purely for comparison purpose; it should be obvious that
using "RandomSource.JDK" is not recommended beyond illustrative
purpose (such as the performance ratio).

> I tend to think it's obvious to support the reverse
> operation in the same component.

Not the business of a library focused on implementations.

Bridge, wrapper, whatever of the sort is a utility.
Let's talk about a higher-level utility that depend on the core
(this component).

>>> - The choice of an enum for the factory class RandomSource is a bit
>>> unusual. It is probably ok for simple RNG with a simple seed, but 
>>> with
>>> more complex implementations I find it less intuitive to use.
>>
>> Less intuitive than what?
>
> Less intuitive than other designs, such as the one I proposed.

As you've guessed, I don't agree.

Similar to the "long" name rationale, the factory is a marginal
part of usage.

>>> When a user types this in its IDE, there is no type information nor
>>> contextual documentation for the parameters expected. The user has 
>>> to
>>> know the type of the seed and the meaning of the parameters 
>>> expected,
>>> the IDE won't be able to help here (with CTRL+P and CTRL+Q in 
>>> IntelliJ).
>>
>> Honestly, I don't care.
>> To put it less bluntly, the choice of RNG to instantiate is not a
>> matter of IDE.
>
> I don't think you understand my point. I don't expect the IDE to pick
> the right RNG for my use case of course, that's impossible. But I 
> expect
> the IDE to help me using the API by hinting the type expected and
> providing contextual documentation about the method used. The current
> design makes this impossible. The users are forced to open the 
> Javadoc
> in their browser when they want to instantiate a generator they don't
> know or remember about. I don't think it's a good idea.

I did understand your point. And I didn't agree.

Mine is that don't need an intelligent IDE to use Commons RNG.
You just need to read the docs _once_ in your life.

To be honest, you took the sole example ("TWO_CMRES_SELECT") among
15 implementations where additional parameters (beyond source type
and seed value) are used.

This was my design choice in order to provide a family of generators
(of which "TWO_CMRES" is one member).

It could have been equally possible to make the 2 ints part of the
seed (as it is done in other families like "MWC_256").
Here I hesitated because of the risk to have too many seemingly
different seeds that would produce the same generator (and sequence).

But I doubt that there will be other more "complex" uses of the
additional parameters than as indices in a table.
So your point about the IDE is just moot.

>> Practically, the lack of type information (which I don't like 
>> either)
>> was a trade-off to have a dead-simple API.
>
> I'd say it's certainly a concise API, but its simplicity is 
> debatable.

Then anything is.
And I don't agree.

>> This design came after 9 months of sustained work.
>> Nothing to be proud of in terms of LOC/day...
>> But what you propose to drop is the only non trivial part of
>> this work.
>>
>> As I said above, this is a trade-off, and it looks like a fault
>> only because Java is strongly typed.
>>
>> The automatic seed conversion is a huge advantage for casual
>> users; they can still seed them with a single "long" (and under
>> the hood get a good "native" seed).
>> But the doc is there to warn them that RNGs usually have a larger
>> state.
>
> What's the use case for accepting a long[] if the API expects a 
> int[]?
> That's a bit odd.

It's not; that's all the simplicity of the factory API: a user can
instantiate any generator with the same seed.

> At least, instead of using an Object in the create method, we could 
> use
> a dedicated Seed class containing factory methods and the conversion
> logic. Its API could look like this:
>
>   Seed.of(int...)
>   Seed.of(long...)
>   Seed.toInt()
>   Seed.toLong()
>   Seed.toIntArray()
>   Seed.toLongArray()
>   Seed.createIntArray(n)
>   Seed.createLongArray(n)


"SeedFactory" (internal) and API methods in "RandomSource"
do that (IIUC).

As an aside, I share your instinctive reluctance to deal
with "Object" argument.
But after _much_ pondering, this is surely one case where
it is undoubtedly adequate: the seed is _really_ not of a
definite type, it only needs to be a sequence of bits that
can be transformed in a good initial state.

Hence having several methods, like there were in CM, one
taking a "long", another an "int" and yet another an "int[]"
was plainly misleading.

That was confusing: why several seeding methods?

This is all mentioned in the Javadoc.

>> No, no, no.
>> Avoiding the "Serializable" clutter was a _requirement_.
>>
>> [...]
>>
>> It's also a feature that this component will never have to deal
>> with CVE related to exploitation of something "Serializable" that
>> should not have been...
>
> The random generators just hold a few primitive variables, there is 
> no
> risk here.

"Serializable" is to be avoided (even if you like to skip the
reference I put there to back the claim, which is not mine).

>>> - Why is the byte[] state encapsulated into the RandomSource.State
>>> class? Why not using a byte array directly?
>>
>> Because encapsulation is the hallmark of good OO programming?
>
> True but the RandomSource.State class isn't encapsulating the actual
> state of the generators here. It encapsulates an image of the 
> generator
> state. Even if the content of the byte array was changed it would not
> impact the state of the generator. This extra layer is really 
> unnecessary.

On the one hand you want to hide the Javadoc, but on the other, when
there is a little measure of "Don't fiddle with the representation
of the state", you want to expose it fully.

I repeat, there is the use-case of save/restore without persistence
where it is nicer to just deal with a "State" object rather than an
array object.

You are of course right that it is purely cosmetic.  I repeat that
I had to do it for the sake of the persistence use-case.

>> If some application only needs to save/restore without actual
>> persistence, it can do without dealing with the "ugliness" of
>> "byte[]".
>
> byte[] is ugly???

Yes.
Why would the "state" be an array.
In fact it is not in many implementations, but they provide a
transformation into "byte[]" in order to allow easy serialization,
without "Serializable"

>> And it will continue to work even if the internal representation
>> is changed.
>
> Why would we change that?


This is far-fetched, I know.
I was thinking of additional measures to check the integrity
of the state...

>> There were a lot of discussions here about what "beta release"
>> means and how to implement it in practice (package naming,
>> versioning scheme).
>>
>> They never led anywhere.
>
> Did you forget my suggestion to use a different package name and
> artifactId for the beta releases?

I don't remember exactly who said what.

I do remember though that I was in favour of something that would
speed up releases held back because of the monstrous nature of
Commons Math.

This problem has another solution: small components such as
Commons RNG.

This code which I proposed to release has been developed in
the same arena as anything else here in plain sight.

Nobody had anything to say about it for months.

I don't think it's serious to consider that potential users who
might have a need for a "java random generator" will be those
suscribed to Commons' "announce" ML.

Much more likely they'll do a search for what's available on
repositories.

>> I seriously doubt that having code tagged "beta" would be more
>> successful at getting users to provide feedback.
>
> A beta release will get more exposure than a snapshot, because it'll 
> be
> announced on the mailing lists, on Twitter and available on Maven 
> Central.
>
>> It will more likely turn them away, thinking that the generators
>> might be buggy.  Which is not the case.
>
> Not if we explain that the code is inherited from CM and very mature 
> but
> moved to a new component with a reworked API.

Who will do that dissertation?

The code is not good because it is inherited from CM.

The code was written because of the many defects (IMHO) of what
is/was in CM and in "java.util.Random".
It was written to be a repository of alternative generators with most
features needed for simple usage and none of the misfeatures which
casual users might be tempted to abuse (such as a "reseed" method,
see the Javadoc).

The code is good primarily because the uniformity was tested with
"TestU01" (and users can run it themselves).
This test was not available in released version of CM.

That code also provides (much) faster implementation than were
available in CM.
Although the CM "team" did all it could to prevent me from looking
in that direction.

And it feels a bit like the same right now, by attempting to
thwart the release of a code that is certainly not worse than
what was released for years without such a last-minute fuss.

>>> but I'm -1 for releasing the current code as 1.0.
>>
>> Any technical reasons?
>
> Simply the reasons I mentioned that can't be addressed in a later
> release without breaking the binary compatibility:
> - UniformRandomProvider is too long, rename back to RandomGenerator

No.

> - Add nextGaussian() to the UniformRandomProvider interface

No.

> - RandomSource hides type information and parameters' semantic

True but "not a problem".

> - RandomSource.State isn't necessary

True but "not a problem".

> Emmanuel Bourg
>
> (My apologies for the late feedback, I wish I had more time to do it
> earlier)

We are all free to set our priorities however we want.

I respect yours (of not having had the time to help with
the development of this component).

Please respect mine now.
I "did the work" and I think it's far from being so bad as to
deserve a -1.

You cannot reasonably sustain in the same breath that it was
impossible to look at 92 lines of code (the size of the public
API) for the last 3 to 6 months, but that you can valuably
question months of work, in a single mail, precisely when the
release vote is started.

People will use this code, or not, but nobody here can claim
to know better about this subject so as to come up with a
viable alternative.
And if they did now better, then they should have said so at
the proper time.

Flaws are quite possible of course.
But actual use is to uncover them (and we'll fix them when
they are reported), not mileage talk that only serves to
further delay the release.

There are missing features, for sure, such as "jump ahead",
but enabling this would require a lot of additional commitment,
and when done, it will certainly be worth a new release.


Regards,
Gilles

[1] I recall that the development version of CM depends on
     Commons RNG (~170 uses of "UniforRandomProvider" and
     "RandomSource"), and it all still works...


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org