You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Gilles Sadowski <gi...@gmail.com> on 2020/02/18 16:31:05 UTC

Fwd: [math] Discuss: New feature MiniBatchKMeansClusterer

Hello.

Le mar. 18 févr. 2020 à 04:49, 陈 涛 <ch...@hotmail.com> a écrit :
>
> Hi Gilles:
>
>    I really do not know if anyone received my last mail, no one replay me for a long time so I send it again and copy to you with another email system.

Sorry for the delay. :-}

>
> > Some remarks:
>
> > * I didn't get why the "KMeansPlusPlusCentroidInitializer" class
> > does not call the existing "KMeansPlusPlusClusterer".
> > Code seems duplicated: As there is a case for reuse, the currently
> > "private" centroid initialization code should be factored out.
>
> This is alpha version for discuss the "MiniBatchKMeansClusterer" algorithm,

I guess that you mean that we discuss your implementation of the
algorithm referenced in the Javadoc.

> and when "centroidOf" is extracted from "KMeansPlusPlusClusterer",
>
> the "KMeansPlusPlusClusterer" is not "KMeansPlusPlusClusterer" anymore but "KMeansClusterer",

I don't follow.

>
> this is a significant change, so I did not reactor it.

Significant changes are welcome (since the next release will contain
other major changes anyways) if they improve the code base (like e.g.
reducing code duplication).

>
>
>
> > * In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
> > that a computation is performed (as opposed to selecting from an
> > existing data structure).

I think I'd prefer "selectCentroids".

>
> It is extract from "KMeansPlusPlusClusterer.centroidOf", should remain be "centroidOf"?

I don't understand.

It would be easier if you create a pull request, so that we can clearly see
what codes are added/removed/changed.

>
> The subclass "RandomCentroidInitializer" and "KMeansPlusPlusCentroidInitializer" indicate the algorithm used.
>
>
> > * Not convinced that there should be so many constructors (in most
> > cases, it makes no sense to pick default values that are likely to
> > be heavily application-dependent.
>
> I can add more constructors.

No, the constructors with default values clutter the API, for
no obvious gain IMHO.
[If the default values make sense, they must be documented.]

>
> I'd like a builder class more than constructors, but does not meet the historical code style.

Now is the time for improving the API.
It would be quite helpful to create a report on JIRA with "sub-tasks"
for all such API proposed changes.

> > * Input should generally be validated: e.g. the maximum number of
> > iterations should not be changed unwittingly; rather, an exception
> > should be raised if the user passed a negative value.
>
> Thanks for your advices, I will improve these.
>
> > Could be nice to illustrate (not just with a picture, but in a table
> > with entries average over several runs) the differences in result
> > between the implementations, using various setups (number of
> > clusters, stopping criterion, etc.).
>
> I will make more tests, include benchmarks.
>
> It is a challenge for me to generate the various kinds of test data,
>
> could !nybody supply me the test data of this comparsion: http://commons.apache.org/proper/commons-math/userguide/ml.html

They are generated programmatically; code is here:
    https://gitbox.apache.org/repos/asf?p=commons-math.git;a=tree;f=src/userguide/java/org/apache/commons/math4/userguide

[I've just updated the codes so that they compiles and run using
the new dependencies (see the "README" file).]

> > "MT_64" is probably not the best default.  And this is one of the
> > parameters from which there should not be a default IMO.
>
~ I will do more tests

You don't need to test the generators; users should choose
by themselves (from those provided in "Commons RNG").

>
> > [Note: there are spurious characters in your message (see e.g. the
> > paragraph quoted just above) that make it difficult to read.]
>
> I had well format my mail in my mail box, it may been changed by the Mail List service.
>
> I will try various kinds of mail editor. It will helpful if you told me which mail editor is work well with the ML.

It's probably an encoding thing (setting it to "UTF-8" should be
fine).

Best regards,
Gilles

> > [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org