Posted to user@flink.apache.org by Natia Chachkhiani <na...@gmail.com> on 2022/06/03 02:28:05 UTC

flink-ml algorithms

Hi,

I am running OnlineKMeans from the flink-ml repo on a small dataset, and I've
noticed that the results (the cluster assignments) are not consistent across
runs. I have set both parallelism and globalBatchSize to 1, and I do a simple
fit and transform on each data point ingested. Is the order of processing not
guaranteed, or am I missing something?

Thanks,
Natia

Re: flink-ml algorithms

Posted by Natia Chachkhiani <na...@gmail.com>.

Hi, I have another question. Is the KMeans implementation in flink-ml the
same as Spark's StreamingKMeans?
Should the accuracy and results on the same dataset be comparable between the
two?

Re: flink-ml algorithms

Posted by Natia Chachkhiani <na...@gmail.com>.

Thanks for the reply, Zhipeng and Jing.
Running OnlineKMeans with a fixed initial model removed the randomness!


Re: flink-ml algorithms

Posted by Zhipeng Zhang <zh...@gmail.com>.

Hi Natia,

As I understand it, OnlineKMeans processes records in the same order as the
input data.

Are you running OnlineKMeans on one data point at a time with a random initial
KMeansModel? Could you try a fixed initial model, following [1]?

[1]
https://github.com/apache/flink-ml/blob/239788f2b1f1f3a4e55ca112517980b598705a15/flink-ml-lib/src/test/java/org/apache/flink/ml/clustering/OnlineKMeansTest.java#L354

-- 
best,
Zhipeng
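
Editor's illustration of the suggestion above: a minimal sketch in plain
Python, not the flink-ml implementation. It assumes a simplified online
k-means that handles one point per step in input order (comparable to
globalBatchSize = 1) and gives every initial centroid a weight of 1. The
names online_kmeans, nearest and random_init are hypothetical helpers made up
for this example. It shows that a fixed initial model gives the same
assignments on every run, while a different initial model can change the
assignments even though the input order is unchanged.

```python
import random


def nearest(centroids, point):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((c - p) ** 2 for c, p in zip(centroids[i], point)),
    )


def online_kmeans(points, init_centroids, init_weight=1.0):
    """Assign each point to its nearest centroid, then update that centroid.

    Points are processed strictly in input order, one per step. The update is
    a weighted running mean, which is a simplification chosen for this sketch.
    """
    centroids = [list(c) for c in init_centroids]
    weights = [init_weight] * len(centroids)
    assignments = []
    for p in points:
        j = nearest(centroids, p)
        assignments.append(j)
        old = weights[j]
        weights[j] = old + 1.0
        centroids[j] = [(c * old + x) / weights[j] for c, x in zip(centroids[j], p)]
    return assignments


def random_init(k, dim, seed=None):
    """Hypothetical helper: k random centroids; unseeded calls differ per run."""
    rng = random.Random(seed)
    return [[rng.uniform(0.0, 10.0) for _ in range(dim)] for _ in range(k)]


points = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.2, 8.8), (0.2, 0.1), (8.9, 9.1)]

# Fixed initial model: every run starts from the same centroids.
fixed = [(0.0, 0.0), (10.0, 10.0)]
run_a = online_kmeans(points, fixed)
run_b = online_kmeans(points, fixed)

# Same data and order, but a different initial model (centroids swapped).
swapped = [(10.0, 10.0), (0.0, 0.0)]
run_c = online_kmeans(points, swapped)

# A seeded random initial model is also reproducible across runs.
seeded_a = online_kmeans(points, random_init(2, 2, seed=42))
seeded_b = online_kmeans(points, random_init(2, 2, seed=42))

print(run_a, run_b, run_c)
```

In this sketch run_a and run_b are identical, while run_c labels the same
points differently, so an unseeded random initial model is enough to explain
run-to-run differences without any reordering of the input.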

Re: flink-ml algorithms

Posted by Jing Ge <ji...@ververica.com>.

Hi,

This looks like an evaluation on a small dataset. In that case, could you
share your data sample and code? Also, have you tried KMeans on the same
dataset, and did it give inconsistent results too?

Best regards,
Jing
