Posted to user@flink.apache.org by Varun Dhore <va...@gmail.com> on 2018/05/11 01:16:57 UTC

Latency with cross operation on Datasets

Hello Flink community,

I am trying to understand the latency involved in the cross operation. Below
are my tests.

In plain Java:
1. Create 2D array 1, populated with 1 million rows and 3 columns of
randomly generated double values.
2. Create 2D array 2, populated with 100 rows and 3 columns of randomly
generated double values.
3. Run a nested for loop over the 1 million x 100 pairs and perform a
Euclidean distance calculation inside the inner loop.
4. Collect the output in a List of doubles and print the size of the list
at the end.

The above steps complete in about 15 seconds in plain Java on my laptop.
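
A minimal sketch of the plain-Java test described in the steps above. It is illustrative only, not the code that was actually timed: the class name, the method names, the fixed random seed, and the value range [0, 1) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical reconstruction of the nested-loop baseline; all names are assumed.
class NestedLoopBaseline {

    // Builds a rows x 3 array filled with random doubles in [0, 1).
    static double[][] randomPoints(int rows, Random rnd) {
        double[][] points = new double[rows][3];
        for (double[] p : points) {
            for (int c = 0; c < 3; c++) {
                p[c] = rnd.nextDouble();
            }
        }
        return points;
    }

    // Euclidean distance between two points of equal dimension.
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Computes one distance per (large row, small row) pair.
    static List<Double> crossDistances(double[][] large, double[][] small) {
        List<Double> out = new ArrayList<>(large.length * small.length);
        for (double[] l : large) {
            for (double[] s : small) {
                out.add(euclideanDistance(l, s));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed is an assumption
        double[][] large = randomPoints(1_000_000, rnd);
        double[][] small = randomPoints(100, rnd);

        long start = System.nanoTime();
        List<Double> distances = crossDistances(large, small);
        long millis = (System.nanoTime() - start) / 1_000_000;

        System.out.println(distances.size() + " distances in " + millis + " ms");
    }
}
```

Holding 100 million boxed Double values in a List needs a few gigabytes of heap, so a run at full size will likely require a larger -Xmx setting than the JVM default.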

In Flink batch:
1. Read Avro files with 1 million and 100 rows, in the same format as above.
2. Perform a cross operation from the 100-row dataset with the 1 million row
dataset, using the crossWithHuge hint, since the broadcast 1 million row
dataset is the bigger one in this case.
3. Apply a map function that performs the distance calculation.
4. After the cross, do a count at the end as a closing step.


When I package the jar and submit it to the Flink cluster, the job takes
about 2 min 10 sec to complete. I can see that the 1 million row dataset
finishes loading from the Avro file in about a minute, and it is indicated
as broadcast, which makes sense. Since I am running on a single slot, I
believe no data is shipped across the network. I am wondering why it still
takes another 70 seconds to run the cross operation. I understand that a
Cartesian product can be expensive, but I would expect it to be close to the
nested loop in Java for this case. Please advise.

Thanks for your help in advance!

Regards,
Varun

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Latency with cross operation on Datasets

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Varun,

The focus of the DataSet execution is on robustness. The smaller DataSet is
stored in memory in serialized form.
Also, most of the communication happens via serialization (instead of
passing object references).
This serialization should add significant overhead compared to a
thread-local execution.
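
To illustrate the point, here is a rough sketch that compares a nested loop over plain object references with one where the inner side is kept as serialized bytes and deserialized again for every outer record. It is only an analogy: it uses java.io data streams as a stand-in and does not reproduce Flink's own serializers or memory management. The class name, method names, and data sizes are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;

// Hypothetical micro-comparison; not Flink code.
class SerializedVsReference {

    // Writes every 3-column row as three consecutive doubles.
    static byte[] serialize(double[][] points) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (double[] p : points) {
                for (double v : p) {
                    out.writeDouble(v);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Reference passing: the inner loop reads records straight from the heap.
    static double sumByReference(double[][] outer, double[][] inner) {
        double total = 0.0;
        for (double[] o : outer) {
            for (double[] i : inner) {
                total += distance(o, i);
            }
        }
        return total;
    }

    // Serialized inner side: records are rebuilt from bytes on every pass.
    static double sumBySerializedInner(double[][] outer, byte[] innerBytes, int innerRows) {
        double total = 0.0;
        try {
            for (double[] o : outer) {
                DataInputStream in = new DataInputStream(new ByteArrayInputStream(innerBytes));
                for (int r = 0; r < innerRows; r++) {
                    double[] rec = {in.readDouble(), in.readDouble(), in.readDouble()};
                    total += distance(o, rec);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7); // seed and sizes are assumptions
        double[][] outer = new double[100_000][3];
        double[][] inner = new double[100][3];
        for (double[] p : outer) for (int c = 0; c < 3; c++) p[c] = rnd.nextDouble();
        for (double[] p : inner) for (int c = 0; c < 3; c++) p[c] = rnd.nextDouble();
        byte[] innerBytes = serialize(inner);

        long t0 = System.nanoTime();
        double byRef = sumByReference(outer, inner);
        long t1 = System.nanoTime();
        double bySer = sumBySerializedInner(outer, innerBytes, inner.length);
        long t2 = System.nanoTime();

        System.out.println("by reference:  " + (t1 - t0) / 1_000_000 + " ms, sum=" + byRef);
        System.out.println("deserializing: " + (t2 - t1) / 1_000_000 + " ms, sum=" + bySer);
    }
}
```

Both variants compute the same sum; only the way the inner records are obtained differs. The timings are a crude single-run measurement without JIT warm-up, so they indicate a trend at best.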

Best, Fabian
