Posted to dev@beam.apache.org by Ismaël Mejía <ie...@gmail.com> on 2020/12/18 16:15:22 UTC

Possible issue with bounded Read translation using SDF

Hello,

I was profiling a pipeline using Java's direct runner. It
reads about 30 text files (CSV) of 60MB each. When I started the
profiler it reported more than 40K instances of TextSource being built,
which really surprised me given the small size of the data being
processed. I wonder if I have found an over-splitting issue, introduced
when we moved to the SDF-based translation, that may affect simpler uses.

I have not dug deeper or created a JIRA yet because I wanted to ask
here first whether there is a 'valid' explanation for so many
'splits'.

Regards,
Ismaël

Re: Possible issue with bounded Read translation using SDF

Posted by Steve Niemitz <sn...@apache.org>.

I think this is actually the same problem I reported with PubsubIO [1],
but in the bounded case.  The BoundedSourceAsSDFWrapper closes (and then
re-creates) the underlying source each time it checkpoints, and the default
behavior is to checkpoint very frequently.

[1]
https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
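
To make the close-and-re-create effect concrete, here is a toy model in plain Java. It is not Beam code: ToyReader, countReaderOpens and the "checkpoint every N records" policy are made up for illustration, and the real wrapper decides when to checkpoint differently. It only shows how the number of readers built grows with checkpoint frequency rather than with data size.

```java
/** Toy model only: none of these classes are Beam APIs. */
final class CheckpointReopenSketch {

    /** Hypothetical stand-in for a source reader; it only tracks a position. */
    static final class ToyReader implements AutoCloseable {
        private long position;
        private final long end;

        ToyReader(long start, long end) {
            this.position = start;
            this.end = end;
        }

        /** Advances by one record; returns false once the range is exhausted. */
        boolean advance() {
            if (position >= end) {
                return false;
            }
            position++;
            return true;
        }

        long position() {
            return position;
        }

        @Override
        public void close() {
            // A real reader would release file handles and buffers here.
        }
    }

    /**
     * Counts how many readers get built when the reader is closed at every
     * checkpoint and re-created from the saved position afterwards.
     * The fixed recordsPerCheckpoint policy is an assumption of this sketch.
     */
    static long countReaderOpens(long totalRecords, long recordsPerCheckpoint) {
        if (totalRecords < 0 || recordsPerCheckpoint <= 0) {
            throw new IllegalArgumentException(
                "totalRecords must be >= 0 and recordsPerCheckpoint must be > 0");
        }
        long opens = 0;
        long resumeFrom = 0;
        while (resumeFrom < totalRecords) {
            opens++; // a fresh reader is built for every resumed range
            try (ToyReader reader = new ToyReader(resumeFrom, totalRecords)) {
                long readSinceOpen = 0;
                while (readSinceOpen < recordsPerCheckpoint && reader.advance()) {
                    readSinceOpen++;
                }
                resumeFrom = reader.position(); // the "checkpoint"
            } // reader is closed here and never reused
        }
        return opens;
    }

    public static void main(String[] args) {
        long records = 1_000_000;
        System.out.println("checkpoint every 100 records: "
            + countReaderOpens(records, 100) + " readers built");
        System.out.println("checkpoint only at the end: "
            + countReaderOpens(records, records) + " readers built");
    }
}
```

In this model the same one million records need 10,000 readers when checkpointing every 100 records and a single reader when the range is read in one go, so a count in the tens of thousands for a small input would be consistent with frequent checkpointing rather than with over-splitting as such.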
