Posted to dev@beam.apache.org by Chak-Pong Chung <cc...@gatech.edu> on 2018/12/14 00:20:57 UTC

SplittableDoFn for zipWithIndex for a large file

Hello everyone!

I asked the following question on Stack Overflow and would appreciate
suggestions on whether what I want is doable.

https://stackoverflow.com/questions/53746046/how-can-i-implement-zipwithindex-like-spark-in-apache-beam/53747612#53747612

If I can get the id of each partition/`PCollection` and the number of
(contiguous) lines in it, then I can first calculate the row order within
each partition/`PCollection` and then do a prefix sum to compute the offset
for each partition. This is doable in MPI or OpenMP, since I can get the
id/rank of each processor/thread.
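
To make the idea concrete, here is a minimal sketch in plain Python (not
Beam code). It assumes the partitions are already available as ordered
lists of lines, which is exactly the part I do not know how to get in Beam;
the function name `zip_with_index` is only illustrative.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Assign a global 0-based index to every line.

    `partitions` is assumed to be a list of lists of lines, given in
    file order (a stand-in for ordered, contiguous partitions).
    """
    # Pass 1: count the lines in each partition.
    counts = [len(part) for part in partitions]

    # Exclusive prefix sum: the offset of a partition is the total
    # number of lines in all partitions before it.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Pass 2: global index = partition offset + local row order.
    return [
        [(line, offset + i) for i, line in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


if __name__ == "__main__":
    for part in zip_with_index([["a", "b"], ["c"], [], ["d", "e"]]):
        print(part)
```

Only the per-partition counts need to be gathered in one place for the
prefix sum; the lines themselves can stay where they are.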

Best,
Chak-Pong

Re: SplittableDoFn for zipWithIndex for a large file

Posted by Scott Wegner <sc...@apache.org>.
I previously responded to your post on user@:
https://lists.apache.org/thread.html/5c10b7edf982ef63d1d1d70545e3fe2716d00628ff5c2a7854383413@%3Cuser.beam.apache.org%3E

I've also mirrored my response on StackOverflow:
https://stackoverflow.com/a/53771980/33791

On Thu, Dec 13, 2018 at 4:21 PM Chak-Pong Chung <cc...@gatech.edu> wrote:

> Hello everyone!
>
> I asked the following question and think I might get some suggestions
> whether what I want is doable or not.
>
>
> https://stackoverflow.com/questions/53746046/how-can-i-implement-zipwithindex-like-spark-in-apache-beam/53747612#53747612
>
> If I can get `PCollection` id and the number of (contiguous)lines in each
> `PCollection`, then I can calculate the row order within each
> partition/`PCollection`  first and then do prefix-sum to compute the offset
> for each partition. This is doable in MPI or openMP since I can get the
> id/rank of each processor/thread.
>
> Best,
> Chak-Pong
>


-- 
Got feedback? tinyurl.com/swegner-feedback