Posted to dev@beam.apache.org by subham agarwal <ag...@gmail.com> on 2020/02/19 18:47:44 UTC

Help needed on a problem statement

Hi Team,

I was working on a problem statement and came across Beam. Being very new
to Beam, I am not sure whether my use case can be solved with it. Can you
please help me here?

Use case:

I have a list of CSV and JSON files arriving every minute in Google Cloud
Storage. The files can range from kilobytes to gigabytes. I need to parse
each file and process its records independently, which means the records of
file 1 should be parsed, enriched, and stored in one output location, while
file 2 goes to a different location.

I started by launching a separate Dataflow job for each file, but that is
overkill for small files. So I thought I could batch the files every 15
minutes and process them together in a single job, but I need to maintain
the per-file processing boundary described above.


Can anyone please tell me whether there is a solution to my problem, or
whether Beam is not meant for this problem statement?

Thanks in advance.

Looking forward to a reply.

Regards,
Subham Agarwal

Re: Help needed on a problem statement

Posted by Luke Cwik <lc...@google.com>.
There is TextIO.readAll [1], which you could use to watch a file pattern
and automatically read the files that appear there.

Note that Apache Beam only has TextIO and no CsvIO or JsonIO.
TextIO does allow skipping a number of header lines, so you could still use
it for CSV files, but you are responsible for parsing each line yourself.
If each JSON record is on a single line, you could also use TextIO and then
use something like Jackson to convert the strings to POJOs.

1:
https://github.com/apache/beam/blob/d6168635d9e9a0ecfac13cdfb675f4239a741d72/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L213
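
For concreteness, here is a rough, untested sketch of that approach using
TextIO.read().watchForNewFiles(...) (a close cousin of readAll) plus Jackson.
The bucket path, poll interval, and MyRecord class are placeholders made up
for illustration; for CSV files the same pattern applies, but you would split
each line yourself inside the DoFn.

import java.io.Serializable;

import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;

public class JsonLinesSketch {

  // Placeholder POJO matching the fields of one JSON record.
  public static class MyRecord implements Serializable {
    public String id;
    public String payload;
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply("ReadJsonLines",
            TextIO.read()
                .from("gs://my-bucket/incoming/*.json")   // assumed location
                .watchForNewFiles(
                    Duration.standardMinutes(1),          // poll for new files
                    Watch.Growth.never()))                // never stop watching
        .apply("ParseWithJackson",
            ParDo.of(new DoFn<String, MyRecord>() {
              private transient ObjectMapper mapper;

              @Setup
              public void setup() {
                mapper = new ObjectMapper();
              }

              @ProcessElement
              public void process(@Element String line, OutputReceiver<MyRecord> out)
                  throws Exception {
                // One JSON object per line; for CSV you would split the line here.
                out.output(mapper.readValue(line, MyRecord.class));
              }
            }))
        .setCoder(SerializableCoder.of(MyRecord.class));

    p.run().waitUntilFinish();
  }
}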

On Wed, Feb 19, 2020 at 1:54 PM Austin Bennett <wh...@gmail.com>
wrote:

> I'd disentangle Dataflow from Beam.  Beam can help you.  Dataflow might be
> useful, though, yes, for batch jobs the spin-up cost can be a lot for
> small files.
>
> There are potentially lots of ways to do this.
>
> An idea (that I haven't seen used anywhere): have a streaming Beam
> pipeline (that can autoscale if needed) running persistently.  Have another
> process that takes each record from the files as they are dropped and puts
> it in a message queue for Beam to process (you'd have both the data
> 'record' as well as metadata about the source file).
>
> * The Dataflow spin-up is heavy: I'm wondering about the suitability of
> running something like this with the Direct runner (or even single-node
> Flink) on Cloud Run (with an event notification from GCS to kick off the
> job): https://cloud.google.com/run/quotas <-- looks like it can handle up
> to 2 GB in memory.  So, if not, have some logic for when to launch
> Dataflow vs. when to run a lighter-weight Beam job.
>
> I have not faced your problem myself; merely making up what might be
> interesting solutions :-)  Good luck!
>
>
>
> On Wed, Feb 19, 2020 at 11:10 AM subham agarwal <
> agarwalsubham1992@gmail.com> wrote:
>
>> Hi Team,
>>
>> I was working on a problem statement and came across Beam. Being very
>> new to Beam, I am not sure whether my use case can be solved with it. Can
>> you please help me here?
>>
>> Use case:
>>
>> I have a list of CSV and JSON files arriving every minute in Google Cloud
>> Storage. The files can range from kilobytes to gigabytes. I need to parse
>> each file and process its records independently, which means the records
>> of file 1 should be parsed, enriched, and stored in one output location,
>> while file 2 goes to a different location.
>>
>> I started by launching a separate Dataflow job for each file, but that is
>> overkill for small files. So I thought I could batch the files every 15
>> minutes and process them together in a single job, but I need to maintain
>> the per-file processing boundary described above.
>>
>>
>> Can anyone please tell me whether there is a solution to my problem, or
>> whether Beam is not meant for this problem statement?
>>
>> Thanks in advance.
>>
>> Looking forward to a reply.
>>
>> Regards,
>> Subham Agarwal
>>
>

Re: Help needed on a problem statement

Posted by Austin Bennett <wh...@gmail.com>.
I'd disentangle Dataflow from Beam.  Beam can help you.  Dataflow might be
useful, though, yes, for batch jobs the spin-up cost can be a lot for
small files.

There are potentially lots of ways to do this.

An idea (that I haven't seen used anywhere): have a streaming Beam
pipeline (that can autoscale if needed) running persistently.  Have another
process that takes each record from the files as they are dropped and puts
it in a message queue for Beam to process (you'd have both the data
'record' as well as metadata about the source file).
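
A rough, untested sketch of that shape follows; the subscription name, the
"sourceFile" attribute, and treating each payload as a UTF-8 record are all
made up for illustration.

import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RecordsFromQueueSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Each Pub/Sub message carries one record in its payload; the process
    // that reads the files sets a "sourceFile" attribute so the per-file
    // boundary survives the trip through the queue.
    PCollection<PubsubMessage> messages =
        p.apply("ReadRecords",
            PubsubIO.readMessagesWithAttributes()
                .fromSubscription("projects/my-project/subscriptions/records-sub"));

    PCollection<KV<String, String>> keyedBySourceFile =
        messages.apply("KeyBySourceFile",
            MapElements
                .into(TypeDescriptors.kvs(
                    TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via((PubsubMessage msg) -> KV.of(
                    msg.getAttribute("sourceFile"),
                    new String(msg.getPayload(), StandardCharsets.UTF_8))));

    // From here each record can be enriched and routed to a per-file output,
    // e.g. with FileIO.writeDynamic() keyed on the source file.

    p.run().waitUntilFinish();
  }
}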

* The Dataflow spin-up is heavy: I'm wondering about the suitability of
running something like this with the Direct runner (or even single-node
Flink) on Cloud Run (with an event notification from GCS to kick off the
job): https://cloud.google.com/run/quotas <-- looks like it can handle up
to 2 GB in memory.  So, if not, have some logic for when to launch
Dataflow vs. when to run a lighter-weight Beam job.
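
A toy sketch of that size-based dispatch, assuming the handler reads the
object size from the GCS event payload and that a 1 GB cutoff is reasonable
(both assumptions, not anything settled here):

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerDispatchSketch {
  // Made-up cutoff; tune to whatever fits in the Cloud Run instance's memory.
  private static final long DATAFLOW_THRESHOLD_BYTES = 1L << 30;

  // objectSizeBytes comes from the "size" field of the GCS event notification.
  public static PipelineOptions optionsFor(long objectSizeBytes) {
    PipelineOptions options = PipelineOptionsFactory.create();
    if (objectSizeBytes >= DATAFLOW_THRESHOLD_BYTES) {
      // Big file: worth paying the Dataflow spin-up cost.
      // (Project, region, and temp location still need to be set.)
      options.setRunner(DataflowRunner.class);
    } else {
      // Small file: run the same pipeline in process.
      options.setRunner(DirectRunner.class);
    }
    return options;
  }
}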

I have not faced your problem myself; merely making up what might be
interesting solutions :-)  Good luck!



On Wed, Feb 19, 2020 at 11:10 AM subham agarwal <ag...@gmail.com>
wrote:

> Hi Team,
>
> I was working on a problem statement and came across Beam. Being very
> new to Beam, I am not sure whether my use case can be solved with it. Can
> you please help me here?
>
> Use case:
>
> I have a list of CSV and JSON files arriving every minute in Google Cloud
> Storage. The files can range from kilobytes to gigabytes. I need to parse
> each file and process its records independently, which means the records
> of file 1 should be parsed, enriched, and stored in one output location,
> while file 2 goes to a different location.
>
> I started by launching a separate Dataflow job for each file, but that is
> overkill for small files. So I thought I could batch the files every 15
> minutes and process them together in a single job, but I need to maintain
> the per-file processing boundary described above.
>
>
> Can anyone please tell me whether there is a solution to my problem, or
> whether Beam is not meant for this problem statement?
>
> Thanks in advance.
>
> Looking forward to a reply.
>
> Regards,
> Subham Agarwal
>
