You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Anant Chaudhary <an...@getrocket.com> on 2018/02/14 02:50:18 UTC

json source for a pipeline

Hello Beam Devs,

We are starting to explore apache beam and google cloud dataflow. Seems
like it can fit some of our data processing use cases pretty well. Some of
my colleagues have worked with Apache Spark in the past, however the
promise of not having to manage the servers has us inclining towards
dataflow right now.

A lot of the raw data that we have sits in S3 buckets as either single JSON
object, or a JSON array of multiple objects. I see on the beam wiki that a
JSON source may be in the works, or at least is being discussed.

https://beam.apache.org/documentation/io/built-in/
https://issues.apache.org/jira/browse/BEAM-1581

I do also see the docs recommend thinking hard before trying to write a new
source. Being a newbie to this world, I might be missing a more
straightforward solution to the problem.

The pipeline I had in mind was  read from s3 source -> convert to json
objects -> (if arrays, then flatMap) -> filter -> groupby -> collect

In the initial step however the textIO source splits the file in to lines
(in trying to speed up the reading I suppose) - happens on files in gs or
local disk.

Is there a way to recombine lines from a 'single file' back in to one
string which can be JSON parsed? Seems like a group operation in the
pipeline, cant see the textIO sending the filename/line numbers to the
downstream transform, which could group the data back.

I can try to hack a custom source for our use case, but thought I'll shoot
you guys a note (wiki says I should :-)

Let me know if you guys have thoughts, and apologize for what might be a
super noob question. After spending a day reading beam wiki, googling and
stackoverflow, I figured might be worth a shot.

Thanks
Anant

Re: json source for a pipeline

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Anant,

did you take a look on the jackson extension:

https://github.com/apache/beam/tree/master/sdks/java/extensions/jackson

Maybe it does what you want (converting JSON as object).

Regards
JB

On 02/14/2018 03:50 AM, Anant Chaudhary wrote:
> Hello Beam Devs,
> 
> We are starting to explore apache beam and google cloud dataflow. Seems like it
> can fit some of our data processing use cases pretty well. Some of my colleagues
> have worked with Apache Spark in the past, however the promise of not having to
> manage the servers has us inclining towards dataflow right now.
> 
> A lot of the raw data that we have sits in S3 buckets as either single JSON
> object, or a JSON array of multiple objects. I see on the beam wiki that a JSON
> source may be in the works, or at least is being discussed.
> 
> https://beam.apache.org/documentation/io/built-in/
> https://issues.apache.org/jira/browse/BEAM-1581
> 
> I do also see the docs recommend thinking hard before trying to write a new
> source. Being a newbie to this world, I might be missing a more straightforward
> solution to the problem.
> 
> The pipeline I had in mind was  read from s3 source -> convert to json objects
> -> (if arrays, then flatMap) -> filter -> groupby -> collect
> 
> In the initial step however the textIO source splits the file in to lines (in
> trying to speed up the reading I suppose) - happens on files in gs or local disk.
> 
> Is there a way to recombine lines from a 'single file' back in to one string
> which can be JSON parsed? Seems like a group operation in the pipeline, cant see
> the textIO sending the filename/line numbers to the downstream transform, which
> could group the data back.
> 
> I can try to hack a custom source for our use case, but thought I'll shoot you
> guys a note (wiki says I should :-)
> 
> Let me know if you guys have thoughts, and apologize for what might be a super
> noob question. After spending a day reading beam wiki, googling and
> stackoverflow, I figured might be worth a shot.
> 
> Thanks
> Anant

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: json source for a pipeline

Posted by Eugene Kirpichov <ki...@google.com>.

Hi,

You can use the general-purpose FileIO. It was designed to support pretty
much anything not explicitly supported by the IOs for concrete file formats
bundled with Beam, eg TextIO and AvroIO.

E.g.:
p.apply(FileIO.match().filepattern("...")).apply(FileIO.readMatches()) will
give you a PCollection<ReadableFile> that you can use to do any file
processing you want using regular Java libraries, e.g. ReadableFile.open()
gives a ReadableByteChannel (use Channels.newInputStream() to convert it to
an InputStream).

On Tue, Feb 13, 2018 at 9:42 PM Anant Chaudhary <an...@getrocket.com> wrote:

> Hello Beam Devs,
>
> We are starting to explore apache beam and google cloud dataflow. Seems
> like it can fit some of our data processing use cases pretty well. Some of
> my colleagues have worked with Apache Spark in the past, however the
> promise of not having to manage the servers has us inclining towards
> dataflow right now.
>
> A lot of the raw data that we have sits in S3 buckets as either single
> JSON object, or a JSON array of multiple objects. I see on the beam wiki
> that a JSON source may be in the works, or at least is being discussed.
>
> https://beam.apache.org/documentation/io/built-in/
> https://issues.apache.org/jira/browse/BEAM-1581
>
> I do also see the docs recommend thinking hard before trying to write a
> new source. Being a newbie to this world, I might be missing a more
> straightforward solution to the problem.
>
> The pipeline I had in mind was  read from s3 source -> convert to json
> objects -> (if arrays, then flatMap) -> filter -> groupby -> collect
>
> In the initial step however the textIO source splits the file in to lines
> (in trying to speed up the reading I suppose) - happens on files in gs or
> local disk.
>
> Is there a way to recombine lines from a 'single file' back in to one
> string which can be JSON parsed? Seems like a group operation in the
> pipeline, cant see the textIO sending the filename/line numbers to the
> downstream transform, which could group the data back.
>
> I can try to hack a custom source for our use case, but thought I'll shoot
> you guys a note (wiki says I should :-)
>
> Let me know if you guys have thoughts, and apologize for what might be a
> super noob question. After spending a day reading beam wiki, googling and
> stackoverflow, I figured might be worth a shot.
>
> Thanks
> Anant
>