Posted to commits@beam.apache.org by "Eugene Kirpichov (JIRA)" <ji...@apache.org> on 2017/04/05 19:04:41 UTC

[jira] [Commented] (BEAM-1581) JSON source and sink

    [ https://issues.apache.org/jira/browse/BEAM-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957450#comment-15957450 ] 

Eugene Kirpichov commented on BEAM-1581:
----------------------------------------

Aviem - all you're saying makes sense in general, but again, to make progress on designing a concrete API for this, we need to start from concrete use cases. It is fine if the use cases are hypothetical, but at the least they need to be very concrete.

What are the use cases where we want to ingest data stored as JSON into a Beam pipeline?
- Is the data stored in one file or in multiple files?
- Within one file, is the data stored as a single JSON object, as many JSON objects embedded in one enclosing JSON object, as a plain sequence of JSON objects delimited e.g. by newlines, or in some other way? Are all of these actually common?
- In what form do we want the JSON objects to be accessible in the pipeline: as an abstract JsonObject (or something like it - and again, it's not clear which JSON library to use here)? Is there an existing business entity class they are intended to map onto? Or are they not intended to map onto any Java class, but we're fine mapping them onto one anyway for the purposes of the pipeline?
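To make the last question concrete, here is a minimal sketch of the two representation choices, using only the Java standard library. The `Person` class and field names are made up for illustration; the generic `Map` stands in for whatever `JsonObject`-like type a chosen JSON library would provide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonRepresentations {

    // Option A: a user-defined business entity the JSON is intended to map onto.
    // Typed and compile-time checked, but requires the user to define the class.
    static class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    // Option B: a generic, schema-less representation (a stand-in for a
    // library-provided JsonObject). Flexible, but untyped at compile time.
    static Map<String, Object> asGeneric(String name, int age) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("name", name);
        m.put("age", age);
        return m;
    }

    public static void main(String[] args) {
        Person typed = new Person("Ada", 36);
        Map<String, Object> generic = asGeneric("Ada", 36);
        System.out.println(typed.name + " " + generic.get("age"));
    }
}
```

Whichever representation the API exposes by default shapes the rest of the design, since it decides whether the source must be parameterized by a user class (and a parsing function) or can return a single library-defined type.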

Same questions apply in reverse to writing data as JSON.

Are there other configuration dimensions I missed? Do we even _want_ to provide an API that handles all of these at the same time? What is the minimal set of building blocks we can provide that people can assemble all they want from them?

What do other data processing frameworks do - e.g. Spark, Flink, Hadoop? Do people like the APIs they provide? E.g. it seems that Spark provides an API that assumes 1 JSON object per line, and I'm not quite sure what assumptions it makes about the format of that object.
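For reference, the "1 JSON object per line" layout means records can be split on newlines before any JSON parsing happens, which is what makes the format easy to shard. A minimal stdlib-only sketch of that splitting step (the per-record JSON parsing itself is elided, since the choice of JSON library is exactly what's open here):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NdjsonSplitter {

    // Split newline-delimited JSON into individual record strings,
    // dropping blank lines. Each resulting string is one JSON object
    // to be handed to whatever JSON parser the API settles on.
    static List<String> splitRecords(String fileContents) {
        return Arrays.stream(fileContents.split("\n"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String file = "{\"id\": 1}\n{\"id\": 2}\n\n{\"id\": 3}\n";
        List<String> records = splitRecords(file);
        System.out.println(records.size()); // 3 records, each one JSON object
    }
}
```

Note the implicit assumption this layout makes: no record may contain a literal newline, i.e. each object must be serialized onto a single line.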

> JSON source and sink
> --------------------
>
>                 Key: BEAM-1581
>                 URL: https://issues.apache.org/jira/browse/BEAM-1581
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Aviem Zur
>            Assignee: Aviem Zur
>
> JSON source and sink to read/write JSON files.
> Similarly to {{XmlSource}}/{{XmlSink}}, these would be a {{JsonSource}}/{{JsonSink}} which are a {{FileBasedSource}}/{{FileBasedSink}}.
> Consider using methods/code (or refactoring these) found in {{AsJsons}} and {{ParseJsons}}.
> The {{PCollection}} of objects the user passes to the transform should be embedded in a valid JSON file.
> The most common pattern for this is a large object with an array member that holds all the data objects, plus other members for metadata.
> Examples of public JSON APIs: https://www.sitepoint.com/10-example-json-files/
> Another pattern used is a file which is simply a JSON array of objects.
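For illustration, the two file layouts described in the issue might look like the following (field names are made up):

```
Envelope pattern - a wrapping object with metadata members and an array member holding the data:
{"updated": "2017-04-05", "count": 2, "items": [{"id": 1}, {"id": 2}]}

Plain-array pattern - the file is simply a JSON array of objects:
[{"id": 1}, {"id": 2}]
```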



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)