Posted to commits@beam.apache.org by "Luke Cwik (JIRA)" <ji...@apache.org> on 2017/08/30 15:35:00 UTC

[jira] [Comment Edited] (BEAM-2826) Need to generate a single XML file when write is performed on small amount of data

    [ https://issues.apache.org/jira/browse/BEAM-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147446#comment-16147446 ] 

Luke Cwik edited comment on BEAM-2826 at 8/30/17 3:34 PM:
----------------------------------------------------------

What about doing this in your pipeline:
{code}
PC<V> -> DoFn(AssignVoidKey) -> PC<KV<Void, V>> -> GroupByKey -> PC<KV<Void, Iterable<V>>> -> DoFn(Format Iterable<V> as XML string) -> PC<String> -> TextIO.write().withNumShards(1).withSuffix(".xml");
{code}
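
The formatting step in that chain (turning the grouped Iterable<V> into one XML string) could look roughly like the sketch below. It is plain Java with no Beam dependency, so it only illustrates the body a DoFn might call. The class name SingleXmlFormatter, the element names "records" and "record", and the choice of String for V are all assumptions for illustration and are not taken from the issue.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: formats all grouped values as one XML document string.
public class SingleXmlFormatter {

    // Escape the characters that are special in XML text content.
    // "&" is replaced first so already-escaped entities are not double-escaped.
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Wrap every value in a <record> element under a single <records> root.
    // The element names are placeholders, not part of any Beam API.
    public static String toXml(Iterable<String> values) {
        StringBuilder sb = new StringBuilder();
        sb.append("<records>\n");
        for (String value : values) {
            sb.append("  <record>").append(escape(value)).append("</record>\n");
        }
        sb.append("</records>");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the Iterable<V> that GroupByKey would hand to the DoFn.
        List<String> grouped = Arrays.asList("alpha", "beta & gamma");
        System.out.println(toXml(grouped));
    }
}
```

Because every element is grouped under one key, the whole Iterable is handled in a single call, so this approach only suits data small enough for one worker to hold, which matches the "small amount of data" case in the issue title.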


was (Author: lcwik):
What about doing this in your pipeline:
```
PC<V> -> DoFn(AssignVoidKey) -> PC<KV<Void, V>> -> GroupByKey -> PC<KV<Void, Iterable<V>>> -> DoFn(Format Iterable<V> as XML string) -> PC<String> -> TextIO.write().withNumShards(1).withSuffix(".xml");
```

> Need to generate a single XML file when write is performed on small amount of data
> ----------------------------------------------------------------------------------
>
>                 Key: BEAM-2826
>                 URL: https://issues.apache.org/jira/browse/BEAM-2826
>             Project: Beam
>          Issue Type: New Feature
>          Components: beam-model
>    Affects Versions: 2.0.0
>            Reporter: Balajee Venkatesh
>            Assignee: Kenneth Knowles
>
> I'm trying to write an XML file where the source is a text file stored in GCS. The code runs fine, but instead of a single XML file it generates multiple XML files. (The number of XML files seems to match the total number of records in the source text file.) I observed this while using 'DataflowRunner'.
> When I run the same code locally, two files are generated. The first contains all the records with the proper elements, and the second contains only the opening and closing root element.
> As I understand it, multiple files are expected: e.g. if the runner chooses to parallelize processing of your data into 3 tasks ("bundles"), you'll get 3 files. Some of the parts may turn out empty in some cases, but the total data written will always add up to the expected data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)