Posted to user@flink.apache.org by Fabian Hueske <fh...@apache.org> on 2014/11/03 11:52:06 UTC

Re: WriteAsText bug or bad name?

Hi Flavio,

any updates on this bug?

Thanks, Fabian

2014-10-29 22:36 GMT+01:00 Fabian Hueske <fh...@apache.org>:

> Regarding the text vs. sequence output.
> writeAsText() emits each record using its toString() method, which should
> be the String itself in your case.
>
> So if it writes binary data, something is wrong...
>
>
> 2014-10-29 22:34 GMT+01:00 Fabian Hueske <fh...@apache.org>:
>
>> You can set the degree of parallelism (DOP) of the data sink to 1 [1].
>> There is also a config parameter that controls whether a directory is
>> created when DOP=1. If I remember correctly, the default is to NOT create
>> a folder for DOP=1.
>>
>> [1]
>> http://flink.incubator.apache.org/docs/0.7-incubating/programming_guide.html#parallel-execution
>>
>> Best, Fabian
>>
>> 2014-10-29 22:22 GMT+01:00 Flavio Pompermaier <po...@okkam.it>:
>>
>>> Would it be that difficult to change the behaviour for file:/// and
>>> create a single file? Or is there a way to do that?
>>> On Oct 29, 2014 9:52 PM, "Márton Balassi" <ba...@gmail.com>
>>> wrote:
>>>
>>>> Dear Flavio,
>>>>
>>>> Yes, the writeAsText() method really creates a folder which contains a
>>>> file for each execution thread, so your threads do not block each other and
>>>> the execution can use multiple cores on your machine. You can see similar
>>>> results if you try it with env.execute() from an IDE.
>>>>
>>>> There are filesystems, HDFS being the most prominent one, which can
>>>> transparently treat such a folder structure as a single file, and then
>>>> it would behave as you expect. I hope this answers your question.
>>>>
>>>> Best,
>>>>
>>>> Marton
>>>>
>>>> On Wed, Oct 29, 2014 at 8:31 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>> running the example at
>>>>> http://flink.incubator.apache.org/docs/0.7-incubating/local_execution.html
>>>>> I expected writeAsText() on a local path to create a text file on my
>>>>> local filesystem. Instead, it creates something similar to a sequence
>>>>> file (within a folder).
>>>>> I find this misleading: either the API name is wrong or this is a bug
>>>>> (IMHO).
>>>>> By the way, how can I modify the following program to write its results
>>>>> to a single text file on my local filesystem?
>>>>>
>>>>> public static void main(String[] args) throws Exception {
>>>>>   ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
>>>>>   DataSet<String> data = env.readTextFile("file:///tmp/res.txt");
>>>>>   data.filter(new FilterFunction<String>() {
>>>>>     public boolean filter(String value) {
>>>>>       return value.startsWith("http://");
>>>>>     }
>>>>>   }).writeAsText("file:///tmp/res.txt");
>>>>>   env.execute();
>>>>> }
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>>
>>>>>
>>>>
>>
>
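
For the single-file question raised in this thread, one workaround that does
not depend on Flink at all is to concatenate the per-thread files once the job
has finished. The following is only a minimal plain-Java sketch under some
assumptions: the class name PartFileMerger and all paths are hypothetical, the
output folder is assumed to contain nothing but the part files, and every part
file is assumed to end with a newline.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of any Flink API.
class PartFileMerger {

    // Concatenates all regular files directly inside outputDir into merged.
    // Files are taken in lexicographic name order, so "10" sorts before "2".
    // merged must not be located inside outputDir.
    static void merge(Path outputDir, Path merged) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(outputDir)) {
            parts = files
                    .filter(p -> Files.isRegularFile(p))
                    .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                    .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path part : parts) {
                Files.copy(part, out); // appends the raw bytes of this part
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate an output folder with two part files.
        Path dir = Files.createTempDirectory("res");
        Files.write(dir.resolve("1"), "http://a\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("2"), "http://b\n".getBytes(StandardCharsets.UTF_8));

        Path merged = Files.createTempFile("res", ".txt");
        merge(dir, merged);
        System.out.print(new String(Files.readAllBytes(merged), StandardCharsets.UTF_8));
    }
}
```

The order of records across part files carries no meaning for a parallel job,
so the merge order here is arbitrary but deterministic.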

Re: WriteAsText bug or bad name?

Posted by Flavio Pompermaier <po...@okkam.it>.
That is not a big problem, it should just be well documented :)

On Mon, Nov 3, 2014 at 12:09 PM, Stephan Ewen <se...@apache.org> wrote:

> Hey!
>
> Parallel outputs require multiple output files.
>
> The only way to make this a single file by default is to set the default
> parallelism of file outputs to 1. That would cause many surprises on
> cluster execution, actually.
>
> It may be a fair compromise to set the default parallelism of sinks to 1
> if the execution environment is the local environment.
>
> Stephan
>
>
> On Mon, Nov 3, 2014 at 12:06 PM, Fabian Hueske <fh...@apache.org> wrote:
>
>> OK, I assume the problem of creating multiple files (+ output directory)
>> is fixed by setting the DOP of the OutputFormat to 1, right?
>>
>> But you still get binary output with a TextOutputFormat that writes a
>> DataSet<String>?
>>
>> 2014-11-03 11:58 GMT+01:00 Flavio Pompermaier <po...@okkam.it>:
>>
>>> Nope. This is actually a bug for me, I don't know what the FLINK
>>> community or committee think

Re: WriteAsText bug or bad name?

Posted by Stephan Ewen <se...@apache.org>.
Hey!

Parallel outputs require multiple output files.

The only way to make this a single file by default is to set the default
parallelism of file outputs to 1. That would cause many surprises on
cluster execution, actually.

It may be a fair compromise to set the default parallelism of sinks to 1 if
the execution environment is the local environment.

Stephan
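
To illustrate why parallel outputs end up as multiple files, here is a small
plain-Java sketch. It is not Flink code and not how Flink implements its
sinks: the class ParallelSinkSketch, the round-robin split, and the file names
"1", "2", ... are assumptions made for the illustration. With a parallelism of
1 the target is a single file; with more writers the target becomes a folder
holding one file per writer, so no two threads ever share a file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; names and layout are hypothetical.
class ParallelSinkSketch {

    // parallelism == 1: target becomes a single text file.
    // parallelism  > 1: target becomes a directory with one file per writer.
    static void write(List<String> records, Path target, int parallelism)
            throws IOException, InterruptedException {
        if (parallelism == 1) {
            Files.write(target, records, StandardCharsets.UTF_8);
            return;
        }
        Files.createDirectories(target);
        List<Thread> writers = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            final int index = i;
            Thread writer = new Thread(() -> {
                // Each writer takes every parallelism-th record (round robin).
                List<String> slice = new ArrayList<>();
                for (int j = index; j < records.size(); j += parallelism) {
                    slice.add(records.get(j));
                }
                try {
                    // Own file per writer, so no locking between threads is needed.
                    Files.write(target.resolve(String.valueOf(index + 1)),
                            slice, StandardCharsets.UTF_8);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            writers.add(writer);
            writer.start();
        }
        for (Thread writer : writers) {
            writer.join(); // wait until every part file is complete
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("sink-sketch");
        List<String> records = Arrays.asList("http://a", "http://b", "http://c", "http://d");
        write(records, dir.resolve("single.txt"), 1);
        write(records, dir.resolve("parallel"), 4);
        System.out.println("single file: " + Files.isRegularFile(dir.resolve("single.txt")));
        System.out.println("folder:      " + Files.isDirectory(dir.resolve("parallel")));
    }
}
```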


On Mon, Nov 3, 2014 at 12:06 PM, Fabian Hueske <fh...@apache.org> wrote:

> OK, I assume the problem of creating multiple files (+ output directory)
> is fixed by setting the DOP of the OutputFormat to 1, right?
>
> But you still get binary output with a TextOutputFormat that writes a
> DataSet<String>?
>
> 2014-11-03 11:58 GMT+01:00 Flavio Pompermaier <po...@okkam.it>:
>
>> Nope. This is actually a bug for me, I don't know what the FLINK
>> community or committee think

Re: WriteAsText bug or bad name?

Posted by Fabian Hueske <fh...@apache.org>.
OK, I assume the problem of creating multiple files (+ output directory) is
fixed by setting the DOP of the OutputFormat to 1, right?

But you still get binary output with a TextOutputFormat that writes a
DataSet<String>?
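
One way to answer the binary-or-text question without guessing is to look at
the first bytes of a written file. The sketch below is plain Java and makes a
few assumptions: the class OutputInspector and its method names are
hypothetical, the "SEQ" check relies on Hadoop SequenceFiles starting with the
bytes 'S', 'E', 'Q', and the absence of NUL bytes is only a rough hint that a
file is text, not a proof.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical diagnostic helper, not part of Flink or Hadoop.
class OutputInspector {

    // Reads at most max bytes from the start of the file.
    static byte[] readPrefix(Path file, int max) throws IOException {
        byte[] buffer = new byte[max];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < max && (n = in.read(buffer, total, max - total)) != -1) {
                total += n;
            }
        }
        return Arrays.copyOf(buffer, total);
    }

    // True if the file starts with the SequenceFile magic bytes "SEQ".
    // A text file whose first line begins with "SEQ" would also match.
    static boolean looksLikeSequenceFile(Path file) throws IOException {
        byte[] p = readPrefix(file, 3);
        return p.length == 3 && p[0] == 'S' && p[1] == 'E' && p[2] == 'Q';
    }

    // Heuristic: no NUL byte within the first 4096 bytes.
    static boolean looksLikeText(Path file) throws IOException {
        for (byte b : readPrefix(file, 4096)) {
            if (b == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, "http://example.org\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("sequence file? " + looksLikeSequenceFile(sample));
        System.out.println("text?          " + looksLikeText(sample));
    }
}
```

Running such a check on each file inside the output folder would show whether
the content is really binary or whether only the folder layout is unexpected.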

2014-11-03 11:58 GMT+01:00 Flavio Pompermaier <po...@okkam.it>:

> Nope. This is actually a bug for me, I don't know what the FLINK community
> or committee think

Re: WriteAsText bug or bad name?

Posted by Flavio Pompermaier <po...@okkam.it>.
Nope. This is actually a bug for me, I don't know what the FLINK community
or committee think
