Posted to user@flink.apache.org by Fabian Hueske <fh...@apache.org> on 2014/11/03 11:52:06 UTC
Re: WriteAsText bug or bad name?
Hi Flavio,
any updates on this bug?
Thanks, Fabian
2014-10-29 22:36 GMT+01:00 Fabian Hueske <fh...@apache.org>:
> Regarding the text vs. sequence output.
> writeAsText() emits each record using its toString() method, which should
> be the String itself in your case.
>
> So if it would write binary data, something is wrong...
>
>
> 2014-10-29 22:34 GMT+01:00 Fabian Hueske <fh...@apache.org>:
>
>> You can set the DOP of the data sink to 1 [1].
>> There is also a config parameter whether to create a directory or not in
>> case of DOP=1. If I remember correctly, the default is to NOT create
>> a folder for DOP=1.
>>
>> [1]
>> http://flink.incubator.apache.org/docs/0.7-incubating/programming_guide.html#parallel-execution
>>
>> Best, Fabian
>>
>> 2014-10-29 22:22 GMT+01:00 Flavio Pompermaier <po...@okkam.it>:
>>
>>> Would it be that difficult to change the behaviour for file:/// and
>>> create a single file? Or is there a way to do that?
>>> On Oct 29, 2014 9:52 PM, "Márton Balassi" <ba...@gmail.com>
>>> wrote:
>>>
>>>> Dear Flavio,
>>>>
>>>> Yes, the writeAsText() method really creates a folder which contains a
>>>> file for each execution thread, so your threads do not block each other and
>>>> the execution can use multiple cores on your machine. You can see similar
>>>> results if you try it with env.execute() from an IDE.
>>>>
>>>> There are filesystems, HDFS being the most prominent one, which can
>>>> transparently treat such a folder structure as a single file, and then it
>>>> would behave as you expect. I hope this answers your question.
>>>>
>>>> Best,
>>>>
>>>> Marton
>>>>
>>>> On Wed, Oct 29, 2014 at 8:31 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>> running the example at
>>>>> http://flink.incubator.apache.org/docs/0.7-incubating/local_execution.html
>>>>> I was thinking that the writeAsText on a local file was creating a text
>>>>> file on my local filesystem..instead it creates something similar to a
>>>>> sequence file (within a folder).
>>>>> This is something misleading I think...or the API name is wrong or
>>>>> this is a bug (IMHO).
>>>>> Btw..how can I modify the following program to write results in a
>>>>> single text file on my local filesystem?
>>>>>
>>>>> public static void main(String[] args) throws Exception {
>>>>>     ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
>>>>>     DataSet<String> data = env.readTextFile("file:///tmp/res.txt");
>>>>>     data.filter(new FilterFunction<String>() {
>>>>>         public boolean filter(String value) {
>>>>>             return value.startsWith("http://");
>>>>>         }
>>>>>     }).writeAsText("file:///tmp/res.txt");
>>>>>     env.execute();
>>>>> }
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>>
>>>>>
>>>>
>>
>
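As a rough illustration of the toString() point quoted above, here is a minimal sketch in plain Java. It does not use Flink's classes and is not Flink's implementation; `ToStringTextSink` and `writeAsTextLines` are hypothetical names chosen for this example. It only shows that writing each record via toString(), one per line, yields ordinary readable text when the records are Strings.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a text sink; not part of Flink.
public class ToStringTextSink {

    // Writes one line per record, using the record's toString() representation.
    static <T> void writeAsTextLines(List<T> records, Path target) throws IOException {
        List<String> lines = new ArrayList<>();
        for (T record : records) {
            // For a String record this is the String itself.
            lines.add(String.valueOf(record));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("res", ".txt");
        writeAsTextLines(Arrays.asList("http://a.example", "http://b.example"), out);
        // Read the file back to confirm it is plain text, one record per line.
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

If a file written this way looks binary, the likely suspects are the record type's toString() or the tool used to view the output, rather than the line-writing step itself.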
Re: WriteAsText bug or bad name?
Posted by Flavio Pompermaier <po...@okkam.it>.
That is not a big problem, it should just be well documented :)
On Mon, Nov 3, 2014 at 12:09 PM, Stephan Ewen <se...@apache.org> wrote:
> Hey!
>
> Parallel outputs require multiple output files.
>
> The only way to make this a single file by default is to set the default
> parallelism of file outputs to 1. That would cause many surprises on
> cluster execution, actually.
>
> It may be a fair compromise to set the default parallelism of sinks to 1
> if the execution environment is the local environment.
>
> Stephan
Re: WriteAsText bug or bad name?
Posted by Stephan Ewen <se...@apache.org>.
Hey!
Parallel outputs require multiple output files.
The only way to make this a single file by default is to set the default
parallelism of file outputs to 1. That would cause many surprises on
cluster execution, actually.
It may be a fair compromise to set the default parallelism of sinks to 1 if
the execution environment is the local environment.
Stephan
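
The relationship between sink parallelism and the number of output files can be sketched in plain Java. This is only an illustration under stated assumptions, not Flink code: `ParallelTextSink`, its `write` method, the round-robin split of records, and the part-file names "1", "2", ... are all hypothetical. With parallelism 1 the target path becomes a single file; with a higher parallelism it becomes a directory with one file per writer, so the writers never contend for the same file.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a parallel file sink; not part of Flink.
public class ParallelTextSink {

    // parallelism == 1: write all records to a single file at `target`.
    // parallelism  > 1: create `target` as a directory with one part file per writer.
    static void write(List<String> records, Path target, int parallelism) throws Exception {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        if (parallelism == 1) {
            Files.write(target, records, StandardCharsets.UTF_8);
            return;
        }
        Files.createDirectories(target);
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < parallelism; i++) {
                final int index = i;
                pending.add(pool.submit(() -> {
                    // Each writer takes every parallelism-th record (round-robin split).
                    List<String> part = new ArrayList<>();
                    for (int j = index; j < records.size(); j += parallelism) {
                        part.add(records.get(j));
                    }
                    // Part files are named "1", "2", ... (naming is an assumption of this sketch).
                    Files.write(target.resolve(String.valueOf(index + 1)), part, StandardCharsets.UTF_8);
                    return null;
                }));
            }
            // Wait for all writers and surface any I/O failure.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("sink-demo");
        List<String> records = Arrays.asList("http://a", "http://b", "http://c", "http://d");
        write(records, dir.resolve("single.txt"), 1);
        write(records, dir.resolve("parallel"), 2);
        System.out.println("single.txt is a file: " + Files.isRegularFile(dir.resolve("single.txt")));
        System.out.println("parallel is a directory: " + Files.isDirectory(dir.resolve("parallel")));
    }
}
```

The sketch also shows why a single-file default is awkward for parallel execution: merging into one file would force the writers to coordinate or run one after another.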
On Mon, Nov 3, 2014 at 12:06 PM, Fabian Hueske <fh...@apache.org> wrote:
> OK, I assume the problem of creating multiple files (+ output directory)
> is fixed by setting the DOP of the OutputFormat to 1, right?
>
> But you still get binary output with a TextOutputFormat that writes a
> DataSet<String>?
Re: WriteAsText bug or bad name?
Posted by Fabian Hueske <fh...@apache.org>.
OK, I assume the problem of creating multiple files (+ output directory) is
fixed by setting the DOP of the OutputFormat to 1, right?
But you still get binary output with a TextOutputFormat that writes a
DataSet<String>?
2014-11-03 11:58 GMT+01:00 Flavio Pompermaier <po...@okkam.it>:
> Nope. This is actually a bug for me, I don't know what the FLINK community
> or committee think
Re: WriteAsText bug or bad name?
Posted by Flavio Pompermaier <po...@okkam.it>.
Nope. This is actually a bug for me, I don't know what the FLINK community
or committee think
On Mon, Nov 3, 2014 at 11:52 AM, Fabian Hueske <fh...@apache.org> wrote:
> Hi Flavio,
>
> any updates on this bug?
>
> Thanks, Fabian