Posted to user@spark.apache.org by Paweł Szulc <pa...@gmail.com> on 2014/12/11 12:50:36 UTC

Can a Spark job have side effects (write files to FileSystem)

Imagine a simple Spark job that stores each line of an RDD in a separate
file:


import java.io.{File, FileOutputStream}
import java.net.URI

val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")
lines.foreach(line => writeToFile(line))

def writeToFile(line: String): Unit = {
    val filePath = "file://..."
    val file = new File(new URI(filePath).getPath)
    // the using function simply closes the output stream
    using(new FileOutputStream(file)) { output =>
      output.write(line.getBytes)
    }
}


Now, the example above works 99.9% of the time: a file is generated for each
line, and each file contains that particular line.

However, when dealing with large amounts of data, we encounter situations
where some of the files are empty! The files are created, but there is no
content inside them (0 bytes).

Now the question is: can a Spark job have side effects? Is it even legal to
write such code?
If not, what other choice do we have when we want to save data from our
RDD?
If so, do you see any reason why this job would act in this strange manner
0.1% of the time?


Disclaimer: we are fully aware of the .saveAsTextFile method in the API;
however, the example above is a simplification of our code - normally we
produce PDF files.


Best regards,
Paweł Szulc

Re: Can a Spark job have side effects (write files to FileSystem)

Posted by Davies Liu <da...@databricks.com>.
Keep in mind that any task could be launched concurrently on different
nodes (for example by speculative execution, or when a failed task is
retried), so in order to make sure the generated files are valid, you need
some atomic operation (such as rename). For example, you could generate a
random name for the output file, write the data into it, and finally rename
it to the target name. This is what happens in saveAsTextFile().
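
Something like this (an untested sketch; the helper name and paths are made
up):

import java.nio.file.{Files, Paths, StandardCopyOption}

def writeAtomically(target: String, bytes: Array[Byte]): Unit = {
  val targetPath = Paths.get(target)
  // Write into a temporary file in the same directory, so that the
  // final rename stays within one filesystem and can be atomic.
  val tmp = Files.createTempFile(targetPath.getParent, "part-", ".tmp")
  Files.write(tmp, bytes)
  // Readers see either the complete file under its final name or no
  // file at all, never a partially written one.
  Files.move(tmp, targetPath, StandardCopyOption.ATOMIC_MOVE)
}

If two attempts of the same task both finish, the later rename simply
replaces the file with identical content (on POSIX filesystems rename
overwrites the target), so readers never observe a partial file.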

Re: Can a Spark job have side effects (write files to FileSystem)

Posted by Paweł Szulc <pa...@gmail.com>.
Yes, this is what I also found in the Spark documentation: foreach can have
side effects. Nevertheless, I have this weird problem where sometimes the
files are just empty.

"using" is simply a wrapper that takes our code, makes try-catch-finally
and flush & close all resources.
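
Roughly, it looks like this (a simplified sketch, without the catch and
flush parts):

def using[A <: java.io.Closeable, B](resource: A)(body: A => B): B =
  try {
    body(resource)
  } finally {
    resource.close() // always release the resource, even if body throws
  }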

I honestly have no clue what can possibly be wrong.

No errors in logs.

Re: Can a Spark job have side effects (write files to FileSystem)

Posted by Daniel Darabos <da...@lynxanalytics.com>.
Yes, this is perfectly "legal". This is what RDD.foreach() is for! You may
be encountering an IO exception while writing, and maybe using() suppresses
it. I'd try writing the files with java.nio.file.Files.write() -- I'd
expect there is less that can go wrong with that simple call.
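
For example (an untested sketch; the output path is made up and the
directory is assumed to exist):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

lines.foreach { line =>
  // One call opens, writes and closes the file; any IOException
  // propagates and fails the task instead of being silently swallowed.
  Files.write(Paths.get(s"/tmp/lines/line-${line.hashCode}.txt"),
    line.getBytes(StandardCharsets.UTF_8))
}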
