Posted to user@spark.apache.org by Jim Blomo <ji...@gmail.com> on 2014/04/25 03:15:04 UTC

Finding bad data

I'm using PySpark to load some data and getting an error while
parsing it.  Is it possible to find the source file and line of the bad
data?  I imagine that this would be extremely tricky when dealing with
multiple derived RDDs, so an answer with the caveat of "this only
works when running .map() on a textFile() RDD" is totally fine.
Perhaps if the line number and file were available in PySpark, I could
catch the exception and output it with the context?

Any way to narrow down the problem input would be great. Thanks!

Re: Finding bad data

Posted by Matei Zaharia <ma...@gmail.com>.
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it. Look at the stderr file of the executor on that machine, and you’ll see lines like this:

14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000

This says which file the task was reading, as well as the byte range within it (the 0+2000 part means starting at offset 0 and reading 2000 bytes). Unfortunately, because the executor runs multiple tasks at the same time, this message can be hard to associate with a particular task unless you configure only one core per executor. But it may help you spot the file.
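For example, once you have the path and byte range from that log line, a quick local script can pull out just those bytes for inspection. (A hedged sketch, not from the original thread; read_split is a made-up helper, and the path and offsets are the ones from the example log line above.)

    # Read the byte range reported in the "Input split" log line,
    # which has the form path:offset+length.
    def read_split(path, offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    # The split above was README.md:0+2000, i.e. offset 0, length 2000.
    for line in read_split("/Users/matei/workspace/apache-spark/README.md",
                           0, 2000).splitlines():
        print(line)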

The other way you might do it is to run a map() over the data, before you process it, that checks for error conditions. There you could print out the original input line whenever parsing fails.
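A minimal sketch of that approach, assuming parse_record stands in for your own parsing function (the name is invented here, as is the input path); the printed message should show up in the executor’s stdout log near the task’s stack trace:

    def safe_parse(line):
        try:
            return parse_record(line)  # your parser goes here
        except Exception as e:
            # Surface the offending input before re-raising, so the
            # executor log shows exactly which line failed to parse.
            print("bad input line: %r (%s)" % (line, e))
            raise

    parsed = sc.textFile("input.txt").map(safe_parse)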

I realize that neither of these is ideal. I’ve opened https://issues.apache.org/jira/browse/SPARK-1622 to try to expose this information somewhere else, ideally in the UI. The reason it hasn’t been done so far is that some tasks in Spark can read from multiple Hadoop InputSplits (e.g. if you use coalesce(), zip(), or similar), so it’s tough to do in a fully general setting.

Matei

On Apr 24, 2014, at 6:15 PM, Jim Blomo <ji...@gmail.com> wrote:

> I'm using PySpark to load some data and getting an error while
> parsing it.  Is it possible to find the source file and line of the bad
> data?  I imagine that this would be extremely tricky when dealing with
> multiple derived RDDs, so an answer with the caveat of "this only
> works when running .map() on a textFile() RDD" is totally fine.
> Perhaps if the line number and file were available in PySpark, I could
> catch the exception and output it with the context?
> 
> Any way to narrow down the problem input would be great. Thanks!