Posted to user@spark.apache.org by awzurn <aw...@gmail.com> on 2016/01/20 16:35:40 UTC

Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

Hello,

I'm doing some work on Amazon's EMR cluster, and I'm noticing some peculiar
results both when using DataFrames to procure and operate on data, and when
using Spark SQL within Zeppelin to run graphs/reports. Specifically, when
using either of these on EMR running Spark 1.5.2, the first 8 characters of
a String are truncated. You can see a sample of this in the attached images.

On the left is Spark running locally on my Mac, printing results from a
dataframe on a test set of data. On the right, running the same operations
on the same set of data on EMR.
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26022/CYo2WDgWEAMLul_.png> 

Similar results appear when running Spark SQL with the %sql tag in Zeppelin
for graphing.
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26022/sql_spark_text_issue.png> 

Additionally, when I transform the DataFrame back to an RDD, the results are
shown as expected (on Amazon EMR).
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26022/dt_to_rdd_print.png> 

I'm rather certain that this is not the intended behavior, especially
considering that the DataFrame prints the full results on my local machine
running the same version of Spark.

Is there a setting somewhere that might be causing this issue with
DataFrames and Spark SQL?
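
For reference, a minimal spark-shell check that shows the mismatch (the
table and column names here are placeholders, not the actual dataset) looks
like:

```scala
// spark-shell sketch (Spark 1.5.x): compare the DataFrame view of a string
// column with the underlying RDD view of the same rows.
// "test_data" and "education" are placeholder names.
val df = sqlContext.table("test_data")

// DataFrame path: on EMR this shows strings missing their first 8 chars.
df.select("education").show(10)

// RDD path: pull the same values out as plain strings with their lengths.
df.select("education").rdd
  .map(_.getString(0))
  .take(10)
  .foreach(s => println(s"'$s' (length=${s.length})"))
```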

Thanks,

Andrew Zurn

*Specs for EMR*
Release label: emr-4.2.0
Hadoop distribution: Amazon 2.6.0
Applications: Hive 1.0.0, Pig 0.14.0, Hue 3.7.1, Spark 1.5.2, Ganglia 3.6.0,
Mahout 0.11.0, Oozie-Sandbox 4.2.0, Presto-Sandbox 0.125, Zeppelin-Sandbox
0.5.5

Master: Running 1 c3.4xlarge
Core: Running 10 r3.4xlarge

*Additional Configurations*
spark.executor.cores	5
spark.dynamicAllocation.enabled	true
spark.serializer	org.apache.spark.serializer.KryoSerializer
spark.executor.memory	34G






--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-Spark-SQL-Drops-First-8-Characters-of-String-on-Amazon-EMR-tp26022.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

Posted by Daniel Darabos <da...@lynxanalytics.com>.
Hi Andrew,

If you still see this with Spark 1.6.0, it would be very helpful if you
could file a bug about it at https://issues.apache.org/jira/browse/SPARK with
as much detail as you can. This issue could be a nasty source of silent
data corruption in a case where some intermediate data loses 8 characters
but it is not obvious in the final output. Thanks!


Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

Posted by Jonathan Kelly <jo...@gmail.com>.
Just FYI, Spark 1.6 was released on emr-4.3.0 a couple days ago:
https://aws.amazon.com/blogs/aws/emr-4-3-0-new-updated-applications-command-line-export/

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

Posted by Andrew Zurn <aw...@gmail.com>.
Hey Daniel,

Thanks for the response.

After playing around for a bit, it looks like it's probably something
similar to the first situation you mentioned, with the Parquet format
causing issues. Both a programmatically created dataset and a dataset pulled
off the internet (rather than out of S3 and put into HDFS/Hive) behaved with
DataFrames as one would expect (printed out everything, grouped properly,
etc.).

It looks like there is more than likely an outstanding bug that causes
issues with data that comes from S3 and is converted to the Parquet format
(I found an article highlighting that it was around in 1.4, and I guess it
wouldn't be out of the realm of things for it still to exist). Link to the
article:
https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
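
A rough way to confirm the Parquet theory (the paths here are placeholders)
would be to round-trip the same data through another format and compare:

```scala
// spark-shell sketch: re-save the suspect Parquet data as JSON and
// re-read it, to see whether the truncation follows the Parquet path.
// Both paths are hypothetical placeholders.
val parquetDf = sqlContext.read.parquet("hdfs:///data/test_data.parquet")
parquetDf.write.json("hdfs:///data/test_data_json")

// If the JSON round-trip prints intact strings while the Parquet source
// does not, the problem sits in the Parquet read path.
val jsonDf = sqlContext.read.json("hdfs:///data/test_data_json")
jsonDf.select("education").show(10)
```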

Hopefully a little more stability will come out with the upcoming Spark 1.6
release on EMR (I think that is happening sometime soon).

>> Thanks again for the advice on where to dig further. Much appreciated.

Andrew


Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

Posted by Daniel Darabos <da...@lynxanalytics.com>.
Have you tried setting spark.emr.dropCharacters to a lower value? (It
defaults to 8.)

:) Just joking, sorry! Fantastic bug.

What data source do you have for this DataFrame? I could imagine, for
example, that it's a Parquet file and on EMR you are running with a wrong
version of the Parquet library that messes up strings. It should be easy
enough to try a different data format. You could also try what happens if
you just create the DataFrame programmatically, e.g.
sc.parallelize(Seq("asdfasdfasdf")).toDF.

To understand better at which point the characters are lost you could try
grouping by a string attribute. I see "education" ends up either as ""
(empty string) or "y" in the printed output. But are the characters already
lost when you try grouping by the attribute? Will there be a single ""
category, or will you have separate categories for "primary" and "tertiary"?

I think the correct output through the RDD suggests that the issue happens
at the very end. So it will probably happen also with different data
sources, and grouping will create separate groups for "primary" and
"tertiary" even though they are printed as the same string at the end. You
should also check the data from "take(10)" to rule out any issues with
printing. You could try the same "groupBy" trick after "take(10)". Or you
could print the lengths of the strings.

Good luck!
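
Sketched as a spark-shell session, those checks might look like this (the
column name "education" comes from the screenshots; the table name is a
placeholder):

```scala
// 1) Programmatic DataFrame: rules out the data source entirely.
//    Does a 12-character literal survive intact?
import sqlContext.implicits._
val probe = sc.parallelize(Seq("asdfasdfasdf")).toDF("s")
probe.show()

// 2) Group by the string column: separate counts for "primary" and
//    "tertiary" would mean the data is intact and only display is broken.
val df = sqlContext.table("test_data")
df.groupBy("education").count().show()

// 3) Bypass show() entirely and print raw values with their lengths.
df.select("education").take(10).foreach { row =>
  val s = row.getString(0)
  println(s"'$s' (length=${s.length})")
}
```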


Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

Posted by awzurn <aw...@gmail.com>.
Sorry for the bump, but wondering if anyone else has seen this before. We're
hoping to either resolve this soon, or move on with further steps to move
this into an issue.

Thanks in advance,

Andrew Zurn


