Posted to user@spark.apache.org by Sergey Zhemzhitsky <sz...@gmail.com> on 2018/03/28 18:25:21 UTC

DataFrames :: Corrupted Data

Hello guys,

I'm using Spark 2.2.0, and from time to time my job fails, printing the
following errors into the log:

scala.MatchError:
profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)

The job itself looks like the following; it contains a few shuffles and UDAFs:

val df = spark.read.avro(...).as[...]
      .groupBy(...)
      .agg(collect_list(...).as(...))
      .select(explode(...).as(...))
      .groupBy(...)
      .agg(sum(...).as(...))
      .groupBy(...)
      .agg(collectMetrics(...).as(...))
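Purely for illustration, a runnable sketch of such a pipeline could look
like this. The column names (userId, key, value), the input path, and the
UDAF stand-in are all hypothetical; the real ones are elided above. It
assumes the com.databricks:spark-avro package and a spark-shell-style
SparkSession named spark:

import com.databricks.spark.avro._
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.functions._
import spark.implicits._

// Stand-in for the custom UDAF; its real definition is elided.
val collectMetrics: UserDefinedAggregateFunction = ???

val df = spark.read.avro("/data/profiles")            // hypothetical path
  .groupBy($"userId")
  .agg(collect_list(struct($"key", $"value")).as("pairs"))
  .select($"userId", explode($"pairs").as("pair"))
  .groupBy($"userId", $"pair.key")
  .agg(sum($"pair.value").as("total"))
  .groupBy($"userId")
  .agg(collectMetrics($"key", $"total").as("metrics"))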

The errors occur in the collectMetrics UDAF, in the following snippet:

key match {
  case "profiles.total" => updateMetrics(...)
  case "profiles.biz" => updateMetrics(...)
  case ProfileAttrsRegex(...) => updateMetrics(...)
}
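As a debugging sketch only (the regex shape and the updateMetrics
signature below are assumptions, not the original code), the same match
can be given a catch-all that spells out the offending key's code points
before failing, so corrupted bytes become readable in the log:

val ProfileAttrsRegex = """profiles\.(\d+)\.(\d+)""".r  // assumed shape

def updateMetrics(parts: String*): Unit = ()            // stub

def dispatch(key: String): Unit = key match {
  case "profiles.total"           => updateMetrics("total")
  case "profiles.biz"             => updateMetrics("biz")
  case ProfileAttrsRegex(cat, id) => updateMetrics(cat, id)
  case other =>
    // Spell out every character so NUL (0x00) and DEL (0x7f) bytes
    // show up as hex instead of rendering as ^@ / ^?.
    val codePoints = other.map(c => f"${c.toInt}%04x").mkString(" ")
    throw new IllegalStateException(
      s"unexpected key '$other' (code points: $codePoints)")
}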

... and I'm absolutely fine with the scala.MatchError itself, because
there is no "catch all" case in the pattern-matching expression, but the
strings containing corrupted characters seem very strange.

I've found the following JIRA issues, but it's hard to say whether they
are related to my case:
- https://issues.apache.org/jira/browse/SPARK-22092
- https://issues.apache.org/jira/browse/SPARK-23512

So I'm wondering: has anybody ever seen this kind of behaviour, and
what could be the problem?

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: DataFrames :: Corrupted Data

Posted by Sergey Zhemzhitsky <sz...@gmail.com>.
I suppose it's very unlikely that this issue is connected with string
encoding (a small check is sketched after this list), because:

- "pr^?files.10056.10040" should be "profiles.10056.10040" and is
defined as a constant in the source code;
- "profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@"
should not occur in the exception at all, because such strings are
never created within the job;
- the strings that get corrupted are defined within the job itself;
there is no such input data;
- when YARN restarts the job after the first failure, the second
attempt completes successfully.
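A minimal check along these lines (a sketch, not the original code):
decoding printable text with the wrong charset would still yield
printable characters, whereas the keys above contain NUL (^@, 0x00) and
DEL (^?, 0x7f) control bytes, which points at buffer corruption rather
than decoding:

// Flag keys carrying control bytes that charset confusion over
// printable text would not produce.
def looksCorrupted(key: String): Boolean = key.exists(_.isControl)

// The key from the log above, with the DEL byte written explicitly:
assert(looksCorrupted("pr\u007ffiles.10056.10040"))
assert(!looksCorrupted("profiles.10056.10040"))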




On Wed, Mar 28, 2018 at 10:31 PM, Jörn Franke <jo...@gmail.com> wrote:
> An encoding issue in the data? E.g. Spark uses UTF-8, but the source encoding is different?



Re: DataFrames :: Corrupted Data

Posted by Jörn Franke <jo...@gmail.com>.
An encoding issue in the data? E.g. Spark uses UTF-8, but the source encoding is different?
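One way to test this hypothesis (a sketch; df and the binary column
name "keyBytes" are hypothetical) is to read the field as raw bytes and
decode it with an explicit charset via the built-in decode function,
then compare against the UTF-8 reading:

import org.apache.spark.sql.functions.{col, decode}

val reDecoded = df.select(
  decode(col("keyBytes"), "ISO-8859-1").as("keyLatin1"),
  decode(col("keyBytes"), "UTF-8").as("keyUtf8"))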

