Posted to issues@spark.apache.org by "Saif Addin Ellafi (JIRA)" <ji...@apache.org> on 2015/10/08 20:49:27 UTC

[jira] [Updated] (SPARK-11009) RowNumber in HiveContext returns negative values in cluster mode

     [ https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saif Addin Ellafi updated SPARK-11009:
--------------------------------------
    Description: 
This issue happens when submitting the job to a standalone cluster; I have not tried YARN or Mesos. Repartitioning df into a single partition, or setting spark.default.parallelism=1, does not fix the issue. I also tried a cluster with only one node, with the same result. Other shuffle configuration changes do not alter the results either.

The issue does NOT happen with --master local[*].
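
For context, a minimal sketch of the two submission modes being compared; the app name and master URL below are placeholders, not taken from the original report:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.hive.HiveContext

        // Fails (negative row numbers) against a standalone master;
        // works with .setMaster("local[*]"). Host name is a placeholder.
        val conf = new SparkConf()
          .setAppName("rowNumberRepro")
          .setMaster("spark://master-host:7077")
        val sc = new SparkContext(conf)
        val sqlContext = new HiveContext(sc) // window functions require HiveContext in Spark 1.5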

        import org.apache.spark.sql.expressions.Window
        import org.apache.spark.sql.functions.rowNumber

        val ws = Window.
            partitionBy("client_id").
            orderBy("date")

        val nm = "repeatMe"
        // Keep the projected DataFrame; the original snippet discarded it and
        // then filtered df, which has no "repeatMe" column.
        val withRn = df.select(df.col("*"), rowNumber().over(ws).as(nm))

        withRn.filter(withRn(nm).isNotNull).orderBy(nm).take(50).foreach(println(_))
 
Output (client_id: Long, date: DateType, repeatMe: Int):
[219483904822,2006-06-01,-1863462909]
[219483904822,2006-09-01,-1863462909]
[219483904822,2007-01-01,-1863462909]
[219483904822,2007-08-01,-1863462909]
[219483904822,2007-07-01,-1863462909]
[192489238423,2007-07-01,-1863462774]
[192489238423,2007-02-01,-1863462774]
[192489238423,2006-11-01,-1863462774]
[192489238423,2006-08-01,-1863462774]
[192489238423,2007-08-01,-1863462774]
[192489238423,2006-09-01,-1863462774]
[192489238423,2007-03-01,-1863462774]
[192489238423,2006-10-01,-1863462774]
[192489238423,2007-05-01,-1863462774]
[192489238423,2006-06-01,-1863462774]
[192489238423,2006-12-01,-1863462774]
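
A quick sanity check, sketched here assuming the ws and df defined above (the column names rn, max_rn, and cnt are illustrative): row numbers within each partition should run from 1 to the partition size, so any negative value fails immediately. Given the report, the filter below should return rows in cluster mode and nothing in local[*].

        import org.apache.spark.sql.functions.{count, max, rowNumber}

        val rn  = df.select(df.col("client_id"), rowNumber().over(ws).as("rn"))
        val chk = rn.groupBy("client_id")
          .agg(max("rn").as("max_rn"), count("rn").as("cnt"))
        // Any row shown here violates the expected 1..partitionSize range.
        chk.filter(chk("max_rn") !== chk("cnt")).show()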


  was:
This issue happens when submitting the job to a standalone cluster; I have not tried YARN or Mesos. Repartitioning df into a single partition, or setting spark.default.parallelism=1, does not fix the issue. I also tried a cluster with only one node, with the same result. Other shuffle configuration changes do not alter the results either.

The issue does NOT happen with --master local[*].

        import org.apache.spark.sql.expressions.Window
        import org.apache.spark.sql.functions.rowNumber

        val ws = Window.
            partitionBy("client_id").
            orderBy("date")

        val nm = "repeatMe"
        // Keep the projected DataFrame; the original snippet discarded it and
        // then filtered df, which has no "repeatMe" column.
        val withRn = df.select(df.col("*"), rowNumber().over(ws).as(nm))

        withRn.filter(withRn(nm).isNotNull).orderBy(nm).take(50).foreach(println(_))
 
Output (client_id: Long, date: DateType, repeatMe: Int):
[200000000003,2006-06-01,-1863462909]
[200000000003,2006-09-01,-1863462909]
[200000000003,2007-01-01,-1863462909]
[200000000003,2007-08-01,-1863462909]
[200000000003,2007-07-01,-1863462909]
[200000000138,2007-07-01,-1863462774]
[200000000138,2007-02-01,-1863462774]
[200000000138,2006-11-01,-1863462774]
[200000000138,2006-08-01,-1863462774]
[200000000138,2007-08-01,-1863462774]
[200000000138,2006-09-01,-1863462774]
[200000000138,2007-03-01,-1863462774]
[200000000138,2006-10-01,-1863462774]
[200000000138,2007-05-01,-1863462774]
[200000000138,2006-06-01,-1863462774]
[200000000138,2006-12-01,-1863462774]



> RowNumber in HiveContext returns negative values in cluster mode
> ----------------------------------------------------------------
>
>                 Key: SPARK-11009
>                 URL: https://issues.apache.org/jira/browse/SPARK-11009
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.1
>         Environment: Standalone cluster mode
> No Hadoop/Hive is present in the environment (no hive-site.xml); only HiveContext is used. Spark was built with Hadoop 2.6.0.
> Default Spark configuration variables.
> The cluster has 4 nodes, but the issue occurs regardless of the number of nodes.
>            Reporter: Saif Addin Ellafi
>


