Posted to user@spark.apache.org by Warfish <se...@gmail.com> on 2015/07/31 18:01:22 UTC

Issues with JavaRDD.subtract(JavaRDD) method in local vs. cluster mode

Hi everyone,

I have been working with Spark for a little while now and have run into
a strange problem that is giving me headaches; it involves the
JavaRDD.subtract method. Consider the following piece of code:

    public static void main(String[] args) {
        // context is of type JavaSparkContext; FILE is the path to my
        // input file
        JavaRDD<String> rawTestSet  = context.textFile(FILE);
        JavaRDD<String> rawTestSet2 = context.textFile(FILE);

        // Gives 0 every time -> correct
        System.out.println("rawTestSetMinusRawTestSet2 = "
                + rawTestSet.subtract(rawTestSet2).count());

        // SearchData is a custom POJO that holds my data
        JavaRDD<SearchData> testSet  = convert(rawTestSet);
        JavaRDD<SearchData> testSet2 = convert(rawTestSet);
        JavaRDD<SearchData> testSet3 = convert(rawTestSet2);

        // These calls give numbers != 0 in cluster mode -> incorrect
        System.out.println("testSetMinusTestSet2  = "
                + testSet.subtract(testSet2).count());
        System.out.println("testSetMinusTestSet3  = "
                + testSet.subtract(testSet3).count());
        System.out.println("testSet2MinusTestSet3 = "
                + testSet2.subtract(testSet3).count());
    }

    private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
        return input.filter(new Matches("myRegex"))
                    .map(new DoSomething())
                    .map(new Split("mySplitParam"))
                    .map(new ToMap())
                    .map(new Clean())
                    .map(new ToSearchData());
    }

In this code, I read a file (usually from HDFS, but the same applies to
local disk) and then convert the Strings into custom objects that hold
the data, using a chain of filter and map operations. These objects are
simple POJOs with overridden hashCode() and equals() methods. I then
apply the subtract method to several JavaRDDs that contain exactly equal
data.

Note: I have omitted the POJO code and the filter and map functions to
keep the code concise, but I can post them later if the need arises.
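
As an illustration, a value-based POJO along the following lines would
satisfy that contract. This is only a sketch: the actual fields of
SearchData were not posted, so the String key and the Map of fields
below are placeholders.

    // Sketch of a SearchData-like POJO. For subtract() to work as
    // expected, equals() and hashCode() must be value-based; the
    // identity-based defaults inherited from Object differ between
    // executor JVMs.
    public class SearchData implements java.io.Serializable {
        private final String key;
        private final java.util.Map<String, String> fields;

        public SearchData(String key, java.util.Map<String, String> fields) {
            this.key = key;
            this.fields = fields;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SearchData)) return false;
            SearchData other = (SearchData) o;
            return key.equals(other.key) && fields.equals(other.fields);
        }

        @Override
        public int hashCode() {
            return 31 * key.hashCode() + fields.hashCode();
        }
    }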

The main method shown above contains several calls to the subtract
method, all of which should yield empty RDDs, because the data in all of
the RDDs should be exactly the same. This works when Spark runs in local
mode; however, when the code is executed on a cluster, the second block
of subtract calls does not yield empty sets, which tells me this is a
more complicated issue. The input data in local and cluster mode was
exactly the same.

Can someone shed some light on this issue? I feel like I'm overlooking
something rather obvious.




Re: Issues with JavaRDD.subtract(JavaRDD) method in local vs. cluster mode

Posted by Sebastian Kalix <se...@gmail.com>.
Thanks for the quick reply. I will be unable to collect more data until
Monday, but I will update the thread accordingly.

I am using Spark 1.4.0. Were there any related issues reported? I wasn't
able to find any, but I may have overlooked something. I have also
updated the original question to include the relevant Java files; maybe
the issue is hidden there somewhere.

Ted Yu <yu...@gmail.com> wrote on Fri, 31 July 2015 at 18:09:

> Can you call collect() and log the output to get a better idea of
> what is left?
>
> Which Spark release are you using?
>
> Cheers
>
> On Fri, Jul 31, 2015 at 9:01 AM, Warfish <se...@gmail.com> wrote:
>
>> [original message quoted in full; trimmed]

Re: Issues with JavaRDD.subtract(JavaRDD) method in local vs. cluster mode

Posted by Ted Yu <yu...@gmail.com>.
Can you call collect() and log the output to get a better idea of what
is left?
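
For example, something along these lines (just a sketch; the variable
names follow the code in the original post):

    // Sketch: inspect what subtract() leaves behind. Only do this
    // while the result is small, since collect() pulls everything to
    // the driver.
    java.util.List<SearchData> leftovers =
            testSet.subtract(testSet2).collect();
    for (SearchData sd : leftovers) {
        System.out.println("leftover element: " + sd);
    }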

Which Spark release are you using?

Cheers

On Fri, Jul 31, 2015 at 9:01 AM, Warfish <se...@gmail.com> wrote:

> [original message quoted in full; trimmed]