Posted to user@spark.apache.org by Rohit Verma <ro...@rokittech.com> on 2016/10/25 06:03:36 UTC

Help regarding reading text file within rdd operations

Hi Team,

Please help me with a scenario. I tried on Stack Overflow but got no response, so excuse me for mailing this list.

I have two string lists containing text file paths, List a and List b. I want to take the cartesian product of lists a and b to do a pairwise (cartesian) dataframe comparison.

The way I am trying it is to first do the cartesian product on the driver, transfer it to a pair RDD, and then apply the operation to each pair. These lists are small: a has ~50 elements, b ~1000 elements.

 List<String> a = Lists.newArrayList("/data/1.text", "/data/2.text", "/data/3.text");
 List<String> b = Lists.newArrayList("/data/4.text", "/data/5.text", "/data/6.text");

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
List<Tuple2<String,String>> cartesian = cartesian(a,b);
jsc.parallelizePairs(cartesian).filter(new Function<Tuple2<String, String>, Boolean>() {
        @Override public Boolean call(Tuple2<String, String> tup) throws Exception {
            Dataset<Row> text1 = spark.read().text(tup._1); // <-- this throws a NullPointerException
            Dataset<Row> text2 = spark.read().text(tup._2);
            return text1.first() == text2.first();          // <-- this is an indicative comparison only
        }
    });
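
For reference, cartesian(a, b) above is just a small driver-side helper that builds all path pairs, roughly like this (a sketch only; the exact implementation is not the issue):

import java.util.ArrayList;
import java.util.List;
import scala.Tuple2;

// Plain nested loop over the two path lists; nothing Spark-specific here.
private static List<Tuple2<String, String>> cartesian(List<String> a, List<String> b) {
    List<Tuple2<String, String>> pairs = new ArrayList<>(a.size() * b.size());
    for (String left : a) {
        for (String right : b) {
            pairs.add(new Tuple2<>(left, right));
        }
    }
    return pairs;
}
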
I could even use Spark to do the cartesian product itself, as below, but I believe the Spark overhead is higher in that case.

JavaRDD<String> sourceRdd = jsc.parallelize(a);
JavaRDD<String> allRdd = jsc.parallelize(b);

sourceRdd.cache().cartesian(allRdd).filter(new Function<Tuple2<String, String>, Boolean>() {
        @Override public Boolean call(Tuple2<String, String> tup) throws Exception {
            Dataset<Row> text1 = spark.read().text(tup._1);  // <-- same issue
            Dataset<Row> text2 = spark.read().text(tup._2);
            return text1.first() == text2.first();
        }
    });
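
To be explicit about the "indicative" comparison above: the real intent is a value comparison of the file contents (at minimum the first lines), not Row reference equality. A rough sketch of that intent, using the same Dataset/Row types (the helper name is mine):

// spark.read().text() yields a single string column, so getString(0) reads it;
// Row == Row above only compares object references.
private static boolean firstLinesEqual(Dataset<Row> text1, Dataset<Row> text2) {
    String line1 = text1.first().getString(0);
    String line2 = text2.first().getString(0);
    return line1.equals(line2);
}
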
Please suggest a good approach to handle this.

Regards
Rohit Verma

