Posted to user@spark.apache.org by Fernando Paladini <fn...@gmail.com> on 2015/10/29 22:30:23 UTC

Running FPGrowth over a JavaPairRDD?

Hello guys!

First of all, for a more readable version of this question, see my
StackOverflow post
<http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object>
(I've asked the same question there).

I want to test Spark's machine learning algorithms, and I have some questions
about how to run them with non-native data types. I'm going to run the
FPGrowth algorithm over my input because I want to get its most frequent
itemsets.

*My data is laid out as follows:*

[timestamp, sensor1value, sensor2value] # id: 0
[timestamp, sensor1value, sensor2value] # id: 1
[timestamp, sensor1value, sensor2value] # id: 2
[timestamp, sensor1value, sensor2value] # id: 3
...

As I need to use Java (Python's Spark API doesn't expose many of the
machine learning algorithms), this data structure isn't very easy to
handle or create.

*To achieve this data structure in Java I can visualize two approaches:*

   1. Use existing Java classes and data types to structure the input (I
   think some problems can occur in Spark depending on how complex my data is).
   2. Create my own class (I don't know whether it works with Spark algorithms).

1. Existing Java classes and data types

In order to do that I've created a *List<Tuple2<Long, List<Double>>>*, so I
can keep my data structured and also create an RDD:

List<Tuple2<Long, List<Double>>> algorithm_data =
        new ArrayList<Tuple2<Long, List<Double>>>();
populate(algorithm_data);
JavaPairRDD<Long, List<Double>> transactions =
        sc.parallelizePairs(algorithm_data);

I'm not comfortable with JavaPairRDD here, because the FPGrowth
algorithm does not seem to accept this data structure, as I will show
later in this post.

2. Create my own class

I could also create a new class to store the input properly:

public class PointValue {

    private long timestamp;
    private double sensorMeasure1;
    private double sensorMeasure2;

    // Constructor, getters and setters omitted...
}
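To see whether approach #2 is workable at all, here is a minimal sketch in plain Java (no Spark on this box): a serializable PointValue plus a helper that turns each point into the List<Double> "transaction" that FPGrowth consumes. The class layout follows the snippet above; the toTransaction method and the sample values are my own illustrative additions, and in Spark the loop would be a map() over a JavaRDD<PointValue>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PointValueSketch {

    // Serializable so Spark could ship it to executors.
    static class PointValue implements java.io.Serializable {
        private final long timestamp;
        private final double sensorMeasure1;
        private final double sensorMeasure2;

        PointValue(long timestamp, double sensorMeasure1, double sensorMeasure2) {
            this.timestamp = timestamp;
            this.sensorMeasure1 = sensorMeasure1;
            this.sensorMeasure2 = sensorMeasure2;
        }

        // One transaction per point: the sensor readings, without the timestamp.
        List<Double> toTransaction() {
            return Arrays.asList(sensorMeasure1, sensorMeasure2);
        }
    }

    public static void main(String[] args) {
        List<PointValue> points = Arrays.asList(
                new PointValue(0L, 1.5, 2.5),
                new PointValue(1L, 1.5, 3.0));

        // In Spark this would be pointsRdd.map(PointValue::toTransaction);
        // here we apply the same conversion locally.
        List<List<Double>> transactions = new ArrayList<List<Double>>();
        for (PointValue p : points) {
            transactions.add(p.toTransaction());
        }
        System.out.println(transactions);
    }
}
```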

However, I don't know whether I can do that and still use it with Spark's
algorithms without any problems (in other words, run Spark algorithms
without headaches). I'll focus on the first approach, but if you think
the second one is easier to achieve, please tell me.
My attempted solution (based on approach #1):

// Initializing Spark
SparkConf conf = new SparkConf().setAppName("FP-growth Example");
JavaSparkContext sc = new JavaSparkContext(conf);

// Getting data for the ML algorithm
List<Tuple2<Long, List<Double>>> algorithm_data =
        new ArrayList<Tuple2<Long, List<Double>>>();
populate(algorithm_data);
JavaPairRDD<Long, List<Double>> transactions =
        sc.parallelizePairs(algorithm_data);

// Running FPGrowth
FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
FPGrowthModel<Tuple2<Long, List<Double>>> model = fpg.run(transactions);

// Printing everything
for (FPGrowth.FreqItemset<Tuple2<Long, List<Double>>> itemset
        : model.freqItemsets().toJavaRDD().collect()) {
    System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
}

But then I got:

*The method run(JavaRDD<Basket>) in the type FPGrowth is not
applicable for the arguments (JavaPairRDD<Long,List<Double>>)*

*What can I do in order to solve my problem (run FPGrowth over
JavaPairRDD)?*

I'm happy to provide more information; just tell me exactly what you
need.
Thank you!
Fernando Paladini

Re: Running FPGrowth over a JavaPairRDD?

Posted by Sabarish Sasidharan <sa...@manthan.com>.
Hi

You cannot use a JavaPairRDD, but you can use a JavaRDD<List>. So in your
case, to make it work with the least change, you would call
run(transactions.values()).
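To illustrate what values() buys you, here is a local sketch in plain Java (no Spark available here, so SimpleEntry stands in for Tuple2 and a List stands in for the RDD): dropping the keys leaves exactly the list-of-baskets shape that FPGrowth expects. The sample numbers are made up for the demonstration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ValuesSketch {
    public static void main(String[] args) {
        // Stand-in for JavaPairRDD<Long, List<Double>>: (id, sensor values) pairs.
        List<Map.Entry<Long, List<Double>>> pairs = Arrays.asList(
                new SimpleEntry<Long, List<Double>>(0L, Arrays.asList(0.5, 1.0)),
                new SimpleEntry<Long, List<Double>>(1L, Arrays.asList(0.5, 2.0)));

        // What transactions.values() does: drop the keys, keep the baskets.
        List<List<Double>> baskets = new ArrayList<List<Double>>();
        for (Map.Entry<Long, List<Double>> pair : pairs) {
            baskets.add(pair.getValue());
        }
        System.out.println(baskets);

        // With Spark, the least-change fix would then be:
        //   FPGrowthModel<Double> model = fpg.run(transactions.values());
    }
}
```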

Each MLlib implementation typically has its own input data structure, and
you have to convert from your data structure before you invoke it. For
example, if you were doing regression on transactions you would instead
convert it to an RDD of LabeledPoint using transactions.map(). If you
wanted clustering you would convert it to an RDD of Vector.
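That per-algorithm conversion can be sketched locally too (plain Java, no Spark; a tiny Labeled class stands in for MLlib's LabeledPoint, and treating the first value as the label is an arbitrary choice for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class ConversionSketch {
    // Local stand-in for MLlib's LabeledPoint: a label plus a feature vector.
    static class Labeled {
        final double label;
        final double[] features;
        Labeled(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // What a transactions.map() for regression might do: pick one value as
    // the label and use the remaining values as features.
    static Labeled toLabeled(List<Double> values) {
        double label = values.get(0);
        double[] features = new double[values.size() - 1];
        for (int i = 1; i < values.size(); i++) {
            features[i - 1] = values.get(i);
        }
        return new Labeled(label, features);
    }

    public static void main(String[] args) {
        Labeled lp = toLabeled(Arrays.asList(3.5, 7.0, 9.0));
        System.out.println(lp.label + " -> " + Arrays.toString(lp.features));
    }
}
```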

And taking a step back, without knowing what you want to accomplish: what
your FP-growth snippet will tell you is which sensor values occur
together most frequently. That may or may not be what you are looking for.

Regards
Sab
On 30-Oct-2015 3:00 am, "Fernando Paladini" <fn...@gmail.com> wrote:

> Hello guys!
>
> First of all, if you want to take a look in a more readable question, take
> a look in my StackOverflow question
> <http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object>
> (I've made the same question there).
>
> I want to test Spark machine learning algorithms and I have some questions
> on how to run these algorithms with non-native data types. I'm going to run
> FPGrowth algorithm over the input because I want to get the most frequent
> itemsets for this input.
>
> *My data is disposed as the following:*
>
> [timestamp, sensor1value, sensor2value] # id: 0[timestamp, sensor1value, sensor2value] # id: 1[timestamp, sensor1value, sensor2value] # id: 2[timestamp, sensor1value, sensor2value] # id: 3...
>
> As I need to use Java (because Python doesn't have a lot of machine
> learning algorithms from Spark), this data structure isn't very easy to
> handle / create.
>
> *To achieve this data structure in Java I can visualize two approaches:*
>
>    1. Use existing Java classes and data types to structure the input (I
>    think some problems can occur in Spark depending on how complex is my data).
>    2. Create my own class (don't know if it works with Spark algorithms)
>
> 1. Existing Java classes and data types
>
> In order to do that I've created a* List<Tuple2<Long, List<Double>>>*, so
> I can keep my data structured and also can create a RDD:
>
> List<Tuple2<Long, List<Double>>> algorithm_data = new ArrayList<Tuple2<Long, List<Double>>>();
> populate(algorithm_data);JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);
>
> I don't feel okay with JavaPairRDD because FPGrowth algorithm seems to be not available for this data structure, as I will show you later in this post.
>
> 2. Create my own class
>
> I could also create a new class to store the input properly:
>
> public class PointValue {
>
>     private long timestamp;
>     private double sensorMeasure1;
>     private double sensorMeasure2;
>
>     // Constructor, getters and setters omitted...
> }
>
> However, I don't know if I can do that and still use it with Spark
> algorithms without any problems (in other words, running Spark algorithms
> without headaches). I'll focus in the first approach, but if you see that
> the second one is easier to achieve, please tell me.
> The solution (based on approach #1):
>
> // Initializing SparkSparkConf conf = new SparkConf().setAppName("FP-growth Example");JavaSparkContext sc = new JavaSparkContext(conf);
> // Getting data for ML algorithmList<Tuple2<Long, List<Double>>> algorithm_data = new ArrayList<Tuple2<Long, List<Double>>>();
> populate(algorithm_data);JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);
> // Running FPGrowthFPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);FPGrowthModel<Tuple2<Long, List<Double>>> model = fpg.run(transactions);
> // Printing everythingfor (FPGrowth.FreqItemset<Tuple2<Long, List<Double>>> itemset: model.freqItemsets().toJavaRDD().collect()) {
>     System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());}
>
> But then I got:
>
> *The method run(JavaRDD<Basket>) in the type FPGrowth is not applicable for the arguments (JavaPairRDD<Long,List<Double>>)*
>
> *What can I do in order to solve my problem (run FPGrowth over
> JavaPairRDD)?*
>
> I'm available to give you more information, just tell me exactly what you
> need.
> Thank you!
> Fernando Paladini
>