Posted to user@spark.apache.org by diplomatic Guru <di...@gmail.com> on 2016/02/01 12:29:04 UTC

Re: [MLlib] What is the best way to forecast the next month page visit?

Any suggestions please?


On 29 January 2016 at 22:31, diplomatic Guru <di...@gmail.com>
wrote:

> Hello guys,
>
> I'm trying to understand how I could predict next month's page views based
> on the previous access pattern.
>
> For example, I've collected statistics on page views:
>
> e.g.
> Page,UniqueView
> -------------------------
> pageA, 10000
> pageB, 999
> ...
> pageZ,200
>
> I aggregate the statistics monthly.
>
> I've prepared a file containing the last 3 months, like this:
>
> e.g.
> Page,UV_NOV, UV_DEC, UV_JAN
> ---------------------------------------------------
> pageA, 10000,9989,11000
> pageB, 999,500,700
> ...
> pageZ,200,50,34
>
>
> Based on the above information, I want to predict the next month (FEB).
>
> Which algorithm do you think would suit this best? I think linear regression
> is the safe bet. However, I'm struggling to prepare this data for LR,
> especially how to set up the X,Y relationship.
>
> The Y is easy (unique visitors), but I'm not sure about the X (it should be
> Page, right?). However, how do I plot those three months of data?
>
> Could you give me an example based on the example data above?
>
>
>
> Page,UV_NOV, UV_DEC, UV_JAN
> ---------------------------------------------------
> 1, 10000,9989,11000
> 2, 999,500,700
> ...
> 26,200,50,34
>
>
>
>
>
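One possible way to frame the X,Y question, sketched here in plain Java without Spark: for each page, treat the month index (0 = Nov, 1 = Dec, 2 = Jan) as X and the unique views as Y, fit a straight line by least squares, and read off the value at index 3 (Feb). The class and method names below (TrendForecast, forecastNext) are made up for illustration and are not part of MLlib; this is only a sketch under the assumption that a per-page linear trend is an acceptable first model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Per-page least-squares trend: X = month index (0, 1, 2, ...), Y = unique views. */
final class TrendForecast {

    /** Fits y = a + b * t over t = 0..n-1 and returns the fitted value at t = n. */
    static double forecastNext(double[] monthlyViews) {
        int n = monthlyViews.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two months of data");
        }
        double sumY = 0.0;
        for (double y : monthlyViews) {
            sumY += y;
        }
        double meanY = sumY / n;
        double meanT = (n - 1) / 2.0;

        // Slope b = cov(t, y) / var(t); intercept a = meanY - b * meanT.
        double covTY = 0.0;
        double varT = 0.0;
        for (int t = 0; t < n; t++) {
            covTY += (t - meanT) * (monthlyViews[t] - meanY);
            varT += (t - meanT) * (t - meanT);
        }
        double slope = covTY / varT;
        double intercept = meanY - slope * meanT;
        return intercept + slope * n;
    }

    public static void main(String[] args) {
        // Sample rows from the question: UV_NOV, UV_DEC, UV_JAN.
        Map<String, double[]> history = new LinkedHashMap<>();
        history.put("pageA", new double[] {10000, 9989, 11000});
        history.put("pageB", new double[] {999, 500, 700});
        history.put("pageZ", new double[] {200, 50, 34});

        for (Map.Entry<String, double[]> row : history.entrySet()) {
            System.out.printf("%s -> FEB forecast %.1f%n",
                    row.getKey(), forecastNext(row.getValue()));
        }
    }
}
```

With only three points per page the line is very sensitive to noise, and for a steadily shrinking page such as pageZ the extrapolated value comes out negative, so in practice the forecast would need clamping at zero or a different model.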

Re: [MLlib] What is the best way to forecast the next month page visit?

Posted by diplomatic Guru <di...@gmail.com>.
Hi Jorge,

Thanks for the example. I managed to get the job to run but the results are
appalling.

The best I could get is:
Test Mean Squared Error: 684.3709679595169
Learned regression tree model:
DecisionTreeModel regressor of depth 30 with 6905 nodes

I tried tweaking maxDepth and maxBins but I couldn't get any better results.

Do you know how I could improve the results?
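A mean squared error of about 684 corresponds to a root mean squared error of roughly 26 views, and whether that is bad depends on the scale of the page counts. One hedged way to judge it is to compare against a naive baseline such as "next month equals this month". The sketch below is plain Java with made-up names (ErrorMetrics, meanSquaredError) and uses the three sample pages from the original question, not the real data set.

```java
/** Mean squared error helper for comparing a model against a naive baseline. */
final class ErrorMetrics {

    /** Average of squared differences between actual and predicted values. */
    static double meanSquaredError(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        // Naive baseline: predict January with December's value (pageA, pageB, pageZ).
        double[] januaryActual = {11000, 700, 34};
        double[] decemberAsPrediction = {9989, 500, 50};

        double mse = meanSquaredError(januaryActual, decemberAsPrediction);
        System.out.printf("baseline MSE  = %.2f%n", mse);
        System.out.printf("baseline RMSE = %.2f%n", Math.sqrt(mse));
    }
}
```

If the tree's error on held-out data is not clearly below such a baseline, the encoding of the features is a more likely culprit than maxDepth or maxBins.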



On 5 February 2016 at 08:34, Jorge Machado <jo...@me.com> wrote:

> Hi,
>
> For example, as an array:
>
> 3 categories: Nov, Dec, Jan.
>
> Nov = 1,0,0
> Dec = 0,1,0
> Jan = 0,0,1
>
> For the complete year you would have 12 categories, e.g.
> Jan = 1,0,0,0,0,0,0,0,0,0,0,0
>
> Pages:
> PageA: 0,0,0,1
> PageB: 0,0,1,0
> PageC: 0,1,0,0
> PageD: 1,0,0,0
>
> If you are using a decision tree, I think you do not need to normalize the
> other values.
>
> At the end, for January and PageA, you should have something like:
>
> LabeledPoint(label, (0,0,1,0,0,0,1,1.0,2.0,3.0))
>
> Pass the LabeledPoint to the ML model and test it.
>
> PS: label is what you want to predict.
>
> On 02/02/2016, at 20:44, diplomatic Guru <di...@gmail.com> wrote:
>
> Hi Jorge,
>
> Unfortunately, I couldn't transform the data as you suggested.
>
> This is what I get:
>
> +---+---------+-------------+
> | id|pageIndex|      pageVec|
> +---+---------+-------------+
> |0.0|      3.0|    (3,[],[])|
> |1.0|      0.0|(3,[0],[1.0])|
> |2.0|      2.0|(3,[2],[1.0])|
> |3.0|      1.0|(3,[1],[1.0])|
> +---+---------+-------------+
>
>
> This is the snippet:
>
> JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
>         RowFactory.create(0.0, "PageA", 1.0, 2.0, 3.0),
>         RowFactory.create(1.0, "PageB", 4.0, 5.0, 6.0),
>         RowFactory.create(2.0, "PageC", 7.0, 8.0, 9.0),
>         RowFactory.create(3.0, "PageD", 10.0, 11.0, 12.0)));
>
> StructType schema = new StructType(new StructField[] {
>         new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
>         new StructField("page", DataTypes.StringType, false, Metadata.empty()),
>         new StructField("Nov", DataTypes.DoubleType, false, Metadata.empty()),
>         new StructField("Dec", DataTypes.DoubleType, false, Metadata.empty()),
>         new StructField("Jan", DataTypes.DoubleType, false, Metadata.empty()) });
>
> DataFrame df = sqlContext.createDataFrame(jrdd, schema);
>
> StringIndexerModel indexer = new StringIndexer()
>         .setInputCol("page")
>         .setInputCol("Nov")
>         .setInputCol("Dec")
>         .setInputCol("Jan")
>         .setOutputCol("pageIndex")
>         .fit(df);
>
> OneHotEncoder encoder = new OneHotEncoder()
>         .setInputCol("pageIndex")
>         .setOutputCol("pageVec");
>
> DataFrame indexed = indexer.transform(df);
> DataFrame encoded = encoder.transform(indexed);
> encoded.select("id", "pageIndex", "pageVec").show();
>
>
> Could you please let me know what I'm doing wrong?
>
>
> PS: My cluster is running Spark 1.3.0, which doesn't support
> StringIndexer or OneHotEncoder, but for testing this I've installed 1.6.0
> on my local machine.
>
> Cheer.

Re: [MLlib] What is the best way to forecast the next month page visit?

Posted by diplomatic Guru <di...@gmail.com>.
Hi Jorge,

Unfortunately, I couldn't transform the data as you suggested.

This is what I get:

+---+---------+-------------+
| id|pageIndex|      pageVec|
+---+---------+-------------+
|0.0|      3.0|    (3,[],[])|
|1.0|      0.0|(3,[0],[1.0])|
|2.0|      2.0|(3,[2],[1.0])|
|3.0|      1.0|(3,[1],[1.0])|
+---+---------+-------------+


This is the snippet:

JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
        RowFactory.create(0.0, "PageA", 1.0, 2.0, 3.0),
        RowFactory.create(1.0, "PageB", 4.0, 5.0, 6.0),
        RowFactory.create(2.0, "PageC", 7.0, 8.0, 9.0),
        RowFactory.create(3.0, "PageD", 10.0, 11.0, 12.0)));

StructType schema = new StructType(new StructField[] {
        new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("page", DataTypes.StringType, false, Metadata.empty()),
        new StructField("Nov", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("Dec", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("Jan", DataTypes.DoubleType, false, Metadata.empty()) });

DataFrame df = sqlContext.createDataFrame(jrdd, schema);

StringIndexerModel indexer = new StringIndexer()
        .setInputCol("page")
        .setInputCol("Nov")
        .setInputCol("Dec")
        .setInputCol("Jan")
        .setOutputCol("pageIndex")
        .fit(df);

OneHotEncoder encoder = new OneHotEncoder()
        .setInputCol("pageIndex")
        .setOutputCol("pageVec");

DataFrame indexed = indexer.transform(df);
DataFrame encoded = encoder.transform(indexed);
encoded.select("id", "pageIndex", "pageVec").show();


Could you please let me know what I'm doing wrong?


PS: My cluster is running Spark 1.3.0, which doesn't support StringIndexer
or OneHotEncoder, but for testing this I've installed 1.6.0 on my local
machine.

Cheer.
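
Two things probably explain the table above (this is a reading of the snippet, not something confirmed in the thread). First, each call to setInputCol replaces the previous value, so the indexer ends up reading only the "Jan" column rather than "page". Second, OneHotEncoder drops the last category by default, so four categories produce vectors of length 3 and index 3.0 becomes the empty vector (3,[],[]); calling setDropLast(false) on the encoder keeps all four slots. The plain-Java sketch below only mimics that drop-last behaviour; DropLastDemo and encode are hypothetical names, not Spark API.

```java
import java.util.Arrays;

/** Mimics one-hot encoding with and without dropping the last category. */
final class DropLastDemo {

    /**
     * Encodes a category index as a 0/1 vector. With dropLast, the vector has
     * numCategories - 1 slots and the last category is encoded as all zeros.
     */
    static double[] encode(int index, int numCategories, boolean dropLast) {
        if (index < 0 || index >= numCategories) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        int size = dropLast ? numCategories - 1 : numCategories;
        double[] vector = new double[size];
        if (index < size) {
            vector[index] = 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // The pageIndex values from the output table, in row order.
        int[] pageIndex = {3, 0, 2, 1};
        for (int index : pageIndex) {
            System.out.println(index
                    + " dropLast=true  " + Arrays.toString(encode(index, 4, true))
                    + " dropLast=false " + Arrays.toString(encode(index, 4, false)));
        }
    }
}
```

The all-zero row is therefore still a valid encoding: it identifies the fourth category by the absence of any set slot.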


On 2 February 2016 at 10:25, Jorge Machado <jo...@me.com> wrote:

> Hi Guru,
>
> Any results ? :)
>
> On 01/02/2016, at 14:34, diplomatic Guru <di...@gmail.com> wrote:
>
> Hi Jorge,
>
> Thank you for the reply and your example. I'll try your suggestion and
> will let you know the outcome.
>
> Cheers
>
>
> On 1 February 2016 at 13:17, Jorge Machado <jo...@me.com> wrote:
>
>> Hi Guru,
>>
>> First, transform your page names with OneHotEncoder (
>> https://spark.apache.org/docs/latest/ml-features.html#onehotencoder),
>> then do the same thing for the months.
>>
>> You will end up with something like this
>> (the first three are the page name, the others the month):
>> (0,0,1,0,0,1)
>>
>> Then you have your label, which is what you want to predict. At the end you
>> will have a LabeledPoint with (10000 -> (0,0,1,0,0,1)); this will represent
>> (10000 -> (PageA, UV_NOV)).
>> After that, try a regression tree with:
>>
>> val model = DecisionTree.trainRegressor(trainingData,
>> categoricalFeaturesInfo, impurity, maxDepth, maxBins)
>>
>>
>> Regards
>> Jorge

Re: [MLlib] What is the best way to forecast the next month page visit?

Posted by Jorge Machado <jo...@me.com>.
Hi Guru,

First, transform your page names with OneHotEncoder (https://spark.apache.org/docs/latest/ml-features.html#onehotencoder), then do the same thing for the months.

You will end up with something like this
	(the first three are the page name, the others the month):
	(0,0,1,0,0,1)

Then you have your label, which is what you want to predict. At the end you will have a LabeledPoint with (10000 -> (0,0,1,0,0,1)); this will represent (10000 -> (PageA, UV_NOV)).
After that, try a regression tree with:

val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)


Regards
Jorge
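
The reshaping described above can be sketched without Spark as follows: each (page, month) cell of the wide table becomes one example whose label is the unique-view count and whose features are the page one-hot followed by the month one-hot. The names OneHotRows, Example, oneHot and toExamples are hypothetical, and the index assignment here is simply row and column order, so the exact bit positions differ from the (0,0,1,0,0,1) example (Spark's StringIndexer assigns indices by label frequency).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Turns a wide page-by-month table into (label, one-hot features) examples. */
final class OneHotRows {

    /** One training example: the label (unique views) and its 0/1 features. */
    static final class Example {
        final double label;
        final double[] features;

        Example(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /** Vector of the given size with a single 1.0 at the given index. */
    static double[] oneHot(int index, int size) {
        double[] vector = new double[size];
        vector[index] = 1.0;
        return vector;
    }

    /** Rows are pages, columns are months; emits one example per cell. */
    static List<Example> toExamples(double[][] viewsByPageAndMonth) {
        int pages = viewsByPageAndMonth.length;
        int months = viewsByPageAndMonth[0].length;
        List<Example> examples = new ArrayList<>();
        for (int p = 0; p < pages; p++) {
            for (int m = 0; m < months; m++) {
                // Features = page one-hot followed by month one-hot.
                double[] features = new double[pages + months];
                System.arraycopy(oneHot(p, pages), 0, features, 0, pages);
                System.arraycopy(oneHot(m, months), 0, features, pages, months);
                examples.add(new Example(viewsByPageAndMonth[p][m], features));
            }
        }
        return examples;
    }

    public static void main(String[] args) {
        // pageA, pageB, pageZ over UV_NOV, UV_DEC, UV_JAN from the question.
        double[][] table = {
            {10000, 9989, 11000},
            {999, 500, 700},
            {200, 50, 34}
        };
        for (Example e : toExamples(table)) {
            System.out.println(e.label + " -> " + Arrays.toString(e.features));
        }
    }
}
```

One caveat with this encoding: a February slot would never be set in training data that ends in January, so a model trained on it has nothing to learn about that month and may not extrapolate sensibly.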
