Posted to user@mahout.apache.org by qiaoresearcher <qi...@gmail.com> on 2012/11/09 17:20:31 UTC

need help on mahout

Hi All,

Assume the data is stored in a gzip file which includes many text files.
Within each text file, each line represents an activity of a user, for
example, a click on a web page.
The text file will look like:
----------------------------------------------------------------------------------
user 1   time11  visiting_web_page11
user 2   time21  visiting_web_page21
user 1   time12  visiting_web_page12
user 1   time13  visiting_web_page13
user 2   time22  visiting_web_page22
user 3   time31  visiting_web_page31
user 1   time14  visiting_web_page14
 ...           ....                ..........

I am thinking of first constructing a web page set like
{ visiting_web_page11, visiting_web_page12, visiting_web_page31, ....... }

then for each user forming a vector [ 1  0 0  1 0  0  ..... ], where
'1' means the user visited that page and '0' means he did not,
and then using Mahout to classify the users based on these vectors.

Does Mahout have an example like this? If not, what kind of Java code do we
need to write for this task?

Thanks in advance for any suggestions!
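[Editor's note] The page-set and binary-vector construction described above can be sketched in plain Java. The class and method names here are invented for illustration, and it assumes the lines are tab-separated userId/time/page, which the sample data may not exactly be:

```java
import java.util.*;

// Sketch: build the page set and per-user binary visit vectors from
// lines of the form "<userId>\t<time>\t<page>". Names are illustrative.
public class UserVectors {
    public static Map<String, int[]> build(List<String> lines) {
        // First pass: collect the distinct pages and give each an index.
        LinkedHashMap<String, Integer> pageIndex = new LinkedHashMap<>();
        for (String line : lines) {
            String page = line.split("\t")[2];
            pageIndex.putIfAbsent(page, pageIndex.size());
        }
        // Second pass: one 0/1 vector per user over the page dimensions.
        Map<String, int[]> vectors = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            int[] v = vectors.computeIfAbsent(f[0], k -> new int[pageIndex.size()]);
            v[pageIndex.get(f[2])] = 1;  // visited => 1, default 0
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "user1\ttime11\tpage11",
            "user2\ttime21\tpage21",
            "user1\ttime12\tpage12");
        Map<String, int[]> v = build(lines);
        System.out.println(Arrays.toString(v.get("user1"))); // [1, 0, 1]
        System.out.println(Arrays.toString(v.get("user2"))); // [0, 1, 0]
    }
}
```

At scale you would read the lines out of the gzip stream and emit Mahout Vector objects instead of int arrays, but the two-pass index-then-fill shape stays the same.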

Re: need help on mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Hm. You just said you want to cluster it? That's what I suggested first,
but you said you wanted supervised classification. Clustering is
unsupervised (at least in the methods Mahout has...).


On Fri, Nov 9, 2012 at 9:58 AM, qiaoresearcher <qi...@gmail.com>wrote:

> Yes, it is. Does Mahout have an example, or a similar example, to do this:
> read the gzip file, construct the page set, form vectors for each user,
> then run as rabbit
>
> On Fri, Nov 9, 2012 at 11:47 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > That's a clustering problem, no?
> >
> >
> > On Fri, Nov 9, 2012 at 4:43 PM, qiaoresearcher <qiaoresearcher@gmail.com
> > >wrote:
> >
> > > It is a supervised classification problem.
> > >
> > > For example, a very simple case:
> > > say, overall we collect 4 pages from the data set:  { web_page 1
> >  web_page
> > > 2 web_page 3 web_page 4  }
> > > then users may have input vectors like:
> > > user1 [1 1  0  0]
> > > user2 [1 1  0  0]
> > > user3 [0 0  1  1]
> > > user4 [0 0  1  1]
> > > user5 [0 0  1  1]
> > >   ...       ....
> > >
> > > then whatever classification algorithm mahout has should return
> > > classification results as
> > > group 1 { user1, user2}
> > > group 2 { user3, user4, user5 }
> > >
> >
>

Re: need help on mahout

Posted by qiaoresearcher <qi...@gmail.com>.
Yes, it is. Does Mahout have an example, or a similar example, to do this:
read the gzip file, construct the page set, form vectors for each user, then
run as rabbit

On Fri, Nov 9, 2012 at 11:47 AM, Sean Owen <sr...@gmail.com> wrote:

> That's a clustering problem, no?
>
>
> On Fri, Nov 9, 2012 at 4:43 PM, qiaoresearcher <qiaoresearcher@gmail.com
> >wrote:
>
> > It is a supervised classification problem.
> >
> > For example, a very simple case:
> > say, overall we collect 4 pages from the data set:  { web_page 1
>  web_page
> > 2 web_page 3 web_page 4  }
> > then users may have input vectors like:
> > user1 [1 1  0  0]
> > user2 [1 1  0  0]
> > user3 [0 0  1  1]
> > user4 [0 0  1  1]
> > user5 [0 0  1  1]
> >   ...       ....
> >
> > then whatever classification algorithm mahout has should return
> > classification results as
> > group 1 { user1, user2}
> > group 2 { user3, user4, user5 }
> >
>

Re: need help on mahout

Posted by Sean Owen <sr...@gmail.com>.
That's a clustering problem, no?


On Fri, Nov 9, 2012 at 4:43 PM, qiaoresearcher <qi...@gmail.com>wrote:

> It is a supervised classification problem.
>
> For example, a very simple case:
> say, overall we collect 4 pages from the data set:  { web_page 1  web_page
> 2 web_page 3 web_page 4  }
> then users may have input vectors like:
> user1 [1 1  0  0]
> user2 [1 1  0  0]
> user3 [0 0  1  1]
> user4 [0 0  1  1]
> user5 [0 0  1  1]
>   ...       ....
>
> then whatever classification algorithm mahout has should return
> classification results as
> group 1 { user1, user2}
> group 2 { user3, user4, user5 }
>

Re: need help on mahout

Posted by Ted Dunning <te...@gmail.com>.
There is additional confusion typically because supervised and unsupervised
methods are commonly used together.  For instance, clustering
(unsupervised) can be used to generate cluster proximity features that are
then used as features for classification (supervised).

Another example might be where you use unsupervised clustering on the
labeled data, including the target variable along with the other features.
This is an unsupervised algorithm, but it is used in such a way that it can
see the target variable, so it is doing a strange sort of mixed thing.
The resulting cluster proximity features can be very high quality.

You can even do semi-supervised clustering with training data that is only
partially labeled.

It isn't surprising that these distinctions are a bit fuzzy at first.

On Fri, Nov 9, 2012 at 2:11 PM, Pat Ferrel <pa...@gmail.com> wrote:

> The confusion here may be over the term "supervised"
>
> Supervised classification assumes you know which group each user is in,
> and the classifier builds a model to classify new users into the predefined
> groups. Usually there is a classifier for each group that, when given a
> user vector, returns how likely the user is to be a member of that group.
>
> Clustering is an unsupervised classifier which sees the groups without
> being told which user is in which group. It does this by finding structure
> in the data itself.
>
> If you don't know the groups ahead of time you want to cluster. If you are
> classifying users based on known groups of previous users you want to build
> a classifier and mahout has both.
>
> You probably need to create the vectors using Mahout code. Your matrix of
> users and pages visited could be very large and sparse (lots of pages not
> visited). So representing as a .csv is not scalable. Look at the various
> Vector classes in Mahout. Once you get the data into a vector mahout can
> cluster the data or build a supervised classifier.
>
> There is a very nice description of the Mahout Vector types, clustering,
> and classification in "Mahout in Action", a book from Manning Publications.
> Read section 8.1.1, "Transforming data into vectors"; the rest of the
> chapter talks about clustering, but a section further along covers
> classification.
>
> On Nov 9, 2012, at 1:44 PM, qiaoresearcher <qi...@gmail.com>
> wrote:
>
> Many thanks, I may need some time to digest the information you
> provided... :-)
>
> have a nice weekend,
>
>
> On Fri, Nov 9, 2012 at 3:34 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > No, SGD (stochastic gradient descent) and factorization are two different
> > things. More strictly, those are two different classes of problems:
> > factorization and regression. SGD is one implementation of regression
> > classification. Factorization is finding virtual factors in a user/item
> > space (ALS-WR is one of the methods to find such factors).
> >
> > Yes, SGD is in the book, but not with your example specifically, since I
> > meant to apply it after you find the latent variables (factors, whatever).
> >
> > You will get more help on the ALS-WR method by staying on the list, and
> > also perhaps create an archive entry for others to follow in a similar
> > situation. The idea is that we all learn together more effectively :)
> > (and I score more points for support :)
> >
> > CVB (if I am not totally off) is something called continuous variational
> > Bayes, an implementation of LDA (Latent Dirichlet Allocation), which may
> > help you to analyze the content of your web pages IF you manage to grab
> > the text off of them. In Mahout, it is facilitated by a package here:
> > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
> > I don't know where exactly the wiki help on CVB is, but searching the
> > mahout archive and Stack Overflow may help. Again, by staying on the list
> > you may get more help with that.
> >
> > LSA (Latent Semantic Analysis) is another way to analyze the content of
> > your web pages. See the Wikipedia article for a refresher; basically it
> > is a run of SVD over tf-idf of unigrams, bigrams, etc. Mahout has a
> > general pipeline to prepare that content data with the seqdirectory and
> > seq2sparse commands (again, you can find details in the book). Then you
> > just run 'mahout ssvd <options>' on the output of seq2sparse and use rows
> > of the U*Sigma output for the topical allocation values. Somebody will
> > probably correct me on this, but I think you can use topical allocation
> > values to further build your classification with regressions (SGD).
> >
> > -d
> >
> >
> > On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <qiaoresearcher@gmail.com
> >> wrote:
> >
> >> Hi Dmitriy,
> >>
> >> Many thanks for your comments; I really appreciate it, although I think I
> >> may not have fully understood you.
> >>
> >> As I understand it, SGD means stochastic gradient descent, is that right?
> >> What I need now is some example code to: read the files, construct the
> >> web page set, then form the vectors. Such steps are called 'factorization'
> >> in Mahout, right?
> >>
> >> Do you mean Mahout in Action has examples similar to what I described?
> >> What are CVB and LSA, and is SSVD singular value decomposition?
> >>
> >>
> >>
> >
>
>

Re: need help on mahout

Posted by Pat Ferrel <pa...@gmail.com>.
The confusion here may be over the term "supervised".

Supervised classification assumes you know which group each user is in, and the classifier builds a model to classify new users into the predefined groups. Usually there is a classifier for each group that, when given a user vector, returns how likely the user is to be a member of that group.

Clustering is an unsupervised classifier which sees the groups without being told which user is in which group. It does this by finding structure in the data itself.

If you don't know the groups ahead of time, you want to cluster. If you are classifying users based on known groups of previous users, you want to build a classifier, and Mahout has both.

You probably need to create the vectors using Mahout code. Your matrix of users and pages visited could be very large and sparse (lots of pages not visited), so representing it as a .csv is not scalable. Look at the various Vector classes in Mahout. Once you get the data into vectors, Mahout can cluster the data or build a supervised classifier.

There is a very nice description of the Mahout Vector types, clustering, and classification in "Mahout in Action", a book from Manning Publications. Read section 8.1.1, "Transforming data into vectors"; the rest of the chapter talks about clustering, but a section further along covers classification.
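[Editor's note] Pat's point about sparsity can be illustrated with a minimal map-backed sketch. Mahout's own RandomAccessSparseVector is the class to reach for in practice; this stdlib stand-in (all names invented) only shows the idea of storing just the non-zero entries:

```java
import java.util.*;

// Sketch of the sparse representation Pat describes: store only the
// non-zero entries, so a user who visited 10 of 1,000,000 pages costs
// 10 map entries, not a million slots. Mahout's RandomAccessSparseVector
// plays this role; this stand-in just illustrates the idea.
public class SparseVector {
    private final int cardinality;                 // logical dimension
    private final Map<Integer, Double> nonZeros = new HashMap<>();

    public SparseVector(int cardinality) { this.cardinality = cardinality; }

    public void set(int index, double value) {
        if (value == 0.0) nonZeros.remove(index);  // keep the map minimal
        else nonZeros.put(index, value);
    }

    public double get(int index) { return nonZeros.getOrDefault(index, 0.0); }

    public int getNumNonZeroElements() { return nonZeros.size(); }

    public static void main(String[] args) {
        SparseVector user = new SparseVector(1_000_000); // one slot per page
        user.set(3, 1.0);      // visited page 3
        user.set(99_999, 1.0); // visited page 99999
        System.out.println(user.get(3));                  // 1.0
        System.out.println(user.get(42));                 // 0.0
        System.out.println(user.getNumNonZeroElements()); // 2
    }
}
```

The memory cost tracks the number of visits rather than the number of pages, which is what makes the user/page matrix workable at scale.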

On Nov 9, 2012, at 1:44 PM, qiaoresearcher <qi...@gmail.com> wrote:

Many thanks, I may need some time to digest the information you
provided... :-)

have a nice weekend,


On Fri, Nov 9, 2012 at 3:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> No, SGD (stochastic gradient descent) and factorization are two different
> things. More strictly, those are two different classes of problems:
> factorization and regression. SGD is one implementation of regression
> classification. Factorization is finding virtual factors in a user/item
> space (ALS-WR is one of the methods to find such factors).
>
> Yes, SGD is in the book, but not with your example specifically, since I
> meant to apply it after you find the latent variables (factors, whatever).
>
> You will get more help on the ALS-WR method by staying on the list, and
> also perhaps create an archive entry for others to follow in a similar
> situation. The idea is that we all learn together more effectively :)
> (and I score more points for support :)
>
> CVB (if I am not totally off) is something called continuous variational
> Bayes, an implementation of LDA (Latent Dirichlet Allocation), which may
> help you to analyze the content of your web pages IF you manage to grab
> the text off of them. In Mahout, it is facilitated by a package here:
> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
> I don't know where exactly the wiki help on CVB is, but searching the
> mahout archive and Stack Overflow may help. Again, by staying on the list
> you may get more help with that.
>
> LSA (Latent Semantic Analysis) is another way to analyze the content of
> your web pages. See the Wikipedia article for a refresher; basically it
> is a run of SVD over tf-idf of unigrams, bigrams, etc. Mahout has a
> general pipeline to prepare that content data with the seqdirectory and
> seq2sparse commands (again, you can find details in the book). Then you
> just run 'mahout ssvd <options>' on the output of seq2sparse and use rows
> of the U*Sigma output for the topical allocation values. Somebody will
> probably correct me on this, but I think you can use topical allocation
> values to further build your classification with regressions (SGD).
> 
> -d
> 
> 
> On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <qiaoresearcher@gmail.com
>> wrote:
> 
>> Hi Dmitriy,
>> 
>> Many thanks for your comments; I really appreciate it, although I think I
>> may not have fully understood you.
>>
>> As I understand it, SGD means stochastic gradient descent, is that right?
>> What I need now is some example code to: read the files, construct the
>> web page set, then form the vectors. Such steps are called 'factorization'
>> in Mahout, right?
>>
>> Do you mean Mahout in Action has examples similar to what I described?
>> What are CVB and LSA, and is SSVD singular value decomposition?
>> 
>> 
>> 
> 


Re: need help on mahout

Posted by qiaoresearcher <qi...@gmail.com>.
Many thanks, I may need some time to digest the information you
provided... :-)

have a nice weekend,


On Fri, Nov 9, 2012 at 3:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> No, SGD (stochastic gradient descent) and factorization are two different
> things. More strictly, those are two different classes of problems:
> factorization and regression. SGD is one implementation of regression
> classification. Factorization is finding virtual factors in a user/item
> space (ALS-WR is one of the methods to find such factors).
>
> Yes, SGD is in the book, but not with your example specifically, since I
> meant to apply it after you find the latent variables (factors, whatever).
>
> You will get more help on the ALS-WR method by staying on the list, and
> also perhaps create an archive entry for others to follow in a similar
> situation. The idea is that we all learn together more effectively :)
> (and I score more points for support :)
>
> CVB (if I am not totally off) is something called continuous variational
> Bayes, an implementation of LDA (Latent Dirichlet Allocation), which may
> help you to analyze the content of your web pages IF you manage to grab
> the text off of them. In Mahout, it is facilitated by a package here:
> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
> I don't know where exactly the wiki help on CVB is, but searching the
> mahout archive and Stack Overflow may help. Again, by staying on the list
> you may get more help with that.
>
> LSA (Latent Semantic Analysis) is another way to analyze the content of
> your web pages. See the Wikipedia article for a refresher; basically it
> is a run of SVD over tf-idf of unigrams, bigrams, etc. Mahout has a
> general pipeline to prepare that content data with the seqdirectory and
> seq2sparse commands (again, you can find details in the book). Then you
> just run 'mahout ssvd <options>' on the output of seq2sparse and use rows
> of the U*Sigma output for the topical allocation values. Somebody will
> probably correct me on this, but I think you can use topical allocation
> values to further build your classification with regressions (SGD).
>
> -d
>
>
> On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <qiaoresearcher@gmail.com
> >wrote:
>
> > Hi Dmitriy,
> >
> > Many thanks for your comments; I really appreciate it, although I think I
> > may not have fully understood you.
> >
> > As I understand it, SGD means stochastic gradient descent, is that right?
> > What I need now is some example code to: read the files, construct the
> > web page set, then form the vectors. Such steps are called 'factorization'
> > in Mahout, right?
> >
> > Do you mean Mahout in Action has examples similar to what I described?
> > What are CVB and LSA, and is SSVD singular value decomposition?
> >
> >
> >
>

Re: need help on mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Another correction: CVB = *collapsed* variational Bayes.


On Fri, Nov 9, 2012 at 1:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Correction: with LSA you probably want to use rows of U or U*sqrt(Sigma)
> (ssvd --uHalfSigma option), not U*Sigma.
>
>
> On Fri, Nov 9, 2012 at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> No, SGD (stochastic gradient descent) and factorization are two different
>> things. More strictly, those are two different classes of problems:
>> factorization and regression. SGD is one implementation of regression
>> classification. Factorization is finding virtual factors in a user/item
>> space (ALS-WR is one of the methods to find such factors).
>>
>> Yes, SGD is in the book, but not with your example specifically, since I
>> meant to apply it after you find the latent variables (factors, whatever).
>>
>> You will get more help on the ALS-WR method by staying on the list, and
>> also perhaps create an archive entry for others to follow in a similar
>> situation. The idea is that we all learn together more effectively :)
>> (and I score more points for support :)
>>
>> CVB (if I am not totally off) is something called continuous variational
>> Bayes, an implementation of LDA (Latent Dirichlet Allocation), which may
>> help you to analyze the content of your web pages IF you manage to grab
>> the text off of them. In Mahout, it is facilitated by a package here:
>> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
>> I don't know where exactly the wiki help on CVB is, but searching the
>> mahout archive and Stack Overflow may help. Again, by staying on the list
>> you may get more help with that.
>>
>> LSA (Latent Semantic Analysis) is another way to analyze the content of
>> your web pages. See the Wikipedia article for a refresher; basically it
>> is a run of SVD over tf-idf of unigrams, bigrams, etc. Mahout has a
>> general pipeline to prepare that content data with the seqdirectory and
>> seq2sparse commands (again, you can find details in the book). Then you
>> just run 'mahout ssvd <options>' on the output of seq2sparse and use rows
>> of the U*Sigma output for the topical allocation values. Somebody will
>> probably correct me on this, but I think you can use topical allocation
>> values to further build your classification with regressions (SGD).
>>
>> -d
>>
>>
>> On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <qi...@gmail.com>wrote:
>>
>>> Hi Dmitriy,
>>>
>>> Many thanks for your comments; I really appreciate it, although I think I
>>> may not have fully understood you.
>>>
>>> As I understand it, SGD means stochastic gradient descent, is that right?
>>> What I need now is some example code to: read the files, construct the
>>> web page set, then form the vectors. Such steps are called 'factorization'
>>> in Mahout, right?
>>>
>>> Do you mean Mahout in Action has examples similar to what I described?
>>> What are CVB and LSA, and is SSVD singular value decomposition?
>>>
>>>
>>>
>

Re: need help on mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Correction: with LSA you probably want to use rows of U or U*sqrt(Sigma)
(ssvd --uHalfSigma option), not U*Sigma.


On Fri, Nov 9, 2012 at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> No, SGD (stochastic gradient descent) and factorization are two different
> things. More strictly, those are two different classes of problems:
> factorization and regression. SGD is one implementation of regression
> classification. Factorization is finding virtual factors in a user/item
> space (ALS-WR is one of the methods to find such factors).
>
> Yes, SGD is in the book, but not with your example specifically, since I
> meant to apply it after you find the latent variables (factors, whatever).
>
> You will get more help on the ALS-WR method by staying on the list, and
> also perhaps create an archive entry for others to follow in a similar
> situation. The idea is that we all learn together more effectively :)
> (and I score more points for support :)
>
> CVB (if I am not totally off) is something called continuous variational
> Bayes, an implementation of LDA (Latent Dirichlet Allocation), which may
> help you to analyze the content of your web pages IF you manage to grab
> the text off of them. In Mahout, it is facilitated by a package here:
> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
> I don't know where exactly the wiki help on CVB is, but searching the
> mahout archive and Stack Overflow may help. Again, by staying on the list
> you may get more help with that.
>
> LSA (Latent Semantic Analysis) is another way to analyze the content of
> your web pages. See the Wikipedia article for a refresher; basically it
> is a run of SVD over tf-idf of unigrams, bigrams, etc. Mahout has a
> general pipeline to prepare that content data with the seqdirectory and
> seq2sparse commands (again, you can find details in the book). Then you
> just run 'mahout ssvd <options>' on the output of seq2sparse and use rows
> of the U*Sigma output for the topical allocation values. Somebody will
> probably correct me on this, but I think you can use topical allocation
> values to further build your classification with regressions (SGD).
>
> -d
>
>
> On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <qi...@gmail.com>wrote:
>
>> Hi Dmitriy,
>>
>> Many thanks for your comments; I really appreciate it, although I think I
>> may not have fully understood you.
>>
>> As I understand it, SGD means stochastic gradient descent, is that right?
>> What I need now is some example code to: read the files, construct the
>> web page set, then form the vectors. Such steps are called 'factorization'
>> in Mahout, right?
>>
>> Do you mean Mahout in Action has examples similar to what I described?
>> What are CVB and LSA, and is SSVD singular value decomposition?
>>
>>
>>

Re: need help on mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
No, SGD (stochastic gradient descent) and factorization are two different
things. More strictly, those are two different classes of problems:
factorization and regression. SGD is one implementation of regression
classification. Factorization is finding virtual factors in a user/item
space (ALS-WR is one of the methods to find such factors).

Yes, SGD is in the book, but not with your example specifically, since I meant
to apply it after you find the latent variables (factors, whatever).

You will get more help on the ALS-WR method by staying on the list, and also
perhaps create an archive entry for others to follow in a similar
situation. The idea is that we all learn together more effectively :) (and I
score more points for support :)

CVB (if I am not totally off) is something called continuous variational
Bayes, an implementation of LDA (Latent Dirichlet Allocation), which may help
you to analyze the content of your web pages IF you manage to grab the text
off of them. In Mahout, it is facilitated by a package here:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
I don't know where exactly the wiki help on CVB is, but searching the mahout
archive and Stack Overflow may help. Again, by staying on the list you may get
more help with that.

LSA (Latent Semantic Analysis) is another way to analyze the content of your
web pages. See the Wikipedia article for a refresher; basically it is a run of
SVD over tf-idf of unigrams, bigrams, etc. Mahout has a general pipeline to
prepare that content data with the seqdirectory and seq2sparse commands
(again, you can find details in the book). Then you just run 'mahout ssvd
<options>' on the output of seq2sparse and use rows of the U*Sigma output for
the topical allocation values. Somebody will probably correct me on this, but
I think you can use topical allocation values to further build your
classification with regressions (SGD).

-d
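[Editor's note] As a rough illustration of the SGD regression step mentioned above, here is a tiny logistic-regression trainer over 0/1 visit vectors. This is plain Java, not Mahout's actual SGD API; the data and labels are the toy example from earlier in the thread:

```java
// Sketch: logistic regression trained by stochastic gradient descent on
// 0/1 page-visit vectors. This shows the idea behind Mahout's SGD
// classifier, not its actual API; data and labels are invented.
public class SgdSketch {
    public static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // One SGD step per example, repeated for a few epochs.
    public static double[] train(double[][] x, int[] y, int epochs, double rate) {
        double[] w = new double[x[0].length];
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                double z = 0;
                for (int j = 0; j < w.length; j++) z += w[j] * x[i][j];
                double err = y[i] - sigmoid(z);          // log-loss gradient
                for (int j = 0; j < w.length; j++) w[j] += rate * err * x[i][j];
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // user1/user2 visited pages 1-2 (group 1); user3-5 pages 3-4 (group 0)
        double[][] x = {{1,1,0,0},{1,1,0,0},{0,0,1,1},{0,0,1,1},{0,0,1,1}};
        int[] y = {1, 1, 0, 0, 0};
        double[] w = train(x, y, 200, 0.5);
        System.out.println(sigmoid(w[0] + w[1]) > 0.5); // new "pages 1-2" user
        System.out.println(sigmoid(w[2] + w[3]) < 0.5); // new "pages 3-4" user
    }
}
```

The same weights-update loop is what an SGD classifier runs over the factor vectors once you have them, just with better loss handling, regularization, and learning-rate schedules.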


On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <qi...@gmail.com>wrote:

> Hi Dmitriy,
>
> Many thanks for your comments; I really appreciate it, although I think I
> may not have fully understood you.
>
> As I understand it, SGD means stochastic gradient descent, is that right?
> What I need now is some example code to: read the files, construct the
> web page set, then form the vectors. Such steps are called 'factorization'
> in Mahout, right?
>
> Do you mean Mahout in Action has examples similar to what I described?
> What are CVB and LSA, and is SSVD singular value decomposition?
>
>
>

Re: need help on mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
OK, I guess you can try factorization first (against users vs. pages) and
then try to run the user factor vectors as predictors with SGD. However, it
will not work well if your user/page matrix is too sparse. IMO you need to
prototype this approach in R first, before moving to scale, to see if you
can even get an acceptable result.


On Fri, Nov 9, 2012 at 9:06 AM, qiaoresearcher <qi...@gmail.com>wrote:

> You are absolutely right, but here I have simplified the problem. Content
> similarity can be regarded as one way to enrich the features. Features can
> be defined in many ways; here I would like to start with the simplest
> feature: visited or not. Later on I will add more features if the results
> cannot meet expectations.
>
> On Fri, Nov 9, 2012 at 10:57 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Sorry, you probably meant that anyway. Your training input should be
> > labeled by groups, and your prediction request input is not labeled.
> >
> > Looks like a job for a classifier like SGD, except visited pages make up
> > a poor categorical source without looking into their content similarities.
> > On Nov 9, 2012 8:49 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
> >
> > > If it is supervised classification, your input should contain the
> > > groups. The idea is that you extend knowledge hidden in a smaller,
> > > perhaps expert-labeled, dataset to the rest of the universe.
> > > On Nov 9, 2012 8:43 AM, "qiaoresearcher" <qi...@gmail.com>
> > wrote:
> > >
> > >> It is a supervised classification problem.
> > >>
> > >> For example, a very simple case:
> > >> say, overall we collect 4 pages from the data set:  { web_page 1
> >  web_page
> > >> 2 web_page 3 web_page 4  }
> > >> then users may have input vectors like:
> > >> user1 [1 1  0  0]
> > >> user2 [1 1  0  0]
> > >> user3 [0 0  1  1]
> > >> user4 [0 0  1  1]
> > >> user5 [0 0  1  1]
> > >>   ...       ....
> > >>
> > >> then whatever classification algorithm mahout has should return
> > >> classification results as
> > >> group 1 { user1, user2}
> > >> group 2 { user3, user4, user5 }
> > >>
> > >>
> > >>
> > >> On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <sr...@gmail.com> wrote:
> > >>
> > >> > First: what question are you trying to answer from this data? You
> are
> > >> > trying to classify users into what, for what purpose?
> > >> >
> > >> >
> > >> > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <
> > >> qiaoresearcher@gmail.com
> > >> > >wrote:
> > >> >
> > >> > > Hi All,
> > >> > >
> > >> > > Assume the data is stored in a gzip file which includes many text
> > >> files.
> > >> > > Within each text file, each line represents an activity of a user,
> > for
> > >> > > example, a click on a web page.
> > >> > > the text file will look like:
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> ----------------------------------------------------------------------------------
> > >> > > user 1   time11  visiting_web_page11
> > >> > > user 2   time21  visiting_web_page21
> > >> > > user 1   time12  visiting_web_page12
> > >> > > user 1   time13  visiting_web_page13
> > >> > > user 2   time22  visiting_web_page22
> > >> > > user 3   time31  visiting_web_page31
> > >> > > user 1   time14  visiting_web_page14
> > >> > >  ...           ....                ..........
> > >> > >
> > >> > > I am thinking to first construct a web page set like
> > >> > > { visiting_web_page11, visiting_web_page12, visiting_web_page31,
> > >> .......
> > >> > }
> > >> > >
> > >> > > then for each user, we form a vector [ 1  0 0  1 0  0  .....    ]
> > >>  where
> > >> > > '1' means the user visited that page and 0 means he did not
> > >> > > then use mahout to classify the users based on the vectors
> > >> > >
> > >> > > Does Mahout have an example like this? If not, what kind of Java
> > >> > > code do we need to write for this task?
> > >> > >
> > >> > > thanks for any suggestions in advance !
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: need help on mahout

Posted by qiaoresearcher <qi...@gmail.com>.
You are absolutely right, but here I have simplified the problem. Content
similarity can be regarded as one way to enrich the features. Features can be
defined in many ways; here I would like to start with the simplest feature:
visited or not. Later on I will add more features if the results cannot
meet expectations.

On Fri, Nov 9, 2012 at 10:57 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Sorry, you probably meant that anyway. Your training input should be labeled
> by groups, and your prediction request input is not labeled.
>
> Looks like a job for a classifier like SGD, except visited pages make up a
> poor categorical source without looking into their content similarities.
> On Nov 9, 2012 8:49 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:
>
> > If it is supervised classification, your input should contain the groups.
> > The idea is that you extend knowledge hidden in a smaller, perhaps
> > expert-labeled, dataset to the rest of the universe.
> > On Nov 9, 2012 8:43 AM, "qiaoresearcher" <qi...@gmail.com>
> wrote:
> >
> >> It is a supervised classification problem.
> >>
> >> For example, a very simple case:
> >> say, overall we collect 4 pages from the data set:  { web_page 1
>  web_page
> >> 2 web_page 3 web_page 4  }
> >> then users may have input vectors like:
> >> user1 [1 1  0  0]
> >> user2 [1 1  0  0]
> >> user3 [0 0  1  1]
> >> user4 [0 0  1  1]
> >> user5 [0 0  1  1]
> >>   ...       ....
> >>
> >> then whatever classification algorithm mahout has should return
> >> classification results as
> >> group 1 { user1, user2}
> >> group 2 { user3, user4, user5 }
> >>
> >>
> >>
> >> On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <sr...@gmail.com> wrote:
> >>
> >> > First: what question are you trying to answer from this data? You are
> >> > trying to classify users into what, for what purpose?
> >> >
> >> >
> >> > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <
> >> qiaoresearcher@gmail.com
> >> > >wrote:
> >> >
> >> > > Hi All,
> >> > >
> >> > > Assume the data is stored in a gzip file which includes many text
> >> files.
> >> > > Within each text file, each line represents an activity of a user,
> for
> >> > > example, a click on a web page.
> >> > > the text file will look like:
> >> > >
> >> > >
> >> >
> >>
> ----------------------------------------------------------------------------------
> >> > > user 1   time11  visiting_web_page11
> >> > > user 2   time21  visiting_web_page21
> >> > > user 1   time12  visiting_web_page12
> >> > > user 1   time13  visiting_web_page13
> >> > > user 2   time22  visiting_web_page22
> >> > > user 3   time31  visiting_web_page31
> >> > > user 1   time14  visiting_web_page14
> >> > >  ...           ....                ..........
> >> > >
> >> > > I am thinking to first construct a web page set like
> >> > > { visiting_web_page11, visiting_web_page12, visiting_web_page31,
> >> .......
> >> > }
> >> > >
> >> > > then for each user, we form a vector [ 1  0 0  1 0  0  .....    ]
> >>  where
> >> > > '1' means the user visited that page and 0 means he did not
> >> > > then use mahout to classify the users based on the vectors
> >> > >
> >> > > does mahout has example like this? if not, what kind of java code we
> >> need
> >> > > to write to process this task?
> >> > >
> >> > > thanks for any suggestions in advance !
> >> > >
> >> >
> >>
> >
>

Re: need help on mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Sorry, you probably meant that anyway: your training input should be labeled with groups, and your prediction input is unlabeled.

It looks like a job for a classifier such as SGD, except that visited pages make a poor categorical feature source without looking at their content similarities.
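A dependency-free sketch of what an SGD classifier does with the toy vectors from this thread (in Mahout itself this role is played by org.apache.mahout.classifier.sgd.OnlineLogisticRegression; the hand-rolled loop below only illustrates the idea, it is not Mahout code):

```java
public class ToySgd {
    public static void main(String[] args) {
        // The toy user vectors from this thread: 1 = page visited, 0 = not.
        double[][] x = {
            {1, 1, 0, 0},   // user1
            {1, 1, 0, 0},   // user2
            {0, 0, 1, 1},   // user3
            {0, 0, 1, 1},   // user4
            {0, 0, 1, 1},   // user5
        };
        // Supervised labels: the known group of each training user.
        int[] y = {0, 0, 1, 1, 1};

        double[] w = new double[4];
        double b = 0, lr = 0.5;
        // Plain stochastic gradient descent on logistic loss.
        for (int epoch = 0; epoch < 200; epoch++) {
            for (int i = 0; i < x.length; i++) {
                double z = b;
                for (int j = 0; j < 4; j++) z += w[j] * x[i][j];
                double p = 1.0 / (1.0 + Math.exp(-z));
                double g = p - y[i];          // gradient of log-loss w.r.t. z
                for (int j = 0; j < 4; j++) w[j] -= lr * g * x[i][j];
                b -= lr * g;
            }
        }

        // Predict: sign of the linear score picks the group.
        for (int i = 0; i < x.length; i++) {
            double z = b;
            for (int j = 0; j < 4; j++) z += w[j] * x[i][j];
            int group = z > 0 ? 1 : 0;
            System.out.println("user" + (i + 1) + " -> group " + (group + 1));
        }
    }
}
```

With labels available, the trained weights reproduce the grouping in the example below (group 1 = {user1, user2}, group 2 = {user3, user4, user5}); at prediction time the same scoring loop is run on unlabeled users.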

Re: need help on mahout

Posted by qiaoresearcher <qi...@gmail.com>.
You are right, I have labels for each user; I just need some example code to get the job running quickly.

The example code should follow the steps I described: read the gzip file, construct the web-page set, form the input vector for each user, then call some classification/clustering algorithm.

Does Mahout have an example like this?
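A minimal JDK-only sketch of those preprocessing steps (the log lines, field layout, and page names below are made up for illustration; in practice you would wrap a FileInputStream over the real .gz file in the same GZIPInputStream, and the resulting vectors would then be fed to a Mahout trainer):

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

public class ClickVectors {
    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the gzipped click log; one activity per line.
        String log = String.join("\n",
            "user1\ttime11\tpage_a",
            "user2\ttime21\tpage_b",
            "user1\ttime12\tpage_b",
            "user3\ttime31\tpage_c");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(log.getBytes("UTF-8"));
        }

        // One pass builds both the page set (insertion-ordered index)
        // and the per-user set of visited page indexes.
        Map<String, Integer> pageIndex = new LinkedHashMap<>();
        Map<String, Set<Integer>> userPages = new LinkedHashMap<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(buf.toByteArray())),
                "UTF-8"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] f = line.split("\\s+");   // user, time, page
                if (f.length < 3) continue;        // real logs need a sturdier parser
                int idx = pageIndex.computeIfAbsent(f[2], p -> pageIndex.size());
                userPages.computeIfAbsent(f[0], u -> new TreeSet<>()).add(idx);
            }
        }

        // Emit a dense 0/1 vector per user over the collected page set.
        int dim = pageIndex.size();
        for (Map.Entry<String, Set<Integer>> e : userPages.entrySet()) {
            int[] v = new int[dim];
            for (int i : e.getValue()) v[i] = 1;
            System.out.println(e.getKey() + " " + Arrays.toString(v));
        }
    }
}
```

For real data the dense int[] would be replaced by a sparse Mahout vector (org.apache.mahout.math.RandomAccessSparseVector), since the page set is large and each user visits only a few pages.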

On Fri, Nov 9, 2012 at 10:49 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> if it is supervised classification, your input should contain the groups.
> te idea is that you extend knowledge hidden in  a smaller perhaps expert
> labeled dataset to the rest of the universe.
> On Nov 9, 2012 8:43 AM, "qiaoresearcher" <qi...@gmail.com> wrote:
>
> > It is a supervised classification problem.
> >
> > For example, a very simple case:
> > say, overall we collect 4 pages from the data set:  { web_page 1
>  web_page
> > 2 web_page 3 web_page 4  }
> > then users may have input vectors like:
> > user1 [1 1  0  0]
> > user2 [1 1  0  0]
> > user3 [0 0  1  1]
> > user4 [0 0  1  1]
> > user5 [0 0  1  1]
> >   ...       ....
> >
> > then whatever classification algorithm mahout has should return
> > classification results as
> > group 1 { user1, user2}
> > group 2 { user3, user4, user5 }
> >
> >
> >
> > On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> > > First: what question are you trying to answer from this data? You are
> > > trying to classify users into what, for what purpose?
> > >
> > >
> > > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <
> qiaoresearcher@gmail.com
> > > >wrote:
> > >
> > > > Hi All,
> > > >
> > > > Assume the data is stored in a gzip file which includes many text
> > files.
> > > > Within each text file, each line represents an activity of a user,
> for
> > > > example, a click on a web page.
> > > > the text file will look like:
> > > >
> > > >
> > >
> >
> ----------------------------------------------------------------------------------
> > > > user 1   time11  visiting_web_page11
> > > > user 2   time21  visiting_web_page21
> > > > user 1   time12  visiting_web_page12
> > > > user 1   time13  visiting_web_page13
> > > > user 2   time22  visiting_web_page22
> > > > user 3   time31  visiting_web_page31
> > > > user 1   time14  visiting_web_page14
> > > >  ...           ....                ..........
> > > >
> > > > I am thinking to first construct a web page set like
> > > > { visiting_web_page11, visiting_web_page12, visiting_web_page31,
> > .......
> > > }
> > > >
> > > > then for each user, we form a vector [ 1  0 0  1 0  0  .....    ]
> >  where
> > > > '1' means the user visited that page and 0 means he did not
> > > > then use mahout to classify the users based on the vectors
> > > >
> > > > does mahout has example like this? if not, what kind of java code we
> > need
> > > > to write to process this task?
> > > >
> > > > thanks for any suggestions in advance !
> > > >
> > >
> >
>

Re: need help on mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
If it is supervised classification, your input should contain the groups. The idea is that you extend the knowledge hidden in a smaller, perhaps expert-labeled, dataset to the rest of the universe.
On Nov 9, 2012 8:43 AM, "qiaoresearcher" <qi...@gmail.com> wrote:

> It is a supervised classification problem.
>
> For example, a very simple case:
> say, overall we collect 4 pages from the data set:  { web_page 1  web_page
> 2 web_page 3 web_page 4  }
> then users may have input vectors like:
> user1 [1 1  0  0]
> user2 [1 1  0  0]
> user3 [0 0  1  1]
> user4 [0 0  1  1]
> user5 [0 0  1  1]
>   ...       ....
>
> then whatever classification algorithm mahout has should return
> classification results as
> group 1 { user1, user2}
> group 2 { user3, user4, user5 }
>
>
>
> On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > First: what question are you trying to answer from this data? You are
> > trying to classify users into what, for what purpose?
> >
> >
> > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <qiaoresearcher@gmail.com
> > >wrote:
> >
> > > Hi All,
> > >
> > > Assume the data is stored in a gzip file which includes many text
> files.
> > > Within each text file, each line represents an activity of a user, for
> > > example, a click on a web page.
> > > the text file will look like:
> > >
> > >
> >
> ----------------------------------------------------------------------------------
> > > user 1   time11  visiting_web_page11
> > > user 2   time21  visiting_web_page21
> > > user 1   time12  visiting_web_page12
> > > user 1   time13  visiting_web_page13
> > > user 2   time22  visiting_web_page22
> > > user 3   time31  visiting_web_page31
> > > user 1   time14  visiting_web_page14
> > >  ...           ....                ..........
> > >
> > > I am thinking to first construct a web page set like
> > > { visiting_web_page11, visiting_web_page12, visiting_web_page31,
> .......
> > }
> > >
> > > then for each user, we form a vector [ 1  0 0  1 0  0  .....    ]
>  where
> > > '1' means the user visited that page and 0 means he did not
> > > then use mahout to classify the users based on the vectors
> > >
> > > does mahout has example like this? if not, what kind of java code we
> need
> > > to write to process this task?
> > >
> > > thanks for any suggestions in advance !
> > >
> >
>

Re: need help on mahout

Posted by qiaoresearcher <qi...@gmail.com>.
It is a supervised classification problem.

For example, a very simple case: say we collect 4 pages overall from the data set, { web_page1  web_page2  web_page3  web_page4 }.
Then users may have input vectors like:
user1 [1 1 0 0]
user2 [1 1 0 0]
user3 [0 0 1 1]
user4 [0 0 1 1]
user5 [0 0 1 1]
  ...      ....

Then whatever classification algorithm Mahout has should return classification results such as:
group 1 { user1, user2 }
group 2 { user3, user4, user5 }




Re: need help on mahout

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
My guess is he probably means clustering users, based on behaviour, into virtual behavioural groups.

Re: need help on mahout

Posted by Sean Owen <sr...@gmail.com>.
First: what question are you trying to answer from this data? You are
trying to classify users into what, for what purpose?

