Posted to user@mahout.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/04/09 00:44:49 UTC

Clustering question

I am new to Mahout and am just going through some tutorials. One of the
requirements I am working on involves extracting customer reviews from
Amazon for a given item and then clustering them into similar topics to
see what users in general have been talking about. For example, reviews
with a rating > 3 might talk about good user experience or quality, while
reviews with a rating <= 3 might talk about price, bugs, etc.

Could anyone suggest what would be the best way to approach this?

Re: Clustering question

Posted by Mohit Anchlia <mo...@gmail.com>.

Yes, I don't want many small files, as that is detrimental to Hadoop's
performance. Examples like Reuters have many files. I'll look at this in
more detail over the next few days and ask more questions about how to do
it if I can't figure it out :)

On Tue, Apr 10, 2012 at 3:37 AM, Sean Owen <sr...@gmail.com> wrote:

> No, you might have each review as a record in a much larger SequenceFile.
> I don't know whether the current implementation reads input formatted
> like this, but if it doesn't, it can't be hard to modify it to do so.
> You would not want many many small files on HDFS.
>
> On Mon, Apr 9, 2012 at 9:54 AM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> > Thanks! One thing I am not clear is if each customer review which might be
> > just few bytes need to be in separate files? I am planning to utilize
> > hadoop so I was thinking of using SequenceFiles to dump all the raw
> > comments in a sequenceFile but I am not sure if it would mess up any TFDF
> > or anything like that. Could someone help me clarify?
> >
>

Re: Clustering question

Posted by Sean Owen <sr...@gmail.com>.

No, you might have each review as a record in a much larger SequenceFile.
I don't know whether the current implementation reads input formatted
like this, but if it doesn't, it can't be hard to modify it to do so.
You would not want many many small files on HDFS.
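
To illustrate the point about records, here is a minimal Python sketch (not Mahout code). An in-memory list of (key, text) pairs stands in for a SequenceFile, and the names `records` and `document_frequency` are made up for illustration. Assuming the vectorizer treats each record as one document, document frequency is counted per record, so packing many reviews into one container should not by itself change the TF-IDF statistics.

```python
from collections import Counter

# Hypothetical stand-in for a SequenceFile: one (key, value) record per review.
records = [
    ("review-1", "great price and great quality"),
    ("review-2", "buggy software poor quality"),
    ("review-3", "good price"),
]

def document_frequency(records):
    """Count, for each term, how many records (documents) contain it."""
    df = Counter()
    for _key, text in records:
        # set() so a term repeated inside one review is counted only once
        df.update(set(text.lower().split()))
    return df

df = document_frequency(records)
print(df["quality"], df["price"], df["great"])
```

Whether a given Mahout job accepts input laid out this way is a separate question, as noted above.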

On Mon, Apr 9, 2012 at 9:54 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> Thanks! One thing I am not clear is if each customer review which might be
> just few bytes need to be in separate files? I am planning to utilize
> hadoop so I was thinking of using SequenceFiles to dump all the raw
> comments in a sequenceFile but I am not sure if it would mess up any TFDF
> or anything like that. Could someone help me clarify?
>

Re: Clustering question

Posted by Lance Norskog <go...@gmail.com>.

If you want to have multiple small files in one file on HDFS, then you
need to pack them somehow.

You should run one of the clustering examples and examine each file along
the path. They all have a custom class that parses the input (email,
Reuters article, email archive) into pieces. Usually the first pass
reads raw files in some format (email, Reuters article, Wikipedia)
and writes them as key/value pairs in a SequenceFile, with, say, the
file name as the key and the text as the value. This is usually fast.

The second pass turns these into term vectors. This creates a global
list of all of the words in all documents; this is the slow pass.
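
The two passes described above can be sketched roughly as follows. This is plain Python for illustration only, not the Mahout classes; the function names `first_pass` and `second_pass` and the sample file names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def first_pass(raw_docs):
    """Pass 1: turn raw inputs into (key, text) pairs, e.g. file name -> text."""
    return [(name, text.strip()) for name, text in raw_docs.items()]

def second_pass(pairs):
    """Pass 2: build a global dictionary, then one term-count vector per document."""
    # The global word list needs a look at every document, which is why this pass is slow.
    vocab = sorted({word for _key, text in pairs for word in text.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {}
    for key, text in pairs:
        counts = Counter(text.lower().split())
        vec = [0] * len(vocab)
        for word, n in counts.items():
            vec[index[word]] = n
        vectors[key] = vec
    return vocab, vectors

# Made-up sample input: file name -> raw text
raw_docs = {"r1.txt": "good price good quality\n", "r2.txt": "buggy and slow\n"}
pairs = first_pass(raw_docs)
vocab, vectors = second_pass(pairs)
print(vocab)
print(vectors)
```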

On Mon, Apr 9, 2012 at 7:54 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> Thanks! One thing I am not clear is if each customer review which might be
> just few bytes need to be in separate files? I am planning to utilize
> hadoop so I was thinking of using SequenceFiles to dump all the raw
> comments in a sequenceFile but I am not sure if it would mess up any TFDF
> or anything like that. Could someone help me clarify?
>
> On Sun, Apr 8, 2012 at 11:00 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> I think you would cluster these like any other text document. The
>> centroid of each cluster tells you where the cluster is in
>> feature-space, but the features are just words. If you find the
>> features (words) with largest absolute value, those ought to be the
>> words that appear frequently in the cluster and are what they are
>> "about".
>>
>> As to ratings, not sure how you might want to involve them?
>>
>> On Sun, Apr 8, 2012 at 11:44 PM, Mohit Anchlia <mo...@gmail.com>
>> wrote:
>> > I am new to Mahout and just going through some tutorials. One of the
>> > requirements I am working on involves extracting customer reviews from
>> > Amazon for a given item and then clustering those into similar topics to
>> > see what in general users have been talking about. So for eg: Rating of >
>> > 3 could say user experience is good, quality or rating of <=3 could say
>> > price, buggy etc.
>> >
>> > Could anyone suggest what would be the best way to approach this?
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Clustering question

Posted by Mohit Anchlia <mo...@gmail.com>.

Thanks! One thing I am not clear on is whether each customer review, which
might be just a few bytes, needs to be in a separate file. I am planning to
use Hadoop, so I was thinking of dumping all the raw comments into a
SequenceFile, but I am not sure whether that would mess up TF-IDF
or anything like that. Could someone help me clarify?

On Sun, Apr 8, 2012 at 11:00 PM, Sean Owen <sr...@gmail.com> wrote:

> I think you would cluster these like any other text document. The
> centroid of each cluster tells you where the cluster is in
> feature-space, but the features are just words. If you find the
> features (words) with largest absolute value, those ought to be the
> words that appear frequently in the cluster and are what they are
> "about".
>
> As to ratings, not sure how you might want to involve them?
>
> On Sun, Apr 8, 2012 at 11:44 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> > I am new to Mahout and just going through some tutorials. One of the
> > requirements I am working on involves extracting customer reviews from
> > Amazon for a given item and then clustering those into similar topics to
> > see what in general users have been talking about. So for eg: Rating of >
> > 3 could say user experience is good, quality or rating of <=3 could say
> > price, buggy etc.
> >
> > Could anyone suggest what would be the best way to approach this?
>

Re: Clustering question

Posted by Sean Owen <sr...@gmail.com>.

I think you would cluster these like any other text document. The
centroid of each cluster tells you where the cluster is in
feature-space, but the features are just words. If you find the
features (words) with the largest absolute value, those ought to be the
words that appear frequently in the cluster and indicate what the
cluster is "about".

As to ratings, not sure how you might want to involve them?
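
As a rough illustration of reading a centroid this way, here is a small Python sketch. It is not Mahout's API: the function name `top_terms`, the vocabulary, and the weights are all made up, and it assumes the centroid is a dense list aligned with the vocabulary.

```python
def top_terms(centroid, vocab, k=3):
    """Return the k (term, weight) pairs with the largest absolute weight."""
    ranked = sorted(zip(vocab, centroid), key=lambda tw: abs(tw[1]), reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and centroid weights for one cluster
vocab = ["battery", "buggy", "price", "quality", "screen"]
centroid = [0.1, 0.0, 0.9, 0.6, 0.2]

print(top_terms(centroid, vocab, k=2))
```

With these made-up weights, the cluster would be summarized by "price" and "quality".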

On Sun, Apr 8, 2012 at 11:44 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> I am new to Mahout and just going through some tutorials. One of the
> requirements I am working on involves extracting customer reviews from
> Amazon for a given item and then clustering those into similar topics to
> see what in general users have been talking about. So for eg: Rating of >
> 3 could say user experience is good, quality or rating of <=3 could say
> price, buggy etc.
>
> Could anyone suggest what would be the best way to approach this?