Posted to user@mahout.apache.org by Andreas Spalas <an...@gmail.com> on 2014/07/21 21:43:29 UTC

Help for Grouping similar items together. Clustering/Classification problem?

Hi,

These days I am exploring the Mahout framework in order to solve a specific
problem.
I have a CSV file with 1.5 million items (products) in the following
format:
id, product_title
1, Apple IPHONE 5
2, Samsung Galaxy S5
etc.

I would like to group the products by category, so that, for example, both
products above would fall under a "Technology" or "Smartphones" category.

I would like to know whether this is possible in Mahout, and whether
clustering or classification is the better way to approach such a problem.

I am currently studying "Mahout in Action", and I saw that for clustering
I have to transform my data into a SequenceFile and find a way to vectorize
it. I don't really see whether this is applicable to my case at the moment.
For classification, I understand that I have to provide some training data
with a target variable (in my case "Category") in order to train a model.
I can extend my dataset with this extra information, but is it going to
work?
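
To make the vectorization step concrete, here is a minimal sketch in plain
Python (not Mahout) of turning product titles into TF-IDF vectors and
comparing them. It is an illustration under assumptions, not what Mahout
does internally: the tokenizer, the smoothed IDF formula, and the function
names are choices made for this example, and the third title is a
hypothetical row added for comparison.

```python
import math
import re
from collections import Counter


def tokenize(title):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", title.lower())


def tfidf_vectors(titles):
    """Return one sparse TF-IDF vector (dict: token -> weight) per title."""
    docs = [Counter(tokenize(t)) for t in titles]
    n = len(docs)
    # Document frequency: in how many titles each token appears.
    df = Counter(tok for doc in docs for tok in doc)
    # Smoothed IDF (an assumption of this sketch), always positive.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The first two titles come from the CSV sample above; the third is hypothetical.
titles = ["Apple IPHONE 5", "Samsung Galaxy S5", "Apple iPhone 5 case"]
vectors = tfidf_vectors(titles)

same_product_family = cosine(vectors[0], vectors[2])
different_brands = cosine(vectors[0], vectors[1])
```

With only the title as input, the two smartphones from the sample share no
tokens, so their similarity is zero even though they belong to the same
category. A clustering or classification step would work on vectors like
these.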

Can anyone give me some advice on how to handle this particular problem? Is
it even possible to do it in Mahout? Any direction would be appreciated!

Thanks a lot in advance.

Re: Help for Grouping similar items together. Clustering/Classification problem?

Posted by parnab kumar <pa...@gmail.com>.
Hi,

     For only 1.5 million items, I feel that employing Mahout is not required.
Any other machine learning software, such as Weka, should be enough.
Next, I see you have only two attributes: id and name. Unless these are
supplemented with additional features, no classifier or clustering algorithm
will work.

Try using some external knowledge, such as Wikipedia or even the first 10
result snippets from a standard search engine, to gather additional
attributes in the form of a description.
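
As a rough illustration of this enrichment idea, the Python sketch below
merges a title with extra description text before any vectorization. It
assumes the descriptions have already been collected somehow (the fetching
is not shown); `enrich`, `snippets_by_id`, and the snippet texts are
hypothetical placeholders written for this example, not real search
results.

```python
import re


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def enrich(title, snippets):
    """Combine a title with extra description text into one token list."""
    tokens = tokenize(title)
    for snippet in snippets:
        tokens.extend(tokenize(snippet))
    return tokens


# Hypothetical, hand-written descriptions keyed by product id.
snippets_by_id = {
    1: ["smartphone made by Apple"],
    2: ["Android smartphone made by Samsung"],
}

# Tokens shared by the two sample products, before and after enrichment.
shared_before = set(tokenize("Apple IPHONE 5")) & set(tokenize("Samsung Galaxy S5"))
shared_after = set(enrich("Apple IPHONE 5", snippets_by_id[1])) & set(
    enrich("Samsung Galaxy S5", snippets_by_id[2])
)
```

The titles alone have no tokens in common, while the enriched versions
share words such as "smartphone", which gives a clustering or
classification algorithm something to work with.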

Thanks,
Parnab



