
Posted to user@mahout.apache.org by David Kaplan <da...@gmail.com> on 2015/08/12 14:49:29 UTC

Mahout Clustering Help Please

Hi all,
I hope someone can point me in the right direction.
I'm very new to Mahout.
Here's my scenario:

I have written a system that uses Scrapy to collect classifieds items from
multiple websites (phones, cars, antiques and many more). All the items are
then ingested into Solr, roughly 3 million entries.
This is the backend for my search engine.

I want to extract meaningful information so that I can accurately
calculate realistic average prices and similar statistics. I need guidance,
and perhaps examples, on accurate outlier detection, categorization and so
on. I'm an extreme beginner in machine learning, so I need to know whether
that's what I should be using.

Part of my challenge is the broad range of items and categories, with
different levels of skew in the data. For example, finding outliers in
"iphone" results is hard when many of those results are cheap iPhone
accessories.
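
To make the outlier question concrete, here is a minimal sketch rather than anything proposed in this thread. It assumes each item already carries a usable category label, groups prices by category, and flags prices whose modified z-score (based on the median absolute deviation) exceeds a threshold. The function name `flag_outliers`, the threshold of 3.5, the category names and the sample prices are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import median


def flag_outliers(items, threshold=3.5):
    """Return the (title, category, price) rows whose price is far from
    the median price of their category.

    The score is the modified z-score: 0.6745 * |price - median| / MAD.
    The threshold of 3.5 is a common rule of thumb, not a tuned value.
    """
    by_category = defaultdict(list)
    for title, category, price in items:
        by_category[category].append((title, price))

    outliers = []
    for category, rows in by_category.items():
        prices = [price for _, price in rows]
        med = median(prices)
        mad = median(abs(price - med) for price in prices)
        if mad == 0:
            continue  # at least half the prices equal the median; skip the group
        for title, price in rows:
            score = 0.6745 * abs(price - med) / mad
            if score > threshold:
                outliers.append((title, category, price))
    return outliers


# Hypothetical rows: one accessory has been filed under the phone category.
sample_items = [
    ("iPhone 5C 16GB", "Phones", 4500.0),
    ("iPhone 5C 16GB Green", "Phones", 4589.0),
    ("iPhone 5C 32GB", "Phones", 4800.0),
    ("iPhone 5C 16GB Refurbished", "Phones", 4300.0),
    ("iPhone 5C case", "Phones", 150.0),
    ("Power bank for iPhone 5", "Accessories", 529.0),
    ("Leather case for iPhone 5", "Accessories", 480.0),
    ("Screen protector", "Accessories", 500.0),
]
print(flag_outliers(sample_items))
```

On these sample rows only the 150.0 "case" filed under phones is flagged. Median-based statistics are used here because a plain mean and standard deviation are themselves pulled around by the very outliers being searched for.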

Basically, it seems I need to cluster or classify, but I'm not sure exactly
how to go about it, because I already have the categories for 500K of the
entries, for example the category "Cell Phones & Accessories - Accessories".

And then there is actually connecting Mahout to Solr...

Many thanks!
David

Re: Mahout Clustering Help Please

Posted by Pat Ferrel <pa...@occamsmachete.com>.

What exactly is your goal? Taking those names and de-duping them to see which are talking about the same thing?

Here is an example of weird data. A refurbished iPhone 5C for 4589.0?????

"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories -
Cell Phones & Smartphones",4589.0

Honestly I wouldn’t know where to begin.



Re: Mahout Clustering Help Please

Posted by David Kaplan <da...@gmail.com>.

Hi Pat,
Thanks for the reply.
Yes, I think there are a lot of problems.

There are 4 data sources, and each uses different categorisation
conventions, some one-level, some multilevel, so I basically picked one
source that accounts for about 500K of the 3 million entries.

I do have the prices. The data is separated in Solr, so I can extract
title, category and price.

My confusion is in working out classifier vs clustering. As I understand
it, clustering is for when you don't have labelled data, but I do have
labels for some of it. Am I looking for a hybrid classifier/clustering
approach (k-means), or is an SVM alone sufficient?

To make matters more complicated, there are categories and then
sub-categories, e.g. "Cell Phones & Accessories" => "Accessories".
I don't know if that means I have to train separate models?
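
One way to frame the question, as a hedged sketch and not a recommendation made anywhere in this thread: since a labelled subset exists, a supervised classifier can be trained on title text and then applied to the unlabelled entries. The example below is a small pure-Python multinomial Naive Bayes (swapped in here for brevity instead of an SVM or any Mahout algorithm). It treats the full "Category - Sub-category" string as one flat label, which is one possible way to avoid training a separate model per level. The class name `NaiveBayesTitleClassifier` is hypothetical, and the four training rows are the sample rows from this message.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


class NaiveBayesTitleClassifier:
    """Multinomial Naive Bayes with add-one smoothing over title tokens."""

    def __init__(self):
        self.label_counts = Counter()             # training titles per label
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.vocabulary = set()

    def fit(self, rows):
        """Train on an iterable of (title, label) pairs."""
        for title, label in rows:
            self.label_counts[label] += 1
            for token in tokenize(title):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, title):
        """Return the label with the highest log-probability for the title."""
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            label_total = sum(self.token_counts[label].values())
            for token in tokenize(title):
                seen = self.token_counts[label][token]
                score += math.log((seen + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Flat labels: category and sub-category joined into one string.
TRAINING_ROWS = [
    ("2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White",
     "Cell Phones & Accessories - Accessories"),
    ("Apple iPhone 5C 16GB (Green) - Refurbished",
     "Cell Phones & Accessories - Cell Phones & Smartphones"),
    ("Orange PLA 3D Printer Filament 1.75mm 1kg",
     "Computers & Networking - Printers"),
    ("Canon LV-7292 S Projector",
     "Electronics - TVs & Projectors"),
]

classifier = NaiveBayesTitleClassifier().fit(TRAINING_ROWS)
print(classifier.predict("Leather case for iPhone 6"))
```

Four rows are far too few to say anything about accuracy; the point is only the shape of the approach. With hundreds of thousands of labelled titles, holding out part of the labelled set to measure accuracy would be the natural next step.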

Example data snippet:

"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 -
White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories -
Cell Phones & Smartphones",4589.0
"Orange PLA 3D Printer Filament 1.75mm 1kg","Computers & Networking -
Printers",375.0
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0

Perhaps I'm overcomplicating the problem...

Many thanks,
David




Re: Mahout Clustering Help Please

Posted by Pat Ferrel <pa...@occamsmachete.com>.

You have a lot of problems to solve here.

1) Can you find the price? Is it in text or in structured data? If it's in text, you have an NLP problem. You can use a regex for the price.
2) How do you associate a price with the object? There may be several money amounts in the ad. Some do this with proximity, i.e. how many characters away from the item id the price is.
3) Can you find the item id? Some say iphone, some iPhone, some "iPhone 6 plus", and it gets worse for things with lots of numbers and modifiers in the name, like "super whiz bang deLux 5G XLS". The right level of de-duplication vs fragmentation is a deep and hard problem.
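
Points 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not a complete extractor: it assumes prices are written with a leading "$", optional thousands separators and optional cents, and it picks the amount whose character offset is closest to the first mention of the item name. `PRICE_RE`, `find_prices`, `nearest_price` and the sample ad text are all hypothetical.

```python
import re

# Assumed price format: "$" then digits, with optional thousands
# separators and optional cents. Real ads will need more patterns.
PRICE_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")


def find_prices(text):
    """Return (amount, start_offset) for every money amount in the text."""
    prices = []
    for match in PRICE_RE.finditer(text):
        whole = match.group(1).replace(",", "")
        cents = match.group(2) or ""
        prices.append((float(whole + cents), match.start()))
    return prices


def nearest_price(text, item_name):
    """Return the amount closest (in characters) to the first mention of
    item_name, or None if the name or any price is missing."""
    position = text.lower().find(item_name.lower())
    prices = find_prices(text)
    if position < 0 or not prices:
        return None
    amount, _ = min(prices, key=lambda price: abs(price[1] - position))
    return amount


ad = "iPhone 5C 16GB for $450, includes a case worth $15.99. Shipping $5."
print(nearest_price(ad, "iphone"))
print(nearest_price(ad, "case"))
```

Character distance is a crude proxy. It ignores sentence boundaries and whether the price comes before or after the name, so it is best treated as a baseline to compare better heuristics against.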

How much of this is an NLP problem, and what structure does the data have? Unless I misunderstand your problem, extracting the data will be the hardest part, and it is not something Mahout can help with.
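
The de-duplication vs fragmentation trade-off in point 3 can be illustrated with a small sketch. It assumes a deliberately naive item key: lowercase the name, split it into alphanumeric tokens, and keep only the first few tokens. The helper names `normalize` and `group_by_prefix` and the `depth` parameter are hypothetical.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def group_by_prefix(names, depth):
    """Group names by their first `depth` normalized tokens.

    A small depth merges many listings (aggressive de-duplication);
    a large depth keeps them apart (more fragmentation).
    """
    groups = defaultdict(list)
    for name in names:
        key = " ".join(normalize(name)[:depth])
        groups[key].append(name)
    return dict(groups)


names = ["iphone", "iPhone", "iPhone 6 plus", "IPHONE 6 Plus"]
print(group_by_prefix(names, depth=1))  # one group: everything is "iphone"
print(group_by_prefix(names, depth=3))  # two groups: "iphone" and "iphone 6 plus"
```

Neither depth is right in general, which is the difficulty described above: the same key that correctly merges "iPhone" with "iphone" will also merge a phone with its accessories if it is too coarse.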
