Posted to user@mahout.apache.org by Vikas Parashar <vi...@fosteringlinux.com> on 2014/01/13 12:47:23 UTC

categorization on crawl data

Hi folks,

Has anyone tried to do categorization on crawled data? If yes, how can I
achieve this? Which algorithm will help me?

-- 
Thanks & Regards:-
Vikas Parashar
Sr. Linux administrator Cum Developer
Mobile: +91 958 208 8852
Email: vikas.parashar@fosteringlinux.com

Re: categorization on crawl data

Posted by Bertrand Dechoux <de...@gmail.com>.
It might seem like you would want to do entity extraction, but that's not
trivial and Mahout won't directly help in that area.

Bertrand


Re: categorization on crawl data

Posted by Константин Слисенко <ks...@gmail.com>.
Hi Vikas!

As I understand it, you need to improve the indexing of your data for exact
search. You can look at classification algorithms (
http://mahout.apache.org/users/classification/classifyingyourdata.html).
You can define topics and train a classifier. The classifier will then split
your data into several groups, and you can index each group.
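
As a concrete (hypothetical) illustration of "define topics and train a
classifier": a rough sketch using Mahout's SGD classifier
(OnlineLogisticRegression) with hashed word features. The three topics, the
feature count, and the toy training texts are all invented here; in practice
you would feed it your vectorized crawl data.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class DeviceTopicClassifier {

    // Hypothetical topics, matching the firewall/switch/router example.
    private static final String[] TOPICS = {"firewall", "switch", "router"};
    private static final int FEATURES = 10000;

    public static void main(String[] args) {
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("words");
        OnlineLogisticRegression model =
                new OnlineLogisticRegression(TOPICS.length, FEATURES, new L1());

        // Toy labelled examples: {topic index, page text}.
        String[][] examples = {
                {"0", "stateful packet inspection firewall nat rules"},
                {"1", "24 port gigabit managed switch vlan trunking"},
                {"2", "bgp ospf static routes router wan interface"}
        };
        for (String[] example : examples) {
            model.train(Integer.parseInt(example[0]), encode(encoder, example[1]));
        }

        // Classify an unseen page and pick the most likely topic.
        Vector scores = model.classifyFull(encode(encoder, "48 port poe switch stack"));
        System.out.println(TOPICS[scores.maxValueIndex()]);
    }

    // Hash each token of the text into a fixed-size sparse feature vector.
    private static Vector encode(StaticWordValueEncoder encoder, String text) {
        Vector v = new RandomAccessSparseVector(FEATURES);
        for (String token : text.split("\\s+")) {
            encoder.addToVector(token, v);
        }
        return v;
    }
}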

But I'm not sure that Mahout is good for exact search, if you want to find
switches with exactly 24 ports. I think it would be better to index your data
another way (using Hadoop): extract the exact parameters of every switch in
the network, then import this data into a database with indexes. You can also
integrate Lucene and store the database IDs in the index.
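
For the "exactly 24 ports" case, a minimal sketch of that idea against the
Lucene 3.x API: store the database ID and the extracted attributes as exact
fields, then query with a numeric range. The field names and the ID are
invented for illustration.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class ExactAttributeIndexing {

    // Build a document carrying the database ID plus exact attributes.
    static Document switchDocument(String dbId, String type, int ports) {
        Document doc = new Document();
        doc.add(new Field("db_id", dbId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("type", type, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new NumericField("ports", Field.Store.YES, true).setIntValue(ports));
        return doc;
    }

    // "All devices with exactly 24 ports" as a precise range query.
    static Query exactly24Ports() {
        return NumericRangeQuery.newIntRange("ports", 24, 24, true, true);
    }
}

Combined with a TermQuery on the "type" field inside a BooleanQuery, this
answers "all switches with 24 ports" precisely, with no machine learning
involved.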



Re: categorization on crawl data

Posted by Vikas Parashar <vi...@fosteringlinux.com>.
Thanks buddy,

Actually, I have crawled data in my system; let's say "data related to the
firewall, switch, and router domains". With Nutch I have crawled all the data
into my segments (according to depth).

Luckily, I have Lucene/Solr on top of HDFS. With the help of this, I can
easily search my data (like a Google search).

Now my pain point begins when my client needs attribute-based search, e.g.
"get all switches that have 24 ports". For that type of search, I supposed
Mahout would come into action. I don't know whether I am going in the right
direction or not, but what I am thinking is that I should be able to train my
machine in such a way that it gives us the desired results. We all know that
the machine will take some time to give us positive results, because every
machine needs some time to become an expert. But that is fine with me.

But again, for that we need to categorize my crawled data into at least 3
parts (according to the above example).

Any guess how I can achieve this?



-- 
Thanks & Regards:-
Vikas Parashar
Sr. Linux administrator Cum Developer
Mobile: +91 958 208 8852
Email: vikas.parashar@fosteringlinux.com

Re: categorization on crawl data

Posted by Константин Слисенко <ks...@gmail.com>.
Hi Vikas!

For categorizing any data you can try clustering algorithms; see this link:
http://mahout.apache.org/users/clustering/clusteringyourdata.html.
The simplest algorithm, in my opinion, is k-means:
http://mahout.apache.org/users/clustering/k-means-clustering.html.

Which data do you have?

If it is text data, you should first extract the text, then do some
preprocessing for better quality: remove stop words (is, are, the, ...),
switch words to lower case, and apply the Porter stem filter (
http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html).
This can be done with a custom Lucene Analyzer.
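
A minimal sketch of such an analyzer against the Lucene 3.0.3 API linked
above (the class name is made up): tokenize, lower-case, drop English stop
words, then Porter-stem.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class CrawlTextAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize, then chain the cleanup filters in order.
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        stream = new PorterStemFilter(stream);
        return stream;
    }
}
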
The preprocessed text should be in Mahout's SequenceFile format. Then you need
to vectorize the data (
http://mahout.apache.org/users/basics/creating-vectors-from-text.html).
Then run a clustering algorithm and interpret the results; the sketch below
chains these steps together.
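
A hedged sketch of those steps, calling the standard Mahout driver jobs from
Java the same way the mahout command line does. All the paths here are
invented, k=3 matches the firewall/switch/router example, and a 2014-era
Mahout (0.8/0.9) is assumed on the classpath.

import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.text.SequenceFilesFromDirectory;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class ClusterCrawledPages {
    public static void main(String[] args) throws Exception {
        // 1. Plain-text pages -> SequenceFiles (same job as "mahout seqdirectory").
        SequenceFilesFromDirectory.main(new String[] {
                "-i", "crawl/text", "-o", "crawl/seq"});

        // 2. SequenceFiles -> TF-IDF vectors ("mahout seq2sparse"),
        //    plugging in the custom analyzer sketched above.
        SparseVectorsFromSequenceFiles.main(new String[] {
                "-i", "crawl/seq", "-o", "crawl/vectors",
                "-a", "CrawlTextAnalyzer", "-wt", "tfidf", "-ow"});

        // 3. k-means ("mahout kmeans") with k=3, cosine distance, at most
        //    20 iterations, classifying the points at the end (-cl).
        KMeansDriver.main(new String[] {
                "-i", "crawl/vectors/tfidf-vectors",
                "-c", "crawl/kmeans-seeds",   // k random seed clusters land here
                "-o", "crawl/kmeans",
                "-k", "3",
                "-dm", "org.apache.mahout.common.distance.CosineDistanceMeasure",
                "-x", "20", "-ow", "-cl"});
    }
}

Afterwards you can inspect the top terms per cluster with the ClusterDumper
utility (the "mahout clusterdump" job) to decide which cluster is firewalls,
which is switches, and which is routers.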

You can look at my experiments here
https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout

