
Posted to user@mahout.apache.org by Grzegorz Ewald <gr...@gmail.com> on 2014/09/05 16:02:15 UTC

Unstructured text classification

Hi Mahout users!

I'm starting to work on unstructured text classification, namely
classification of web pages of unknown structure. The number of possible
categories would probably be quite small (for now I believe that three
categories are enough).

Later I would add another level of data processing based on document
structure (existence of meta tags and so on).

Do you have any experience or suggestions? Somehow I don't feel like using
a bag-of-words approach (but maybe I am wrong?).

-- 
Regards,
Grzegorz


Re: Unstructured text classification

Posted by Ted Dunning <te...@gmail.com>.

Jonathan has it right here.  Don't forget to include some structural
information as well as the text content itself.  By structural information,
I mean things like the domain, how much boilerplate there is, possibly some
sort of classifier that runs on the boilerplate, and so on.  You may also
want to include aggregate features like the number of different kinds of
markup elements, the CSS tags that are used, the amount of javascript on the
page, domains linked to and more.

These additional features make classification of certain kinds of text
almost trivial.  Comment spam, for instance, will often have links to
particular domains.  A compromised web page might have inline javascript
that no other page normally has.
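As a rough illustration of the aggregate features described above, here is a
minimal Python sketch that uses only the standard library. It is not Mahout
code. The class name StructureCounter, the function structural_features and
the feature names are made up for this example, and "CSS tags" is read here
as the class names found in class attributes, which is only one possible
interpretation.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class StructureCounter(HTMLParser):
    """Collects simple aggregate features from one HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()   # how often each element name appears
        self.css_classes = set()      # distinct class names seen in class="..."
        self.linked_domains = set()   # hosts referenced by <a href="...">
        self.inline_js_chars = 0      # characters of inline <script> text
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        attrs = dict(attrs)
        if tag == "script":
            self._in_script = True
        # Attribute values can be None (e.g. <a href>), hence the "or".
        for name in (attrs.get("class") or "").split():
            self.css_classes.add(name)
        if tag == "a":
            host = urlparse(attrs.get("href") or "").netloc
            if host:
                self.linked_domains.add(host.lower())

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.inline_js_chars += len(data)


def structural_features(html):
    """Return a dict of structural features for one page."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    return {
        "distinct_tags": len(parser.tag_counts),
        "link_count": parser.tag_counts["a"],
        "css_class_count": len(parser.css_classes),
        "inline_js_chars": parser.inline_js_chars,
        "linked_domains": sorted(parser.linked_domains),
    }
```

The linked domains can be fed to a classifier as symbols directly, while the
counts are plain numbers and need some further encoding before a symbolic
classifier can use them.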

Numerical features make using Naive Bayes a bit problematic.  You might be
able to mitigate this by binning each feature into deciles and treating
each decile as a different symbol.
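One way the decile binning could look, as a minimal Python sketch with the
standard library only. The function names decile_edges and to_symbol and the
token format "name_dK" are assumptions made for this example, and the cut
points use a simple rank-based rule rather than any particular library's
quantile definition.

```python
from bisect import bisect_right


def decile_edges(values):
    """Return nine cut points that split the training values into ten bins."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one training value")
    # Simple rank-based cut points; other quantile definitions would also work.
    return [ordered[min(n - 1, (n * k) // 10)] for k in range(1, 10)]


def to_symbol(name, value, edges):
    """Map a number to a token such as 'js_chars_d7' (decile index 0..9)."""
    return f"{name}_d{bisect_right(edges, value)}"
```

The edges would be computed once per feature on the training set and then
reused unchanged for new pages, so that the same number always maps to the
same symbol. The resulting tokens can be appended to the page's word tokens.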

On Fri, Sep 5, 2014 at 10:24 AM, Jonathan Cooper-Ellis <jc...@ziftr.com>
wrote:

> Hi Grzegorz,
>
> You can use the boilerpipe library to extract main content from your sites
> (Tika supports this) and pass that to a NB classifier and probably get
> pretty good results.
>
> Hope that helps!
>
> On Friday, September 5, 2014, Grzegorz Ewald <gr...@gmail.com>
> wrote:
>
> > Hi Mahout users!
> >
> > I'm starting to deal with unstructured text classification, namely
> > classification of web pages of unknown structure. The number of possible
> > categories would probably be quite small (as for now I believe that three
> > categories are enough).
> >
> > Later I would add another level of data processing based on document
> > structure (existence of meta tags and so on).
> >
> > Do you have any experience or suggestions? Somehow I don't feel like using
> > bag of words approach (but maybe i am wrong?).
> >
> > --
> > Regards,
> > Grzegorz
> >
> > <mailto:grzegorz.ewald@gmail.com>
> >
>

Re: Unstructured text classification

Posted by Jonathan Cooper-Ellis <jc...@ziftr.com>.

Hi Grzegorz,

You can use the boilerpipe library to extract the main content from your
pages (Tika supports this), pass that to a Naive Bayes classifier, and
probably get pretty good results.
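To show what the classification step amounts to, here is a minimal
multinomial Naive Bayes sketch in Python with add-one smoothing. It is
neither Mahout's implementation nor boilerpipe. It assumes the main content
has already been extracted to plain text, and the names tokenize and
NaiveBayes are made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (bag of words)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes over tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing constant
        self.doc_counts = Counter()               # documents seen per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()

    def train(self, label, text):
        tokens = tokenize(text)
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

With three categories, usage would be one train(label, text) call per
labelled page followed by classify(text) on new pages.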

Hope that helps!

On Friday, September 5, 2014, Grzegorz Ewald <gr...@gmail.com>
wrote:

> Hi Mahout users!
>
> I'm starting to deal with unstructured text classification, namely
> classification of web pages of unknown structure. The number of possible
> categories would probably be quite small (as for now I believe that three
> categories are enough).
>
> Later I would add another level of data processing based on document
> structure (existence of meta tags and so on).
>
> Do you have any experience or suggestions? Somehow I don't feel like using
> bag of words approach (but maybe i am wrong?).
>
> --
> Regards,
> Grzegorz
>
> <mailto:grzegorz.ewald@gmail.com>
>