Posted to user@nutch.apache.org by chad savage <cs...@activeathletemedia.com> on 2006/12/05 07:01:42 UTC

classifying content

Hello All,

I'm doing some research on how to classify documents into pre-defined 
categories.
Some methods I have come across are ontologies, topic maps, URL/site-based 
rules, and simple keyword analysis.
I'm leaning towards topic maps and ontologies as the strongest and best 
documented, in theory and in practice.
Does the group have any recommendations on where to start?
Software packages to help develop the OWL/RDF files? Protégé?
Any consultancies out there that handle this process?
Downfalls to using these?
And finally, how to integrate them into Nutch/Lucene.
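
For what it's worth, the simple keyword analysis option is quick to 
prototype. A minimal sketch in plain Java, assuming pages are already 
reduced to plain text; the category names and keyword lists are invented 
for illustration:

```java
import java.util.*;

// Toy illustration of keyword-based classification: each category owns
// a keyword list, and a page goes into the category whose keywords
// occur most often in its text.
public class KeywordClassifier {
    private final Map<String, List<String>> categoryKeywords = new HashMap<>();

    public void addCategory(String category, List<String> keywords) {
        categoryKeywords.put(category, keywords);
    }

    // Returns the best-matching category, or "unknown" if no keyword hits.
    public String classify(String text) {
        String lower = text.toLowerCase();
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, List<String>> e : categoryKeywords.entrySet()) {
            int score = 0;
            for (String kw : e.getValue()) {
                int idx = lower.indexOf(kw);
                while (idx != -1) {            // count every occurrence
                    score++;
                    idx = lower.indexOf(kw, idx + kw.length());
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        KeywordClassifier c = new KeywordClassifier();
        c.addCategory("sports", Arrays.asList("football", "score", "team"));
        c.addCategory("finance", Arrays.asList("stock", "market", "shares"));
        System.out.println(c.classify("The team scored late in the football match")); // prints "sports"
    }
}
```

The obvious downfall is that keyword lists are brittle and need manual 
upkeep, which is where the statistical approaches mentioned below come in.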

Thanks in advance,
Chad


Re: classifying content

Posted by Dennis Kubes <nu...@dragonflymc.com>.
You may also want to look at Bayesian statistics, support vector 
machines, and other machine learning algorithms.
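
To make the Bayesian option concrete, here is a minimal naive Bayes 
sketch in plain Java, assuming whitespace tokenization, Laplace 
smoothing, and a uniform class prior; the two training sentences are toy 
data only, and a real system would train on many labelled pages:

```java
import java.util.*;

// Naive Bayes in miniature: count words per category during training,
// then score new text by summed log-probabilities. Laplace smoothing
// (the +1 below) keeps unseen words from zeroing a category out.
public class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    public void train(String category, String text) {
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String w : text.toLowerCase().split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : wordCounts.keySet()) {
            double score = 0.0;                   // uniform prior assumed
            int total = totalWords.get(category);
            int vocab = vocabulary.size();
            for (String w : text.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(category).getOrDefault(w, 0);
                score += Math.log((count + 1.0) / (total + vocab)); // Laplace smoothing
            }
            if (score > bestScore) { bestScore = score; best = category; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", "the team won the game with a late goal");
        nb.train("finance", "the stock market fell as shares dropped");
        System.out.println(nb.classify("the stock fell sharply")); // prints "finance"
    }
}
```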

Dennis


Re: classifying content

Posted by kauu <ba...@gmail.com>.
This is exactly what I have been wondering as well.



-- 
www.babatu.com

Re: classifying content

Posted by kauu <ba...@gmail.com>.
Hi,
I'm not very familiar with Nutch, but I think Nutch could classify the 
pages into different buckets after each fetch. Then you could search 
and display them by category.



-- 
www.babatu.com

Re: classifying content

Posted by Shay Lawless <se...@gmail.com>.
Hi Chad,

I use a focused web crawler from the MetaCombine project
(http://www.metacombine.org/) to classify the content retrieved during a
web crawl. It combines the Heritrix web crawler from the Internet Archive
with the Rainbow text classifier from CMU. I'm not sure whether you can
use it to crawl for multiple categories at once; that might take a bit of
alteration, as I use it to crawl for one specific topic or category at a
time. Have a look at the web site, and if it sounds like something that
might work for you, give me a shout back.

Thanks

Shay


Re: classifying content

Posted by Eelco Lempsink <le...@paragin.nl>.
Hey Chad,

On 7-dec-2006, at 18:52, chad savage wrote:
> We would like to organize information into a hierarchical category
> system.  It's all general web content (HTML from the web).
> Yes, there are a number of references to varying techniques on the
> net (scientific papers, theoretical, practical, mind-boggling). My
> problem is determining the best method, and of course implementing
> it with my limited Nutch/Java abilities.  I may have to outsource
> most of this.
> Not to mention the many formats for ontologies: OWL, RDF, DAML, and
> some others I am sure I'm missing.

Unfortunately, letting a machine organize information is not a  
trivial problem, so if you have no previous experience with it, you  
might easily be overwhelmed by all the theories and file formats.   
Fortunately, though, you might not need to use such a technique at  
all, because often there are other ways to classify text, for example  
simple metadata:

> We would like to be able to crawl the web and categorize the pages
> into buckets.  We currently have a number of separate configs for
> Nutch, all crawling different subsets of our web sites, with multiple
> indexes as a start for being able to search separate categories.
> The goal is to have one crawl that can scan all of the websites,
> index the content into these predetermined buckets, and keep them in
> one master index.

When you say "our websites" do you mean websites you maintain?  In  
that case it could be trivial, depending on your content management  
system, to add some extra information to each page about which  
'bucket' it should be placed in.

Otherwise, since you apparently have some configurations separating 
the different categories, it might be possible to translate that into a 
plugin which hooks in as an HtmlParseFilter and attaches some metadata 
to your parsed content.

On the wiki you'll find an example using similar techniques.  See  
http://wiki.apache.org/nutch/WritingPluginExample
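
To make that concrete, the heart of such a plugin is just a lookup from 
host (or URL pattern) to bucket. Here is a sketch of that mapping logic 
in plain Java, kept separate from the Nutch extension-point wiring that 
the wiki example covers; the host names and bucket names are invented:

```java
import java.util.*;

// Sketch of the site-to-bucket table an HtmlParseFilter plugin could
// consult while parsing. The resulting bucket would be attached as
// metadata on the parsed content so it can later be indexed as a field.
public class BucketMapper {
    private final Map<String, String> hostToBucket = new HashMap<>();

    public BucketMapper() {
        // Invented example configuration; in practice this could be
        // loaded from the same configs that currently drive the crawls.
        hostToBucket.put("news.example.com", "news");
        hostToBucket.put("sports.example.com", "sports");
    }

    // Returns the configured bucket for the page's host, or a default.
    public String bucketFor(String url) {
        try {
            String host = new java.net.URI(url).getHost();
            return hostToBucket.getOrDefault(host, "general");
        } catch (java.net.URISyntaxException e) {
            return "general";
        }
    }

    public static void main(String[] args) {
        BucketMapper m = new BucketMapper();
        System.out.println(m.bucketFor("http://sports.example.com/page1.html")); // prints "sports"
    }
}
```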

-- 
Regards,

Eelco Lempsink


Re: classifying content

Posted by chad savage <cs...@activeathletemedia.com>.
Hey Eelco,

We would like to organize information into a hierarchical category 
system.  It's all general web content (HTML from the web).
Yes, there are a number of references to varying techniques on the net 
(scientific papers, theoretical, practical, mind-boggling). My problem 
is determining the best method, and of course implementing it with my 
limited Nutch/Java abilities.  I may have to outsource most of this.
Not to mention the many formats for ontologies: OWL, RDF, DAML, and 
some others I am sure I'm missing.

We would like to be able to crawl the web and categorize the pages into 
buckets.  We currently have a number of separate configs for Nutch, all 
crawling different subsets of our web sites, with multiple indexes as a 
start for being able to search separate categories.  The goal is to have 
one crawl that can scan all of the websites, index the content into 
these predetermined buckets, and keep them in one master index.

If there are any groups out there that handle this, I would be more 
than happy to discuss techniques and possible outsourcing.
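
A toy sketch of that master-index idea in plain Java: every document 
carries a category field and searches filter on it. In Lucene this would 
likely be an extra untokenized field rather than the linear scan below; 
all URLs, categories, and text here are invented for illustration:

```java
import java.util.*;

// One master index, many buckets: each document stores its category
// alongside its content, and a search can be restricted to one bucket
// or run across all of them.
public class MasterIndex {
    static class Doc {
        final String url, category, text;
        Doc(String url, String category, String text) {
            this.url = url; this.category = category; this.text = text;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    public void add(String url, String category, String text) {
        docs.add(new Doc(url, category, text));
    }

    // Search within one bucket, or across all buckets when category is null.
    public List<String> search(String term, String category) {
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean inBucket = category == null || d.category.equals(category);
            if (inBucket && d.text.toLowerCase().contains(term.toLowerCase())) {
                hits.add(d.url);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        MasterIndex idx = new MasterIndex();
        idx.add("http://a.example.com/1", "news", "election results announced");
        idx.add("http://b.example.com/2", "sports", "club captain election held");
        System.out.println(idx.search("election", "news")); // prints [http://a.example.com/1]
    }
}
```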

Chad



Re: classifying content

Posted by Eelco Lempsink <le...@paragin.nl>.
On 5-dec-2006, at 7:01, chad savage wrote:
> I'm doing some research on how to classify documents into pre-defined 
> categories.

On basis of what?  The most appropriate technique depends on the type 
of documents and the type of categories. For instance, are the 
documents structured (e.g. all XML using a common definition) or 
unstructured data (HTML from the web)?  Are you looking to place 
documents in a large hierarchical category system, or is it a simple 
binary decision (e.g. 'spam' or 'not spam')?

If you know what you want and what it's called, it should be 
relatively easy to find information and scientific papers about it.

-- 
Regards,

Eelco Lempsink