Posted to user@nutch.apache.org by chad savage <cs...@activeathletemedia.com> on 2006/12/05 07:01:42 UTC
classifying content
Hello All,
I'm doing some research on how to classify documents into pre-defined
categories.
Some methods I have come across are ontologies, topic maps, URL/site-based
classification, and simple keyword analysis.
I'm leaning towards topic maps and ontologies as the strongest and best
documented, in theory and in practice.
Does the group have any recommendations on where to start?
Software packages to help develop the OWL/RDF files? Protégé?
Any consultancies out there that handle this process?
Downfalls to using these?
And finally, how to integrate them into Nutch/Lucene.
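For context, the OWL/RDF files in question are just XML; a minimal hierarchical category ontology (the class names here are purely illustrative) can be as small as:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://example.com/categories#">
  <!-- Top-level category -->
  <owl:Class rdf:ID="Sports"/>
  <!-- Subcategory, nested via rdfs:subClassOf -->
  <owl:Class rdf:ID="Basketball">
    <rdfs:subClassOf rdf:resource="#Sports"/>
  </owl:Class>
</rdf:RDF>
```

Protégé is one editor that reads and writes files of this form.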
Thanks in advance,
Chad
Re: classifying content
Posted by Dennis Kubes <nu...@dragonflymc.com>.
You may also want to look at Bayesian classifiers, support vector
machines, and other machine learning algorithms.
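As a toy sketch of the Bayesian approach (not tuned for real use; the categories and training snippets below are invented), a multinomial naive Bayes classifier with Laplace smoothing fits in a few dozen lines:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy multinomial naive Bayes text classifier with Laplace smoothing.
// Uniform class priors are assumed for brevity.
public class ToyBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    // Count word occurrences per category.
    public void train(String category, String text) {
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocab.add(w);
        }
    }

    // Return the category with the highest smoothed log-likelihood.
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : wordCounts.keySet()) {
            double score = 0.0;
            for (String w : text.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                int count = wordCounts.get(cat).getOrDefault(w, 0);
                score += Math.log((count + 1.0)
                        / (totalWords.get(cat) + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = cat; }
        }
        return best;
    }

    public static void main(String[] args) {
        ToyBayes nb = new ToyBayes();
        nb.train("sports", "game team score win match player");
        nb.train("finance", "stock market price trade earnings");
        System.out.println(nb.classify("the team won the match")); // prints "sports"
    }
}
```

A production setup would learn priors from data and train on far more text, but the scoring loop is the whole idea.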
Dennis
kauu wrote:
> This is exactly what I wonder, too.
> [...]
Re: classifying content
Posted by kauu <ba...@gmail.com>.
This is exactly what I wonder, too.
On 12/5/06, chad savage <cs...@activeathletemedia.com> wrote:
> [...]
--
www.babatu.com
Re: classifying content
Posted by kauu <ba...@gmail.com>.
hi:
I'm not very familiar with Nutch, but I think Nutch could classify the
pages into different places after each fetch; then you can search them
and display them.
On 12/8/06, Shay Lawless <se...@gmail.com> wrote:
> [...]
--
www.babatu.com
Re: classifying content
Posted by Shay Lawless <se...@gmail.com>.
Hi Chad,
I use a focused web crawler from the MetaCombine project
(http://www.metacombine.org/) to classify the content retrieved during a
web crawl. It combines the Heritrix web crawler from the Internet Archive
with the Rainbow text classifier from CMU. I'm not sure if you can use it
to crawl for multiple categories at once; it might take a bit of
alteration, as I use it to crawl for one specific topic or category at a
time. Have a look at the web site. If it sounds like something that might
work for you, give me a shout back.
Thanks
Shay
On 07/12/06, chad savage <cs...@activeathletemedia.com> wrote:
> [...]
Re: classifying content
Posted by Eelco Lempsink <le...@paragin.nl>.
Hey Chad,
On 7-dec-2006, at 18:52, chad savage wrote:
> We would like to organize information into a hierarchical category
> system. It's all general web content (HTML from the web).
> Yes, there are a number of references to varying techniques on the
> net (scientific papers, theoretical, practical, mind boggling). My
> problem is determining the best method, and of course implementing
> it with my limited Nutch/Java abilities. I may have to outsource
> most of this.
> Not to mention the many formats for ontologies: OWL, RDF, DAML, and
> some others I am sure I'm missing.
Unfortunately, letting a machine organize information is not a
trivial problem, so if you have no previous experience with it, you
might easily be overwhelmed by all the theories and file formats.
Fortunately, though, you might not need to use such a technique at
all, because often there are other ways to classify text, for example
simple metadata:
> We would like to be able to crawl the web and categorize the pages
> into buckets. We currently have a number of separate configs for
> Nutch, all crawling different subsets of our web sites with multiple
> indexes as a start for being able to search separate categories.
> The goal is to have one crawl that can scan all of the websites and
> index the content into these predetermined buckets and keep them in
> one master index.
When you say "our websites" do you mean websites you maintain? In
that case it could be trivial, depending on your content management
system, to add some extra information to each page about which
'bucket' it should be placed in.
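Concretely, that extra information could be as little as a meta tag the CMS emits on every page (the tag name and value here are only an illustration):

```html
<meta name="category" content="sports/basketball">
```

A parse or indexing step can then copy that value straight into the index field used for filtering.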
Otherwise, since you apparently have some configurations separating
the different categories, it might be possible to translate that to a
plugin which hooks in as an HtmlParseFilter and attaches some metadata
to your parsed content.
On the wiki you'll find an example using similar techniques. See
http://wiki.apache.org/nutch/WritingPluginExample
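The bucket-assignment logic such a plugin would delegate to can be sketched as a simple first-match-wins URL rule table. The patterns and bucket names below are hypothetical; the actual plugin would implement Nutch's HtmlParseFilter extension point as in the wiki example and attach the result as parse metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Maps URLs to category "buckets" by regex rules, in insertion order.
public class UrlBucketMapper {
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    public void addRule(String regex, String bucket) {
        rules.put(Pattern.compile(regex), bucket);
    }

    // First matching rule wins; null means "uncategorized".
    public String bucketFor(String url) {
        for (Map.Entry<Pattern, String> e : rules.entrySet()) {
            if (e.getKey().matcher(url).find()) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UrlBucketMapper mapper = new UrlBucketMapper();
        mapper.addRule("^https?://sports\\.example\\.com/", "sports");
        mapper.addRule("/finance/", "finance");
        System.out.println(mapper.bucketFor("http://sports.example.com/scores")); // prints "sports"
    }
}
```

The existing per-category crawl configs would translate almost mechanically into rules like these, letting one crawl feed one master index.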
--
Regards,
Eelco Lempsink
Re: classifying content
Posted by chad savage <cs...@activeathletemedia.com>.
Hey Eelco,
We would like to organize information into a hierarchical category
system. It's all general web content (HTML from the web).
Yes, there are a number of references to varying techniques on the net
(scientific papers, theoretical, practical, mind boggling). My problem
is determining the best method, and of course implementing it with my
limited Nutch/Java abilities. I may have to outsource most of this.
Not to mention the many formats for ontologies: OWL, RDF, DAML, and
some others I am sure I'm missing.
We would like to be able to crawl the web and categorize the pages into
buckets. We currently have a number of separate configs for Nutch, all
crawling different subsets of our web sites with multiple indexes as a
start for being able to search separate categories. The goal is to have
one crawl that can scan all of the websites and index the content into
these predetermined buckets and keep them in one master index.
If there are any groups out there that handle this I would be more than
happy to discuss techniques and possible outsourcing.
Chad
Eelco Lempsink wrote:
> [...]
Re: classifying content
Posted by Eelco Lempsink <le...@paragin.nl>.
On 5-dec-2006, at 7:01, chad savage wrote:
> I'm doing some research on how to classify documents into
> pre-defined categories.
On the basis of...? The most appropriate technique depends on
the type of documents and the type of categories. For instance, are
the documents structured (e.g. all XML using a common definition) or
unstructured data (HTML from the web)? Are you looking to place
documents in a large hierarchical category system, or is it a simple
binary decision (e.g. 'spam' or 'not spam')?
If you know what you want and what it's called, it should be relatively
easy to find information and scientific papers about it.
--
Regards,
Eelco Lempsink