Posted to user@nutch.apache.org by Byron Miller <by...@yahoo.com> on 2006/02/07 17:45:23 UTC

Categorizing content

Is there an easy way to categorize content on parse?
I have an extensive list of adult terms and I would
like to update meta info on the page if a
combination of the terms exists, to flag it as adult
content so I can exclude it from the search results
unless people opt in.

I'd also like to look at Bayesian filtering during the
parse phase to look for hidden text (text the same
color as the background) and spammy pages, or for sites
with 3+ AdSense ads or other particulars, and score
them appropriately.

Has anyone experimented with this?

Re: Categorizing content

Posted by Byron Miller <by...@yahoo.com>.

--- Andrzej Bialecki <ab...@getopt.org> wrote:

> There is - if it's an HTML page, add an HTMLFilter.
> If it's another type of content, I'm afraid there is
> no general post-processing hook to add plugins.

I'll check that out! Thanks for pointing me to this.


> > I'd also like to look at Bayesian filtering during the
> > parse phase to look for hidden text (text the same
> > color as the background) and spammy pages, or for sites
> > with 3+ AdSense ads or other particulars, and score
> > them appropriately.
> >
> > Has anyone experimented with this?
> >
> 
> Again, HTMLFilters are the place to add such things.
>
> Now, an interesting thing would be to keep this
> categorization around, so that next time you can
> skip/demote pages that are known to be spam. This is
> the purpose of the "CrawlDatum metadata" patch...
> coming soon, I hope :-)

That's what I'm waiting for (rather excited) :)

I'm looking to flag adult-related pages initially, but
also to use the existing filtering process to look for
patterns to flag as spam.
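
For the Bayesian part, here is a minimal sketch of a two-class naive Bayes text scorer in plain Java. It is not Nutch code: the class name NaiveBayesSpamScorer, its methods, and the add-one smoothing are all assumptions made for illustration, and a real setup would need a properly labeled training set.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical two-class naive Bayes scorer with add-one smoothing.
class NaiveBayesSpamScorer {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamTokens, hamTokens, spamDocs, hamDocs;

    // Lower-cases the text and splits on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
    }

    // Adds one labeled document to the token counts.
    void train(String text, boolean spam) {
        Map<String, Integer> counts = spam ? spamCounts : hamCounts;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            vocabulary.add(token);
            if (spam) spamTokens++; else hamTokens++;
        }
        if (spam) spamDocs++; else hamDocs++;
    }

    // Log-odds of spam versus non-spam; positive values lean towards spam.
    double logOdds(String text) {
        int v = vocabulary.size();
        // Smoothed ratio of the class priors.
        double score = Math.log((spamDocs + 1.0) / (hamDocs + 1.0));
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            // The extra 1 in each denominator reserves mass for unseen tokens.
            double pSpam = (spamCounts.getOrDefault(token, 0) + 1.0) / (spamTokens + v + 1.0);
            double pHam = (hamCounts.getOrDefault(token, 0) + 1.0) / (hamTokens + v + 1.0);
            score += Math.log(pSpam / pHam);
        }
        return score;
    }
}
```

The log-odds value could feed into a page score or be compared against a cutoff; where that cutoff sits would have to be tuned on real data.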

-byron

Re: Categorizing content

Posted by Andrzej Bialecki <ab...@getopt.org>.
Byron Miller wrote:
> Is there an easy way to categorize content on parse?
> I have an extensive list of adult terms and I would
> like to update meta info on the page if a
> combination of the terms exists, to flag it as adult
> content so I can exclude it from the search results
> unless people opt in.
>   

There is - if it's an HTML page, add an HTMLFilter. If it's another type of
content, I'm afraid there is no general post-processing hook to add plugins.
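
As a rough illustration of the term-matching logic such a filter might call, here is a minimal sketch in plain Java. It does not use any Nutch classes: TermListClassifier, the "x-adult" metadata key, and the threshold for what counts as a "combination" are hypothetical, and the Map merely stands in for the page's metadata. In Nutch this logic would sit behind the HTML parse filter extension point (HtmlParseFilter) rather than in a standalone class.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch: flags a page when enough
// distinct terms from a watch list appear in its extracted text.
class TermListClassifier {
    private final Set<String> terms;       // lower-cased watch list
    private final int minDistinctMatches;  // assumed threshold for a "combination"

    TermListClassifier(Set<String> terms, int minDistinctMatches) {
        this.terms = new HashSet<>();
        for (String t : terms) {
            this.terms.add(t.toLowerCase(Locale.ROOT));
        }
        this.minDistinctMatches = minDistinctMatches;
    }

    // Counts how many distinct watch-list terms occur as whole words.
    int distinctMatches(String text) {
        Set<String> seen = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (terms.contains(token)) {
                seen.add(token);
            }
        }
        return seen.size();
    }

    // Writes the flag into a plain map standing in for page metadata.
    void flag(String text, Map<String, String> metadata) {
        boolean adult = distinctMatches(text) >= minDistinctMatches;
        metadata.put("x-adult", Boolean.toString(adult));
    }
}
```

Matching whole tokens rather than substrings avoids flagging innocent words that merely contain a listed term, at the cost of missing multi-word phrases.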

> I'd also like to look at Bayesian filtering during the
> parse phase to look for hidden text (text the same
> color as the background) and spammy pages, or for sites
> with 3+ AdSense ads or other particulars, and score
> them appropriately.
>
> Has anyone experimented with this?
>   

Again, HTMLFilters are the place to add such things.

Now, an interesting thing would be to keep this categorization around,
so that next time you can skip/demote pages that are known to be spam.
This is the purpose of the "CrawlDatum metadata" patch... coming soon, I
hope :-)
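
The two page-level checks mentioned in the question (hidden text and 3+ ads) could be sketched roughly as below. This is plain Java with regular expressions over raw HTML, which is only a crude stand-in for walking the parsed DOM inside a filter. Everything here is an assumption for illustration: the SpamHeuristics class, the "pagead/show_ads.js" marker used to count ad units, the restriction to bgcolor/font color attributes (CSS is ignored), and the score weights.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical heuristics, not a Nutch API.
class SpamHeuristics {
    // Assumed marker: one script include per ad unit.
    private static final String AD_MARKER = "pagead/show_ads.js";

    private static final Pattern BODY_BG = Pattern.compile(
        "<body[^>]*\\bbgcolor\\s*=\\s*[\"']?(#?\\w+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern FONT_COLOR = Pattern.compile(
        "<font[^>]*\\bcolor\\s*=\\s*[\"']?(#?\\w+)", Pattern.CASE_INSENSITIVE);

    // Counts non-overlapping occurrences of the ad marker.
    static int countAdUnits(String html) {
        int count = 0;
        int from = 0;
        while ((from = html.indexOf(AD_MARKER, from)) >= 0) {
            count++;
            from += AD_MARKER.length();
        }
        return count;
    }

    // True when some <font color=...> value equals the <body bgcolor=...> value.
    static boolean hasHiddenText(String html) {
        Matcher bg = BODY_BG.matcher(html);
        if (!bg.find()) {
            return false;
        }
        String background = bg.group(1).toLowerCase(Locale.ROOT);
        Matcher font = FONT_COLOR.matcher(html);
        while (font.find()) {
            if (font.group(1).toLowerCase(Locale.ROOT).equals(background)) {
                return true;
            }
        }
        return false;
    }

    // Simple additive score; the weights and the 3-ad cutoff are illustrative.
    static int spamScore(String html) {
        int score = 0;
        if (hasHiddenText(html)) score += 2;
        if (countAdUnits(html) >= 3) score += 1;
        return score;
    }
}
```

A score like this is the kind of value that would be worth keeping with the page record once metadata can be stored, so known spam can be skipped or demoted on later crawls.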

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Categorizing content

Posted by 盖世豪侠 <ma...@gmail.com>.
Hi
I think you would have to take the parsed content from the parse-html plugin
and filter the text against your terms.
That will of course involve modifying or adding some code.


2006/2/8, Byron Miller <by...@yahoo.com>:
>
> Is there an easy way to categorize content on parse?
> I have an extensive list of adult terms and I would
> like to update meta info on the page if a
> combination of the terms exists, to flag it as adult
> content so I can exclude it from the search results
> unless people opt in.
>
> I'd also like to look at Bayesian filtering during the
> parse phase to look for hidden text (text the same
> color as the background) and spammy pages, or for sites
> with 3+ AdSense ads or other particulars, and score
> them appropriately.
>
> Has anyone experimented with this?
>



--
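"The Final Combat" (盖世豪侠) drew rave reviews and kept TVB's ratings high; TVB was pleased, yet still did not give him major roles. Stephen Chow was never one to stay in a small pond: once his comedic talent had shown itself, he would not accept being sidelined, so he moved into film to show his brilliance on the big screen. Having gained a thoroughbred and then lost it, TVB naturally had endless regrets.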

Re: Categorizing content

Posted by 盖世豪侠 <ma...@gmail.com>.
It sounds OK.
But I think if you don't check it online (at parse time), you may end up
with a lot of unwanted content in your index.


2006/2/8, Jack Tang <hi...@gmail.com>:
>
> Hi Byron
>
> I wonder whether it would be faster to do this offline. I mean you can
> re-visit the webdb and linkdb and then generate the index.
>
> /Jack
>
> On 2/8/06, Byron Miller <by...@yahoo.com> wrote:
> > Is there an easy way to categorize content on parse?
> > I have an extensive list of adult terms and I would
> > like to update meta info on the page if a
> > combination of the terms exists, to flag it as adult
> > content so I can exclude it from the search results
> > unless people opt in.
> >
> > I'd also like to look at Bayesian filtering during the
> > parse phase to look for hidden text (text the same
> > color as the background) and spammy pages, or for sites
> > with 3+ AdSense ads or other particulars, and score
> > them appropriately.
> >
> > Has anyone experimented with this?
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>




Re: Categorizing content

Posted by Jack Tang <hi...@gmail.com>.
Hi Byron

I wonder whether it would be faster to do this offline. I mean you can
re-visit the webdb and linkdb and then generate the index.
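
A minimal sketch of what such an offline pass might look like, in plain Java. It does not read a real Nutch webdb: the Map of URL to extracted text is only a stand-in for already-fetched pages, and OfflineFilterPass, urlsToExclude, and the predicate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical offline pass over pages that were fetched earlier.
class OfflineFilterPass {
    // Returns the URLs whose text the predicate rejects,
    // in the iteration order of the input map.
    static List<String> urlsToExclude(Map<String, String> pages,
                                      Predicate<String> isUnwanted) {
        List<String> excluded = new ArrayList<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            if (isUnwanted.test(page.getValue())) {
                excluded.add(page.getKey());
            }
        }
        return excluded;
    }
}
```

The resulting list could then be left out when the index is generated, which keeps the classification step off the fetch/parse path.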

/Jack

On 2/8/06, Byron Miller <by...@yahoo.com> wrote:
> Is there an easy way to categorize content on parse?
> I have an extensive list of adult terms and I would
> like to update meta info on the page if a
> combination of the terms exists, to flag it as adult
> content so I can exclude it from the search results
> unless people opt in.
>
> I'd also like to look at Bayesian filtering during the
> parse phase to look for hidden text (text the same
> color as the background) and spammy pages, or for sites
> with 3+ AdSense ads or other particulars, and score
> them appropriately.
>
> Has anyone experimented with this?
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars