Posted to dev@nutch.apache.org by JAB <ge...@baesystems.com> on 2012/06/27 20:59:45 UTC

Nutch Author, Publication, and Religion Detection

I've written some simple Nutch plug-ins to detect a document's Author,
Publication Date, and whether it's an article about Religion (including which
religion it's talking about). I was wondering if anyone knows of any open
source plug-ins that cover these tasks, so that I don't have to
rely on my own custom solutions. I'm new to Nutch/GATE
development.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: Nutch Author, Publication, and Religion Detection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Sounds great, glad you got something.
Lewis

On Thu, Jul 5, 2012 at 6:04 PM, JAB <ge...@baesystems.com> wrote:
>
> Thanks for the advice. Currently I'm looking at a simplified GATE Gazetteer
> approach. My customer isn't clear on what he wants and the requirements I
> came up with may be overkill.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Re-Nutch-Author-Publication-and-Religion-Detection-tp3991965p3993282.html
> Sent from the Nutch - User mailing list archive at Nabble.com.



-- 
Lewis

Re: Nutch Author, Publication, and Religion Detection

Posted by JAB <ge...@baesystems.com>.
Thanks for the advice. Currently I'm looking at a simplified GATE Gazetteer
approach. My customer isn't clear on what he wants and the requirements I
came up with may be overkill. 

--
View this message in context: http://lucene.472066.n3.nabble.com/Re-Nutch-Author-Publication-and-Religion-Detection-tp3991965p3993282.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch Author, Publication, and Religion Detection

Posted by Julien Nioche <li...@gmail.com>.
Guys,

Assuming that you have a training dataset, Machine Learning would be a good
way of classifying a document. Apache Mahout could be used, or an API like
https://github.com/DigitalPebble/TextClassification would work as well. We
used it in Nutch for some projects and it worked fine (but I am biased as
it is ours). GATE is a useful resource as well, but probably a bit of
overkill for a task like this unless you want to use it to generate more
intelligent features.
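
As a rough illustration of the machine-learning route, here is a minimal
sketch of a multinomial Naive Bayes text classifier with add-one smoothing.
It is a stand-in technique chosen for brevity, not the API of Mahout or
TextClassification; the class name NaiveBayesSketch, the labels and the
training sentences are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: multinomial Naive Bayes with add-one smoothing. */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lower-cases the text and splits it on anything that is not a letter. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
    }

    /** Adds one labelled training document to the model. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the most probable label, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Log prior: share of training documents carrying this label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one smoothing so unseen words do not zero out the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

With a realistic labelled corpus, train() would be called once per document
and classify() on each fetched page; the predicted label could then be stored
as a field at indexing time.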

HTH

Julien

On 29 June 2012 13:18, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Jim,
>
> The thing about this problem is that I assume religion information
> would not be included in the document metadata, so it's not a
> simple case of using one of the existing implementations, e.g.
> parse-metatags, to grab this data...
>
> I think it would be something more along the lines of text processing
> after fetching (or at runtime). Documents could then be classified
> accordingly. I recently spoke with someone who undertook such an
> exercise, though not using Nutch I must admit. If you are familiar with
> GATE [0] you could create some kind of plugin to identify this kind of
> information, but I am not familiar with the process of retaining it for
> indexing as I have not thoroughly tried the concept.
>
> hth
>
> Lewis
>
> [0] http://gate.ac.uk/
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch Author, Publication, and Religion Detection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Jim,

The thing about this problem is that I assume religion information
would not be included in the document metadata, so it's not a
simple case of using one of the existing implementations, e.g.
parse-metatags, to grab this data...

I think it would be something more along the lines of text processing
after fetching (or at runtime). Documents could then be classified
accordingly. I recently spoke with someone who undertook such an
exercise, though not using Nutch I must admit. If you are familiar with
GATE [0] you could create some kind of plugin to identify this kind of
information, but I am not familiar with the process of retaining it for
indexing as I have not thoroughly tried the concept.

hth

Lewis

[0] http://gate.ac.uk/

On Fri, Jun 29, 2012 at 12:36 PM, Jim Chandler <ja...@gmail.com> wrote:
> Lewis,
>
> I work with George.  What we are trying to do is identify whether or not a
> document is religious in nature and, if so, what that religion is.
> We are aware this could be a difficult undertaking, and we would prefer not
> to reinvent the wheel.
>
> HTH
> Jim
>



-- 
Lewis

Re: Nutch Author, Publication, and Religion Detection

Posted by Jim Chandler <ja...@gmail.com>.
Lewis,

I work with George.  What we are trying to do is identify whether or not a
document is religious in nature and, if so, what that religion is.
We are aware this could be a difficult undertaking, and we would prefer not
to reinvent the wheel.

HTH
Jim

On Thu, Jun 28, 2012 at 5:16 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi George,
>
> Where is each of these fields present within the document?
>
> Lewis
>

Re: Nutch Author, Publication, and Religion Detection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi George,

Where is each of these fields present within the document?

Lewis

> On Wed, Jun 27, 2012 at 7:59 PM, JAB <ge...@baesystems.com> wrote:
>> I've written some simple Nutch plug-ins to detect a document's Author,
>> Publication Date, and whether it's an article about Religion (including which
>> religion it's talking about). I was wondering if anyone knows of any open
>> source plug-ins that cover these tasks, so that I don't have to
>> rely on my own custom solutions. I'm new to Nutch/GATE
>> development.
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
>
>
> --
> Lewis



-- 
Lewis

Re: Nutch Author, Publication, and Religion Detection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi JAB,

This conversation got shifted over to user@; please note the mailing address.

http://www.mail-archive.com/user%40nutch.apache.org/msg06787.html

Thanks

Lewis

On Tue, Jul 3, 2012 at 2:33 PM, JAB <ge...@baesystems.com> wrote:
> You mention "Julien's comments and grabbing the code that he's made available
> for similar tasks". Please provide a link to where his comments and code are.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992754.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.



-- 
Lewis

Re: Nutch Author, Publication, and Religion Detection

Posted by JAB <ge...@baesystems.com>.
You mention "Julien's comments and grabbing the code that he's made available
for similar tasks". Please provide a link to where his comments and code are.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992754.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: Nutch Author, Publication, and Religion Detection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
OK so please let us know how you get on.

Although you seem to have a clear idea about how you're going to
progress with the issue, I would seriously consider taking on board
Julien's comments and grabbing the code that he's made available for
similar tasks.

All the best

Lewis

On Fri, Jun 29, 2012 at 7:19 PM, JAB <ge...@baesystems.com> wrote:
> Hi Lewis;
>
> I'm looking at creating a Nutch plugin to determine if a document is an article
> on religion, and which religion it's primarily talking about. Then, adding an
> annotation called 'religion' to the document for the primary category of
> the religion. Examples: Atheism, Buddhism, Christian, Hindu, Jewish,
> Muslim, or Unknown (if it can't be determined). No annotation will be added
> if it's not an article on religion. Next, another annotation for the
> sub-category of the religion. For example, under Christian would be Catholic
> or Protestant. Then possibly a third annotation for the denomination.
> Examples of denomination: 'Baptist Bible Churches' or 'Christian Methodist
> Episcopal Church' (I have a list of 147 denominations). I'm not familiar with
> religious breakdowns so I don't know if this is the appropriate way to
> categorize them.
>
> ******
> Design:
>
> I created a Java class on religion that implements the IndexingFilter interface. I next
> determine if it's an article on religion. I do so by counting the number of
> occurrences of certain key words in the document. For example, if 'God' appears
> more than 10 times, it's an article on religion. If it mentions 'Christian'
> more than a certain number of times and more often than other religions, the
> category would be 'Christian'. The first match in the denomination search
> would be assumed to be the denomination. I'm also using a
> language-detection plugin
> (http://developer.cybozu.co.jp/oss/2010/10/language-detect.html) to
> determine the language of the document so I can search for words in the
> document's native language. I don't know if this is the best approach to
> solving this issue.
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992130.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.



-- 
Lewis

Re: Nutch Author, Publication, and Religion Detection

Posted by JAB <ge...@baesystems.com>.
Hi Lewis;

I'm looking at creating a Nutch plugin to determine if a document is an article
on religion, and which religion it's primarily talking about. Then, adding an
annotation called 'religion' to the document for the primary category of
the religion. Examples: Atheism, Buddhism, Christian, Hindu, Jewish,
Muslim, or Unknown (if it can't be determined). No annotation will be added
if it's not an article on religion. Next, another annotation for the
sub-category of the religion. For example, under Christian would be Catholic
or Protestant. Then possibly a third annotation for the denomination.
Examples of denomination: 'Baptist Bible Churches' or 'Christian Methodist
Episcopal Church' (I have a list of 147 denominations). I'm not familiar with
religious breakdowns so I don't know if this is the appropriate way to
categorize them.

****** 
Design:

I created a Java class on religion that implements the IndexingFilter interface. I next
determine if it's an article on religion. I do so by counting the number of
occurrences of certain key words in the document. For example, if 'God' appears
more than 10 times, it's an article on religion. If it mentions 'Christian'
more than a certain number of times and more often than other religions, the
category would be 'Christian'. The first match in the denomination search
would be assumed to be the denomination. I'm also using a
language-detection plugin
(http://developer.cybozu.co.jp/oss/2010/10/language-detect.html) to
determine the language of the document so I can search for words in the
document's native language. I don't know if this is the best approach to
solving this issue.
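
The counting heuristic described above could be sketched roughly as follows.
This is a stand-alone illustration with no Nutch dependency, not the actual
plug-in code: the class name ReligionKeywordSketch, the keyword lists and the
category threshold of 3 are assumptions made for the example. Only the 'God'
threshold of 10 and the category names come from the description. It covers
the primary category only and English keywords only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the keyword-count heuristic (primary category only). */
class ReligionKeywordSketch {

    // From the description: 'God' must appear more than 10 times.
    static final int GENERAL_THRESHOLD = 10;
    // Assumed value for the unspecified "certain number of times".
    static final int CATEGORY_THRESHOLD = 3;

    // Illustrative English keyword lists; real lists would be per-language.
    static final Map<String, String[]> CATEGORY_KEYWORDS = new LinkedHashMap<>();
    static {
        CATEGORY_KEYWORDS.put("Atheism", new String[] {"atheism", "atheist", "atheists"});
        CATEGORY_KEYWORDS.put("Buddhism", new String[] {"buddhism", "buddhist", "buddhists"});
        CATEGORY_KEYWORDS.put("Christian", new String[] {"christian", "christians", "christianity"});
        CATEGORY_KEYWORDS.put("Hindu", new String[] {"hindu", "hindus", "hinduism"});
        CATEGORY_KEYWORDS.put("Jewish", new String[] {"jewish", "judaism", "jew", "jews"});
        CATEGORY_KEYWORDS.put("Muslim", new String[] {"muslim", "muslims", "islam", "islamic"});
    }

    /** Counts case-insensitive whole-word occurrences of word in text. */
    static int countWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /**
     * Returns null when the text is not considered an article on religion
     * (no annotation), "Unknown" when no single category clearly dominates,
     * otherwise the name of the dominant category.
     */
    static String classify(String text) {
        if (countWord(text, "god") <= GENERAL_THRESHOLD) {
            return null;
        }
        String best = "Unknown";
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, String[]> entry : CATEGORY_KEYWORDS.entrySet()) {
            int count = 0;
            for (String keyword : entry.getValue()) {
                count += countWord(text, keyword);
            }
            if (count > bestCount) {
                best = entry.getKey();
                bestCount = count;
                tie = false;
            } else if (count == bestCount && count > 0) {
                tie = true; // two categories are mentioned equally often
            }
        }
        if (tie || bestCount <= CATEGORY_THRESHOLD) {
            return "Unknown";
        }
        return best;
    }
}
```

In a real plug-in, the value returned by classify() would be added to the
document as the 'religion' field inside the indexing filter, and a null
result would leave the document unannotated.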




--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992130.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: Nutch Author, Publication, and Religion Detection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi George,

Where is each of these fields present within the document?



On Wed, Jun 27, 2012 at 7:59 PM, JAB <ge...@baesystems.com> wrote:
> I've written some simple Nutch plug-ins to detect a document's Author,
> Publication Date, and whether it's an article about Religion (including which
> religion it's talking about). I was wondering if anyone knows of any open
> source plug-ins that cover these tasks, so that I don't have to
> rely on my own custom solutions. I'm new to Nutch/GATE
> development.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.



-- 
Lewis