Posted to user@mahout.apache.org by Paritosh Ranjan <pr...@xebia.com> on 2012/01/24 08:34:01 UTC

Suggestions Needed : Developing application using Mahout

Hi,

I need some suggestions regarding the possibility of developing an application using Mahout.

The application deals with person names. For each name part, we know which types it can have and how often it is used as each type (its frequency).

For example:

Dr - TitlePreceding (frequency = 1100)
Dr - FamilyName (frequency = 200)
Señor - Salutation (frequency = 500)
Paritosh - GivenName (frequency = 900)
Ranjan - FamilyName (frequency = 800)
Ranjan - GivenName (frequency = 200)

As you can see, the same name part can occur as different types, but its frequency differs from type to type.

We also know which name-type patterns are commonly found.

For example:

Paritosh Ranjan can be interpreted as:
a) Paritosh [GivenName], and Ranjan [FamilyName]
b) Paritosh [GivenName], and Ranjan [GivenName]

But we know that [GivenName,FamilyName] is more common than [GivenName,GivenName].

Similarly, there are many other patterns involving other types such as Salutation, TitlePreceding, TitleSucceeding, MiddleName, etc.
The patterns can also use regex-style quantifiers, e.g. [GivenName+][FamilyName], meaning one or more [GivenName] followed by a [FamilyName].

Each pattern has a priority: some patterns are more common than others.
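
One possible way to make patterns such as [GivenName+][FamilyName] executable is to encode each sequence of types as a string of one-letter codes and match it with an ordinary regular expression. The sketch below is only an illustration: the letter codes, the pattern list, and the priority values are made up for the example and do not come from the real data.

```python
import re

# Hypothetical one-letter codes for name types (illustrative only).
CODES = {"Salutation": "S", "TitlePreceding": "T",
         "GivenName": "G", "FamilyName": "F"}

# Hypothetical patterns with made-up priorities (higher = more common).
PATTERNS = [
    ("[Salutation?][GivenName+][FamilyName]", re.compile(r"S?G+F"), 0.7),
    ("[GivenName+]", re.compile(r"G+"), 0.1),
]

def best_pattern(types):
    """Return (name, priority) of the highest-priority pattern that
    matches the whole type sequence, or None if nothing matches."""
    encoded = "".join(CODES[t] for t in types)
    matches = [(name, prio) for name, rx, prio in PATTERNS
               if rx.fullmatch(encoded)]
    return max(matches, key=lambda m: m[1]) if matches else None
```

With these assumed patterns, best_pattern(["GivenName", "FamilyName"]) selects the first pattern, while a sequence such as ["FamilyName", "GivenName"] matches nothing and returns None.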

The user enters a name, e.g. Mr. Paritosh Ranjan, and the output is:

Mr.[Salutation], Paritosh[GivenName], Ranjan[FamilyName]
Mr.[TitlePreceding], Paritosh[GivenName], Ranjan[FamilyName]
Mr.[Salutation], Paritosh[GivenName], Ranjan[GivenName]

Each interpretation gets a combined score from its pattern priority and the frequencies of its name parts, and the results are sorted by that score.
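
The ranking described above could be sketched roughly as follows. The frequencies are the ones listed earlier in this message; the pattern weights, the default weight, and the choice to multiply relative frequencies by the pattern weight are assumptions made for the example, not the actual scoring formula.

```python
from itertools import product

# Frequencies taken from the examples earlier in this message.
FREQ = {
    "Dr": {"TitlePreceding": 1100, "FamilyName": 200},
    "Paritosh": {"GivenName": 900},
    "Ranjan": {"FamilyName": 800, "GivenName": 200},
}

# Hypothetical pattern weights; unlisted patterns get a small default.
PATTERN_WEIGHT = {
    ("TitlePreceding", "GivenName", "FamilyName"): 0.6,
    ("TitlePreceding", "GivenName", "GivenName"): 0.1,
}
DEFAULT_WEIGHT = 0.01

def rank(tokens):
    """Score every combination of known types and sort best first."""
    options = []
    for tok in tokens:
        total = sum(FREQ[tok].values())
        # Relative frequency of each type for this name part.
        options.append([(t, n / total) for t, n in FREQ[tok].items()])
    results = []
    for combo in product(*options):
        types = tuple(t for t, _ in combo)
        score = PATTERN_WEIGHT.get(types, DEFAULT_WEIGHT)
        for _, p in combo:
            score *= p
        results.append((score, types))
    return sorted(results, reverse=True)

for score, types in rank(["Dr", "Paritosh", "Ranjan"]):
    print(f"{score:.4f}", types)
```

This enumerates every combination of types, so its cost grows exponentially with the number of name parts, and it raises a KeyError for a name part that is not in the table. It is meant only to make the scoring idea concrete.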

The total number of names and type information stored is around 50 million.

Question:

Can Mahout help in building such an application? The application is expected to be fast and scalable.
If so, which techniques and algorithms should be used?

Thanks and Regards,
Paritosh Ranjan

RE: Suggestions Needed : Developing application using Mahout

Posted by Paritosh Ranjan <pr...@xebia.com>.

Ted and Dhruv, thanks for your suggestions.
I think your suggestions will be useful. I will first do a POC with the sequential HMM, and might then need the MapReduce version to check scalability.

________________________________________
From: dhruv21@gmail.com [dhruv21@gmail.com] on behalf of Dhruv Kumar [dkumar@ecs.umass.edu]
Sent: Wednesday, January 25, 2012 1:17 AM
To: user@mahout.apache.org
Subject: Re: Suggestions Needed : Developing application using Mahout

HMMs seem to be a good fit for this problem. They are used ubiquitously for
pattern detection.

If you are interested in the application's scalability, I would suggest
having a look at MAHOUT-627. It contains a patch for a MapReduce variant of
HMM training using the Baum-Welch algorithm. Depending on your situation, you
could use this method to train your model, and then use our other HMM APIs
to decode the pattern.

Re: Suggestions Needed : Developing application using Mahout

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.

HMMs seem to be a good fit for this problem. They are used ubiquitously for
pattern detection.

If you are interested in the application's scalability, I would suggest
having a look at MAHOUT-627. It contains a patch for a MapReduce variant of
HMM training using the Baum-Welch algorithm. Depending on your situation, you
could use this method to train your model, and then use our other HMM APIs
to decode the pattern.
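
To make the decoding step concrete, here is a minimal Viterbi sketch in plain Python. It does not use Mahout's HMM API. The hidden states are the name types, the emission probabilities are derived from the frequency table in the original post, and the start and transition probabilities are invented placeholders for values that training (e.g. Baum-Welch) would supply.

```python
import math

STATES = ["TitlePreceding", "GivenName", "FamilyName"]

# Hypothetical start and transition probabilities (placeholders for
# values that would normally be learned from data).
START = {"TitlePreceding": 0.3, "GivenName": 0.6, "FamilyName": 0.1}
TRANS = {
    "TitlePreceding": {"TitlePreceding": 0.1, "GivenName": 0.7, "FamilyName": 0.2},
    "GivenName": {"TitlePreceding": 0.05, "GivenName": 0.2, "FamilyName": 0.75},
    "FamilyName": {"TitlePreceding": 0.05, "GivenName": 0.15, "FamilyName": 0.8},
}

# Counts from the original post: name part -> type -> frequency.
FREQ = {
    "Dr": {"TitlePreceding": 1100, "FamilyName": 200},
    "Paritosh": {"GivenName": 900},
    "Ranjan": {"FamilyName": 800, "GivenName": 200},
}
TOTAL = {s: sum(f.get(s, 0) for f in FREQ.values()) for s in STATES}
FLOOR = 1e-6  # avoids log(0) for unseen (name part, type) pairs

def emit(state, word):
    """Approximate P(word | state) from the frequency table."""
    return max(FREQ.get(word, {}).get(state, 0) / TOTAL[state], FLOOR)

def viterbi(words):
    """Return the most probable sequence of name types for the words."""
    # best[s] = (log probability of the best path ending in s, that path)
    best = {s: (math.log(START[s]) + math.log(emit(s, words[0])), [s])
            for s in STATES}
    for word in words[1:]:
        nxt = {}
        for s in STATES:
            score, path = max(
                (best[p][0] + math.log(TRANS[p][s]), best[p][1])
                for p in STATES)
            nxt[s] = (score + math.log(emit(s, word)), path + [s])
        best = nxt
    return max(best.values())[1]

print(viterbi(["Dr", "Paritosh", "Ranjan"]))
```

Under these assumed probabilities, "Dr Paritosh Ranjan" decodes to TitlePreceding, GivenName, FamilyName. Plain Viterbi returns only the single best labelling; producing a ranked list would need an n-best variant.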

On Tue, Jan 24, 2012 at 3:48 PM, Ted Dunning <te...@gmail.com> wrote:

> There are a bunch of papers on this.  Search "named entity recognizer CRF"
> on Google.
>
> The basic idea is that an HMM or CRF has internal state that can be used to
> mark named entities.  We don't have to define what the hidden states mean,
> just help the HMM or CRF find an internal representation that has the right
> outputs.
>
> In more detail, the text and text characteristics are inputs that
> are applied at each successive word position.  The internal state is, well,
> internal, but it triggers observable transitions in the output which are
> the labels that we want to have.  For NER, this label is typically something
> like NAMED_ENTITY / NORMAL_TEXT or something more refined.

Re: Suggestions Needed : Developing application using Mahout

Posted by Ted Dunning <te...@gmail.com>.

There are a bunch of papers on this.  Search "named entity recognizer CRF"
on Google.

The basic idea is that an HMM or CRF has internal state that can be used to
mark named entities.  We don't have to define what the hidden states mean,
just help the HMM or CRF find an internal representation that has the right
outputs.

In more detail, the text and text characteristics are inputs that
are applied at each successive word position.  The internal state is, well,
internal, but it triggers observable transitions in the output which are
the labels that we want to have.  For NER, this label is typically something
like NAMED_ENTITY / NORMAL_TEXT or something more refined.
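
As a rough illustration of "text and text characteristics" as per-position inputs, the sketch below builds one feature dictionary per word. The feature names and the choice of features are assumptions for the example; a real CRF toolkit would define its own input format.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (names are
    hypothetical, chosen only for illustration)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "ends_with_period": tok.endswith("."),
        "length": len(tok),
        "position": i,
        "is_last": i == len(tokens) - 1,
        # Context feature: the previous word, or a start marker.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
    }

def sequence_features(tokens):
    """One feature dict per word position, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]

for feats in sequence_features(["Mr.", "Paritosh", "Ranjan"]):
    print(feats)
```

For the name question in this thread, the output labels would be the name types (Salutation, GivenName, FamilyName, and so on) rather than NAMED_ENTITY / NORMAL_TEXT.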

On Tue, Jan 24, 2012 at 3:55 AM, Paritosh Ranjan <pr...@xebia.com> wrote:

> If you can also guide me a bit on how to use HMM or CRF for this problem
> (at a high level), that would be of great help too.
>

RE: Suggestions Needed : Developing application using Mahout

Posted by Paritosh Ranjan <pr...@xebia.com>.

Thanks for the suggestions, Ted.

I read about HMM, Viterbi and CRF at a very high level, and it looks like they might be useful for this problem.
I will read about them in detail and try to work out a solution based on them.

If you can also guide me a bit on how to use HMM or CRF for this problem (at a high level), that would be of great help too.
________________________________________
From: Ted Dunning [ted.dunning@gmail.com]
Sent: Tuesday, January 24, 2012 8:41 AM
To: user@mahout.apache.org
Subject: Re: Suggestions Needed : Developing application using Mahout

The HMM implementations might be of help, but I think that a small CRF
implementation that is oriented around string transduction would be more
helpful.

The Stanford Named Entity Recognizer (NER) has such an implementation.  I
think NLTK has one.  I think GATE has one as well.

The basic technology is something that computes string and markup
probabilities and searches the space of markups using something like beam
search.

Re: Suggestions Needed : Developing application using Mahout

Posted by Ted Dunning <te...@gmail.com>.

The HMM implementations might be of help, but I think that a small CRF
implementation that is oriented around string transduction would be more
helpful.

The Stanford Named Entity Recognizer (NER) has such an implementation.  I
think NLTK has one.  I think GATE has one as well.

The basic technology is something that computes string and markup
probabilities and searches the space of markups using something like beam
search.
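
A beam search over markups can be sketched in a few lines. This is not the implementation used by any of the tools named above: the scoring function, the toy lexicon, and all the numbers are made up purely to show the search itself.

```python
import math

def beam_search(tokens, labels, score_step, beam_width=3):
    """Keep only the beam_width best partial labelings at each position.
    score_step(prev_label, label, token) must return a probability > 0."""
    beam = [(0.0, [])]  # (log score, labels chosen so far)
    for tok in tokens:
        candidates = []
        for logp, path in beam:
            prev = path[-1] if path else None
            for lab in labels:
                step = math.log(score_step(prev, lab, tok))
                candidates.append((logp + step, path + [lab]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam

# Toy scorer with made-up numbers, for illustration only.
LEXICON = {
    "Mr.": {"Salutation": 0.9},
    "Paritosh": {"GivenName": 0.9},
    "Ranjan": {"FamilyName": 0.8, "GivenName": 0.2},
}

def toy_score(prev, label, token):
    lexical = LEXICON.get(token, {}).get(label, 0.01)
    # Mildly penalise repeating the same label twice in a row.
    return lexical * (0.5 if prev == label else 1.0)

results = beam_search(["Mr.", "Paritosh", "Ranjan"],
                      ["Salutation", "GivenName", "FamilyName"], toy_score)
for logp, path in results:
    print(round(math.exp(logp), 4), path)
```

Unlike exhaustive enumeration, the work per word is bounded by the beam width times the number of labels, at the cost of possibly pruning the best markup early. The surviving beam entries also give a ranked list of alternative markups.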
