Posted to dev@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2016/06/28 21:56:28 UTC

Model to detect the gender

Hello everybody,

We built an NER model to find persons (names) in our documents.
We are looking for the best approach to determine whether a name is
male or female.

Possible solutions:
- Plain dictionary?
- Regex to check the initial and/letters of the name?
- Classifier? (naive bayes? Maxent?)

Thanks

Re: Model to detect the gender

Posted by Damiano Porta <da...@gmail.com>.
Hi Mondher,
you gave me really good advice! Thank you!
Let me recap a little bit.

Basically, I need a dictionary to tell whether a name can be male, female, or
both. If I am sure it is male or female, I will not go any further; otherwise,
if I find an entity that can be both, I will run the classification task.
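
The dictionary-first step recapped above could be sketched roughly as follows. This is only an illustration: the class name `GenderLookup`, the method `resolveFromDictionary`, and the three hard-coded entries (taken from the example dictionary quoted further down) are assumptions, not part of OpenNLP.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: dictionary lookup first, classifier only for ambiguous names. */
final class GenderLookup {
    enum Gender { MALE, FEMALE, BOTH }

    // Toy dictionary (assumption); a real one would be loaded from a file or database.
    private static final Map<String, Gender> DICTIONARY = Map.of(
            "agatha", Gender.FEMALE,
            "john", Gender.MALE,
            "smith", Gender.BOTH);

    /**
     * Returns the gender when the dictionary is unambiguous. Returns an empty
     * Optional when the name is unknown or fits both genders, which is the
     * signal to fall back to the context classifier.
     */
    static Optional<Gender> resolveFromDictionary(String name) {
        Gender gender = DICTIONARY.get(name.toLowerCase(Locale.ROOT));
        if (gender == null || gender == Gender.BOTH) {
            return Optional.empty();
        }
        return Optional.of(gender);
    }
}
```

A caller would only run the classifier when the returned Optional is empty.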

The classifier is built from a list of features; these features
represent the "state" of specific surrounding tokens.
The classification is done via the Doccat trainer:
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.doccat

Now I have to create a .train file to train my model with the MALE and FEMALE
classes.

Should I build the training data like this:

FEMALE   False   True   UNCERTAIN   1   FEMALE   3   FEMALE   4   UNCERTAIN   2   EMPTY   0   EMPTY   0
FEMALE   False   True   UNCERTAIN   1   FEMALE   1   FEMALE   3   UNCERTAIN   2   EMPTY   0   EMPTY   0
FEMALE   False   True   UNCERTAIN   1   FEMALE   2   FEMALE   1   UNCERTAIN   2   EMPTY   0   EMPTY   0
MALE   True   False   UNCERTAIN   1   MALE   3   MALE   4   UNCERTAIN   2   EMPTY   0   EMPTY   0
MALE   True   False   UNCERTAIN   1   MALE   1   MALE   3   UNCERTAIN   2   EMPTY   0   EMPTY   0
MALE   True   False   UNCERTAIN   1   MALE   2   MALE   1   UNCERTAIN   2   EMPTY   0   EMPTY   0

Is this the right way?
Obviously, this is dummy data; I just repeated it. I am asking in order to
understand "how to add those features into the training of the classifier".
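
One possible way to write such lines, sketched under assumptions: the Doccat training format is one sample per line, with the category first and whitespace-separated tokens after it, and the default feature generation treats the tokens as a bag of words. With a bag of words, a bare value such as FEMALE or 1 no longer says which feature it belongs to, so the hypothetical helper below prefixes every value with its feature index (F1=..., F2=...). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical formatter for one Doccat-style training line. */
final class DoccatSampleFormatter {

    /** Builds one line: the category, then one "F<i>=<value>" token per feature. */
    static String toTrainingLine(String category, String... featureValues) {
        List<String> tokens = new ArrayList<>();
        tokens.add(category);
        for (int i = 0; i < featureValues.length; i++) {
            // Prefixing keeps each value tied to its feature in a bag of words.
            tokens.add("F" + (i + 1) + "=" + featureValues[i]);
        }
        return String.join(" ", tokens);
    }
}
```

Each call produces one line of the .train file; whether this encoding actually helps the classifier would have to be checked by evaluation.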

Thank you very much! I am looking forward to your reply.
Damiano



2016-07-01 15:05 GMT+02:00 Mondher Bouazizi <mo...@gmail.com>:

> Hi,
>
> Sorry for my late reply. I didn't fully understand your last email, but here
> is what I meant:
>
> Given a simple dictionary you have that has the following columns:
>
> Name      Type     Gender
> Agatha    First    F
> John      First    M
> Smith     Both     B
>
> where:
> - "First" refers to first name, "Last" (not in the example) refers to last
> name, and Both means it can be both.
> - "F" refers to female, "M" refers to males, and "B" refers to both
> genders.
>
> and given the following two sentences:
>
> 1. "It was nice meeting you John. I hope we meet again soon."
>
> 2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and felt
> she knows something"
>
> In the first example, a dictionary lookup shows that "John" is
> a male name, so there is no need to go any further.
> However, in the second example, the name "Smith", which is a family name in
> our case, can fit both males and females. Therefore, we need to
> extract features from the surrounding context and perform a classification
> task.
> Here are some features that I think would be interesting to use:
>
> . Presence of a male title (e.g. "Mr.") before the name
>   Values={True, False}
> . Presence of a female title (e.g. "Mrs.") before the name
>   Values={True, False}
>
> . Gender of the first personal pronoun (subject or object form) to the
>   right of the name
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the first personal pronoun to the right
>   (in words)
>   Values=NUMERIC
> . Gender of the second personal pronoun to the right of the name
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the second personal pronoun to the right
>   (in words)
>   Values=NUMERIC
> . Gender of the third personal pronoun to the right of the name
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the third personal pronoun to the right
>   (in words)
>   Values=NUMERIC
>
> . Gender of the first personal pronoun (subject or object form) to the
>   left of the name
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the first personal pronoun to the left
>   (in words)
>   Values=NUMERIC
> . Gender of the second personal pronoun to the left of the name
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the second personal pronoun to the left
>   (in words)
>   Values=NUMERIC
> . Gender of the third personal pronoun to the left of the name
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the third personal pronoun to the left
>   (in words)
>   Values=NUMERIC
>
> In the second example here are the values you have for your features
>
> F1 = False
> F2 = True
> F3 = UNCERTAIN
> F4 = 1
> F5 = FEMALE
> F6 = 3
> F7 = FEMALE
> F8 = 4
> F9 = UNCERTAIN
> F10 = 2
> F11 = EMPTY
> F12 = 0
> F13 = EMPTY
> F14 = 0
>
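
The pronoun features F3 to F14 listed above could be computed along these lines. This is a rough sketch under several assumptions: tokens are plain words with punctuation already removed, the name span includes the title ("Mrs Smith"), distances are counted in tokens from the nearest edge of that span, and the small pronoun sets (which also treat possessive "his"/"her" as pronouns) are illustrative only. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical extractor for the pronoun gender/distance features. */
final class PronounFeatures {
    // Illustrative pronoun lists (assumption), not a complete inventory.
    private static final Set<String> MALE = Set.of("he", "him", "his");
    private static final Set<String> FEMALE = Set.of("she", "her", "hers");
    private static final Set<String> UNCERTAIN =
            Set.of("i", "me", "you", "we", "us", "they", "them");

    /** Returns MALE, FEMALE, or UNCERTAIN for a pronoun, or null for any other token. */
    static String genderOf(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (MALE.contains(t)) return "MALE";
        if (FEMALE.contains(t)) return "FEMALE";
        if (UNCERTAIN.contains(t)) return "UNCERTAIN";
        return null;
    }

    /**
     * Scans away from the name span (direction +1 = right, -1 = left) and
     * returns gender and distance for up to `count` pronouns. Slots with no
     * pronoun are filled with EMPTY and 0.
     */
    static List<String> scan(String[] tokens, int nameStart, int nameEnd,
                             int direction, int count) {
        List<String> features = new ArrayList<>();
        int origin = direction > 0 ? nameEnd : nameStart;
        for (int i = origin + direction;
             i >= 0 && i < tokens.length && features.size() < 2 * count;
             i += direction) {
            String gender = genderOf(tokens[i]);
            if (gender != null) {
                features.add(gender);
                features.add(String.valueOf(Math.abs(i - origin)));
            }
        }
        while (features.size() < 2 * count) {
            features.add("EMPTY");
            features.add("0");
        }
        return features;
    }
}
```

For the second example sentence, with the name span covering "Mrs Smith", scanning right with a count of 3 yields the F3 to F8 values and scanning left yields the F9 to F14 values.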
> Of course the choice of features depends on the type of data, and the
> features themselves might not work well for some texts such as ones
> collected from twitter for example.
>
> I hope this helps.
>
> Best regards
>
> Mondher
>
>
> On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta <da...@gmail.com>
> wrote:
>
> > Hi Mondher,
> > could you give me a raw example so I can understand how I should train the
> > classifier model?
> >
> > Thank you in advance!
> > Damiano
> >
> >
> > 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <mo...@gmail.com>:
> >
> > > Hi,
> > >
> > > I would recommend a hybrid approach where, as a first step, you use a
> > > plain dictionary and then perform the classification only if needed.
> > >
> > > It's straightforward, but I think it would perform better than
> > > just running a classification task.
> > >
> > > In the first step, you use a dictionary of names along with an attribute
> > > specifying whether each name fits males, females, or both. If the
> > > name fits males or females exclusively, there is no need to go any
> > > further.
> > >
> > > If the name fits both genders, or is a family name etc., a second
> > > step is needed where you extract features from the context (surrounding
> > > words, etc.) and perform a classification task using any machine
> > > learning algorithm.
> > >
> > > Another way would be to use the dictionary information itself (whether
> > > the name fits males, females, or both) as a feature when you perform the
> > > classification.
> > >
> > > Best regards,
> > >
> > > Mondher
> > >
> > > I am not sure
> > >
> > > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <damianoporta@gmail.com>
> > > wrote:
> > >
> > > > Awesome! Thank you so much William!
> > > >
> > > > 2016-06-29 13:36 GMT+02:00 William Colen <wi...@gmail.com>:
> > > >
> > > > > To create an NER model, OpenNLP extracts features from the context,
> > > > > such as: word prefix and suffix, next word, previous word, previous
> > > > > word prefix and suffix, next word prefix and suffix, etc.
> > > > > When you don't configure the feature generator, it applies the
> > > > > default:
> > > > >
> > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> > > > >
> > > > > Default feature generator:
> > > > >
> > > > > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> > > > >          new AdaptiveFeatureGenerator[]{
> > > > >            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> > > > >            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> > > > >            new OutcomePriorFeatureGenerator(),
> > > > >            new PreviousMapFeatureGenerator(),
> > > > >            new BigramNameFeatureGenerator(),
> > > > >            new SentenceFeatureGenerator(true, false)
> > > > >            });
> > > > >
> > > > >
> > > > > These default features should work for most cases (especially for
> > > > > English), but they can of course be extended. If you do so, your
> > > > > model will take the new features into account. So yes, you are
> > > > > putting the features in your model.
> > > > >
> > > > > Configuring custom features is not easy. I would start with the
> > > > > default, run 10-fold cross-validation, and take notes on its
> > > > > effectiveness. Then change or add a feature, evaluate, and take
> > > > > notes again. Sometimes a feature that we are sure would help can
> > > > > destroy the model's effectiveness.
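
The splitting behind the 10-fold cross-validation mentioned above can be illustrated with a small generic sketch. It is not OpenNLP code (OpenNLP ships its own cross-validation tooling for the name finder, which I believe is the more practical route); the class `KFoldSplitter` and its methods are hypothetical and only show how each round holds out one fold for evaluation and trains on the rest.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical k-fold splitter; samples can be any type, e.g. training lines. */
final class KFoldSplitter {

    /** Distributes the samples over k folds by round-robin assignment. */
    static <T> List<List<T>> folds(List<T> samples, int k) {
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < samples.size(); i++) {
            folds.get(i % k).add(samples.get(i));
        }
        return folds;
    }

    /** Training data for one round: every fold except the held-out one. */
    static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> training = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != heldOut) {
                training.addAll(folds.get(i));
            }
        }
        return training;
    }
}
```

With k = 10, each feature configuration would be trained and evaluated ten times, and the averaged score is what goes into the notes for comparison.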
> > > > >
> > > > > Regards
> > > > > William
> > > > >
> > > > >
> > > > > 2016-06-29 7:00 GMT-03:00 Damiano Porta <da...@gmail.com>:
> > > > >
> > > > > > Thank you William! Really appreciated!
> > > > > >
> > > > > > There is only one point I do not get. When you said "You could
> > > > > > increment your model using Custom Feature Generators", does it
> > > > > > mean that I can "put" these features inside ONE .bin file (model)
> > > > > > that implements different things, or is the name finder one thing
> > > > > > and those feature generators another?
> > > > > >
> > > > > > Thank you in advance for the clarification.
> > > > > >
> > > > > > 2016-06-29 1:23 GMT+02:00 William Colen <william.colen@gmail.com>:
> > > > > >
> > > > > > > Not exactly. You would create a new NER model to replace yours.
> > > > > > >
> > > > > > > In this approach you would need a corpus like this:
> > > > > > >
> > > > > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
> > > > > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
> > > > > > > <START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
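
To make the annotation format of this corpus concrete, here is a small sketch that pulls the typed name spans out of one training sentence with a regular expression. It is only an illustration for inspecting or sanity-checking such a corpus; the class `AnnotatedSpans` and its method are hypothetical, and OpenNLP has its own reader for this format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that lists "type=name" pairs found in one corpus line. */
final class AnnotatedSpans {
    // Matches <START:type> some tokens <END>, with whitespace around the name.
    private static final Pattern SPAN =
            Pattern.compile("<START:(\\w+)>\\s+(.+?)\\s+<END>");

    static List<String> extract(String line) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = SPAN.matcher(line);
        while (matcher.find()) {
            spans.add(matcher.group(1) + "=" + matcher.group(2));
        }
        return spans;
    }
}
```

Counting the extracted personMale and personFemale spans is a quick way to see whether the two classes are reasonably balanced before training.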
> > > > > > >
> > > > > > >
> > > > > > > I am not a native English speaker, so I am not sure if the
> > > > > > > example is clear enough. I tried to use Jessie as a neutral name
> > > > > > > and "she" as disambiguation.
> > > > > > >
> > > > > > > With a big enough corpus, maybe you could create a model that
> > > > > > > outputs both classes, personMale and personFemale. To train a
> > > > > > > model you can follow
> > > > > > >
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > > > > >
> > > > > > > Let's say your results are not good enough. You could increment
> > > > > > > your model using Custom Feature Generators (
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > > > > and
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > > > > ).
> > > > > > >
> > > > > > > One of the implemented feature generators can take a dictionary (
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > > > > ).
> > > > > > > You can also implement other convenient feature generators, for
> > > > > > > instance one based on a regex.
> > > > > > >
> > > > > > > Again, this is just a wild guess at how to implement it. I don't
> > > > > > > know if it would perform well. I was only thinking about how to
> > > > > > > implement a gender ML model that uses the surrounding context.
> > > > > > >
> > > > > > > I hope that clarifies things.
> > > > > > >
> > > > > > > William
> > > > > > >
> > > > > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <damianoporta@gmail.com>:
> > > > > > >
> > > > > > > > Hi William,
> > > > > > > > OK, so you are talking about a kind of pipeline where we execute:
> > > > > > > >
> > > > > > > > 1. NER (personM for example)
> > > > > > > > 2. Regex (filter to reduce false positives)
> > > > > > > > 3. Plain dictionary (filter as above) ?
> > > > > > > >
> > > > > > > > Yes, we can split our model in two for M and F; it is not a big
> > > > > > > > problem, since we have a database grouped by gender.
> > > > > > > >
> > > > > > > > I only have a doubt regarding the use of a dictionary. If we use
> > > > > > > > a dictionary to create the model, couldn't we just use it to
> > > > > > > > detect names without using NER?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > 2016-06-29 0:10 GMT+02:00 William Colen <william.colen@gmail.com>:
> > > > > > > >
> > > > > > > > > Do you plan to use the surrounding context? If yes, maybe you
> > > > > > > > > could try to split NER into two categories: PersonM and
> > > > > > > > > PersonF. Just an idea; I have never read about or tried
> > > > > > > > > anything like it. You would need a training corpus with these
> > > > > > > > > classes.
> > > > > > > > >
> > > > > > > > > You could also add both the plain dictionary and the regex as
> > > > > > > > > NER features and check how much they improve the results.
> > > > > > > > >

Re: Model to detect the gender

Posted by Damiano Porta <da...@gmail.com>.
Jorn, could you please send me a link to that model?

2016-07-04 14:42 GMT+02:00 Joern Kottmann <ko...@gmail.com>:

> The co-referencer we used to have in opennlp-tools has a model to
> detect the gender of names. That could be extracted and put into a
> standalone component.
>
> Jörn
>
> On Mon, Jul 4, 2016 at 2:41 PM, Joern Kottmann <ko...@gmail.com> wrote:
>
> > I was speaking about the second case. We could build a dedicated
> > component specialized in extracting properties of already detected
> > entities.
> >
> > Jörn
> >
> > On Mon, Jul 4, 2016 at 2:33 PM, Damiano Porta <da...@gmail.com>
> > wrote:
> >
> >> Hello Jorn,
> >> Do you mean that I need to "extend" my NER model to find other
> >> name-related entities too?
> >>
> >> OR
> >>
> >> Find the entities with a dictionary and then train a maxent model that
> >> finds other properties like person title, job position, etc.?
> >>
> >> Thanks for the clarification.
> >>
> >>
> >> 2016-07-04 12:15 GMT+02:00 Joern Kottmann <ko...@gmail.com>:
> >>
> >> > Hello,
> >> >
> >> > there are also other interesting properties e.g. person title (e.g.
> >> > professor, doctor), job title/position,
> >> > company legal form. And much more for other entity types.
> >> >
> >> > Maybe it would be worth it to build a dedicated component to extract
> >> > properties from entities.
> >> >
> >> > Jörn
> >> >

Re: Model to detect the gender

Posted by Joern Kottmann <ko...@gmail.com>.
The co-referencer we used to have in opennlp-tools has a model to
detect the gender of names. That could be extracted and put into a
standalone component.

Jörn


Re: Model to detect the gender

Posted by Joern Kottmann <ko...@gmail.com>.
I was speaking about the second case. We could build a dedicated component
specialized in extracting properties about already detected entities.

Jörn
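
A minimal sketch of what such a dedicated component could look like, assuming the entities have already been found by a name finder and are passed in as token offsets. The interface EntityPropertyExtractor, the class HonorificPropertyExtractor and the word lists are hypothetical names made up for illustration; none of them exist in OpenNLP.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical interface, not part of OpenNLP: derives properties for an
// entity that some earlier step (e.g. a name finder) has already detected.
interface EntityPropertyExtractor {
    // start is inclusive, end is exclusive (token indexes of the entity)
    Map<String, String> extract(String[] tokens, int start, int end);
}

// Looks at the token right before a person span and maps it to a property.
// The word lists are illustrative assumptions, not a complete resource.
class HonorificPropertyExtractor implements EntityPropertyExtractor {
    private static final Set<String> MALE = Set.of("mr", "sir");
    private static final Set<String> FEMALE = Set.of("mrs", "ms", "miss");
    private static final Set<String> TITLES = Set.of("dr", "doctor", "prof", "professor");

    @Override
    public Map<String, String> extract(String[] tokens, int start, int end) {
        Map<String, String> props = new LinkedHashMap<>();
        int i = start - 1;
        // Skip a detached period, as in the tokenized form "Mr . Vinken".
        if (i >= 0 && tokens[i].equals(".")) {
            i--;
        }
        if (i < 0) {
            return props; // nothing precedes the entity
        }
        String prev = tokens[i].toLowerCase(Locale.ROOT).replace(".", "");
        if (MALE.contains(prev)) {
            props.put("gender", "MALE");
        } else if (FEMALE.contains(prev)) {
            props.put("gender", "FEMALE");
        } else if (TITLES.contains(prev)) {
            props.put("title", prev);
        }
        return props;
    }
}

class PropertyExtractorDemo {
    public static void main(String[] args) {
        EntityPropertyExtractor extractor = new HonorificPropertyExtractor();
        String[] tokens = {"Yes", ",", "I", "met", "Mrs.", "Smith", "."};
        // "Smith" is the entity at tokens[5..6)
        System.out.println(extractor.extract(tokens, 5, 6)); // prints {gender=FEMALE}
    }
}
```

Keeping the extractor behind a small interface would let further properties (job position, company legal form) be added as separate implementations without retraining the NER model.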


Re: Model to detect the gender

Posted by Damiano Porta <da...@gmail.com>.
Hello Jörn,
Do you mean that I need to "extend" my NER model to find other name-related
entities too?

OR

Find the entities with a dictionary and then train a maxent model that
finds other properties such as person title, job position, etc.?

Thanks for the clarification.
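
For the gender property specifically, the second option can be combined with the dictionary-first idea from this thread. The sketch below is only an illustration under stated assumptions: HybridGenderResolver and firstPronounToTheRight are hypothetical names, the dictionary holds just the three example names, and the pronoun rule is a placeholder standing in for a trained classifier (e.g. a maxent model), not a replacement for one.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.BiFunction;

enum Gender { MALE, FEMALE, UNCERTAIN }

// Hypothetical helper: dictionary lookup first, context classifier only
// when the dictionary cannot decide.
class HybridGenderResolver {
    private final Map<String, Gender> dictionary;
    // Stand-in for a trained model: (tokens, index of the name) -> gender
    private final BiFunction<String[], Integer, Gender> contextClassifier;

    HybridGenderResolver(Map<String, Gender> dictionary,
                         BiFunction<String[], Integer, Gender> contextClassifier) {
        this.dictionary = dictionary;
        this.contextClassifier = contextClassifier;
    }

    Gender resolve(String[] tokens, int nameIndex) {
        String key = tokens[nameIndex].toLowerCase(Locale.ROOT);
        Gender fromDictionary = dictionary.getOrDefault(key, Gender.UNCERTAIN);
        if (fromDictionary != Gender.UNCERTAIN) {
            return fromDictionary; // unambiguous name, no need to go further
        }
        return contextClassifier.apply(tokens, nameIndex);
    }
}

class HybridGenderDemo {
    // Placeholder rule (an assumption, not a trained model): take the gender
    // of the first gendered pronoun to the right of the name.
    static Gender firstPronounToTheRight(String[] tokens, int nameIndex) {
        for (int i = nameIndex + 1; i < tokens.length; i++) {
            String t = tokens[i].toLowerCase(Locale.ROOT);
            if (t.equals("he") || t.equals("him") || t.equals("his")) {
                return Gender.MALE;
            }
            if (t.equals("she") || t.equals("her") || t.equals("hers")) {
                return Gender.FEMALE;
            }
        }
        return Gender.UNCERTAIN;
    }

    public static void main(String[] args) {
        Map<String, Gender> dictionary = Map.of(
                "agatha", Gender.FEMALE,
                "john", Gender.MALE,
                "smith", Gender.UNCERTAIN);
        HybridGenderResolver resolver =
                new HybridGenderResolver(dictionary, HybridGenderDemo::firstPronounToTheRight);

        String[] first = "It was nice meeting you John .".split(" ");
        String[] second = "Yes , I met Mrs. Smith . I asked her her opinion".split(" ");
        System.out.println(resolver.resolve(first, 5));  // dictionary hit: MALE
        System.out.println(resolver.resolve(second, 5)); // context fallback: FEMALE
    }
}
```

Passing the classifier in as a function keeps the lookup logic independent of how the model is trained, so the placeholder rule could later be swapped for a real model without touching the resolver.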


2016-07-04 12:15 GMT+02:00 Joern Kottmann <ko...@gmail.com>:

> Hello,
>
> there are also other interesting properties, e.g. person title (e.g.
> professor, doctor), job title/position, and company legal form, and
> many more for other entity types.
>
> Maybe it would be worth it to build a dedicated component to extract
> properties from entities.
>
> Jörn
>
> On Fri, Jul 1, 2016 at 3:05 PM, Mondher Bouazizi <mondher.bouazizi@gmail.com> wrote:
>
> > Hi,
> >
> > Sorry for my late reply. I didn't understand your last email well, but
> > here is what I meant:
> >
> > Given a simple dictionary with the following columns:
> >
> > Name      Type      Gender
> > Agatha    First     F
> > John      First     M
> > Smith     Both      B
> >
> > where:
> > - "First" refers to a first name, "Last" (not in the example) refers to
> > a last name, and "Both" means it can be either.
> > - "F" refers to female, "M" refers to male, and "B" refers to both
> > genders.
> >
> > and given the following two sentences:
> >
> > 1. "It was nice meeting you John. I hope we meet again soon."
> >
> > 2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and
> > felt she knows something"
> >
> > In the first example, when you check the dictionary, the name "John" is
> > a male name, so there is no need to go any further.
> > However, in the second example, the name "Smith", which is a family name
> > in our case, can fit both males and females. Therefore, we need to
> > extract features from the surrounding context and perform a
> > classification task.
> > Here are some features that I think would be interesting to use:
> >
> > . Presence of a male honorific (e.g. "Mr.") before the name {True, False}
> > . Presence of a female honorific (e.g. "Mrs.") before the name {True, False}
> >
> > . Gender of the first personal pronoun (subject or object form) to the
> > right of the name    Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > . Distance between the name and the first personal pronoun to the right
> > (in words)    Values=NUMERIC
> > . Gender of the second personal pronoun to the right of the
> > name    Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > . Distance between the name and the second personal pronoun to the right
> > (in words)    Values=NUMERIC
> > . Gender of the third personal pronoun to the right of the
> > name                                      Values={MALE, FEMALE,
> UNCERTAIN,
> > EMPTY}
> > . Distance between the name and the third personal pronoun right (in
> > words)                  Values=NUMERIC
> >
> > . Gender of the first personal pronoun (subject or object form) to the
> left
> > of the name       Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > . Distance between the name and the first personal pronoun to the left
> (in
> > words)            Values=NUMERIC
> > . Gender of the second personal pronoun to the left of the
> > name                                    Values={MALE, FEMALE, UNCERTAIN,
> > EMPTY}
> > . Distance between the name and the second personal pronoun left
> >                     Values=NUMERIC
> > . Gender of the third personal pronoun to the left of the
> > name                                        Values={MALE, FEMALE,
> > UNCERTAIN, EMPTY}
> > . Distance between the name and the third personal pronoun left (in
> > words)                    Values=NUMERIC
> >
> > In the second example here are the values you have for your features
> >
> > F1 = False
> > F2 = True
> > F3 = UNCERTAIN
> > F4 = 1
> > F5 = FEMALE
> > F6 = 3
> > F7 = FEMALE
> > F8 = 4
> > F9 = UNCERTAIN
> > F10 = 2
> > F11 = EMPTY
> > F12 = 0
> > F13 = EMPTY
> > F14 = 0
> >
> > Of course the choice of features depends on the type of data, and the
> > features themselves might not work well for some texts such as ones
> > collected from twitter for example.
> >
> > I hope this help you.
> >
> > Best regards
> >
> > Mondher
> >
> >
> > On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta <da...@gmail.com>
> > wrote:
> >
> > > Hi Mondher,
> > > could you give me a raw example to understand how i should train the
> > > classifier model?
> > >
> > > Thank you in advance!
> > > Damiano
> > >
> > >
> > > 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <mondher.bouazizi@gmail.com
> >:
> > >
> > > > Hi,
> > > >
> > > > I would recommend a hybrid approach where, in a first step, you use a
> > > plain
> > > > dictionary and then perform the classification if needed.
> > > >
> > > > It's straightforward, but I think it would present better
> performances
> > > than
> > > > just performing a classification task.
> > > >
> > > > In the first step you use a dictionary of names along with an
> attribute
> > > > specifying whether the name fits for males, females or both. In case
> > the
> > > > name fits for males or females exclusively, then no need to go any
> > > further.
> > > >
> > > > If the name fits for both genders, or is a family name etc., a second
> > > step
> > > > is needed where you extract features from the context (surrounding
> > words,
> > > > etc.) and perform a classification task using any machine learning
> > > > algorithm.
> > > >
> > > > Another way would be using the information itself (whether the name
> > fits
> > > > for males, females or both) as a feature when you perform the
> > > > classification.
> > > >
> > > > Best regards,
> > > >
> > > > Mondher
> > > >
> > > > I am not sure
> > > >
> > > > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <
> > damianoporta@gmail.com>
> > > > wrote:
> > > >
> > > > > Awesome! Thank you so much WIlliam!
> > > > >
> > > > > 2016-06-29 13:36 GMT+02:00 William Colen <william.colen@gmail.com
> >:
> > > > >
> > > > > > To create a NER model OpenNLP extracts features from the context,
> > > > things
> > > > > > such as: word prefix and suffix, next word, previous word,
> previous
> > > > word
> > > > > > prefix and suffix, next word prefix and suffix etc.
> > > > > > When you don't configure the feature generator it will apply the
> > > > default:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> > > > > >
> > > > > > Default feature generator:
> > > > > >
> > > > > > AdaptiveFeatureGenerator featureGenerator = *new*
> > > > CachedFeatureGenerator(
> > > > > >          *new* AdaptiveFeatureGenerator[]{
> > > > > >            *new* WindowFeatureGenerator(*new*
> > > TokenFeatureGenerator(),
> > > > 2,
> > > > > > 2),
> > > > > >            *new* WindowFeatureGenerator(*new*
> > > > > > TokenClassFeatureGenerator(true), 2, 2),
> > > > > >            *new* OutcomePriorFeatureGenerator(),
> > > > > >            *new* PreviousMapFeatureGenerator(),
> > > > > >            *new* BigramNameFeatureGenerator(),
> > > > > >            *new* SentenceFeatureGenerator(true, false)
> > > > > >            });
> > > > > >
> > > > > >
> > > > > > These default features should work for most cases (specially
> > > English),
> > > > > but
> > > > > > they of course can be incremented. If you do so, your model will
> > take
> > > > new
> > > > > > features in account. So yes, you are putting the features in your
> > > > model.
> > > > > >
> > > > > > To configure custom features is not easy. I would start with the
> > > > default
> > > > > > and use 10-fold cross-validation and take notes of its
> > effectiveness.
> > > > > Than
> > > > > > change/add a feature, evaluate and take notes. Sometimes a
> feature
> > > that
> > > > > we
> > > > > > are sure would help can destroy the model effectiveness.
> > > > > >
> > > > > > Regards
> > > > > > William
> > > > > >
> > > > > >
> > > > > > 2016-06-29 7:00 GMT-03:00 Damiano Porta <damianoporta@gmail.com
> >:
> > > > > >
> > > > > > > Thank you William! Really appreciated!
> > > > > > >
> > > > > > > I only do not get one point, when you said "You could increment
> > > your
> > > > > > > model using
> > > > > > > Custom Feature Generators" does it mean that i can "put" these
> > > > features
> > > > > > > inside ONE *.bin* file (model) that implement different things,
> > or,
> > > > > name
> > > > > > > finder is one thing and those feature generators other?
> > > > > > >
> > > > > > > Thank you in advance for the clarification.
> > > > > > >
> > > > > > > 2016-06-29 1:23 GMT+02:00 William Colen <
> william.colen@gmail.com
> > >:
> > > > > > >
> > > > > > > > Not exactly. You would create a new NER model to replace
> yours.
> > > > > > > >
> > > > > > > > In this approach you would need a corpus like this:
> > > > > > > >
> > > > > > > > <START:personMale> Pierre Vinken <END> , 61 years old , will
> > join
> > > > the
> > > > > > > board
> > > > > > > > as a nonexecutive director Nov. 29 .
> > > > > > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier
> > > N.V. ,
> > > > > the
> > > > > > > > Dutch publishing group . <START:personFemale> Jessie Robson
> > <END>
> > > > is
> > > > > > > > retiring , she was a board member for 5 years .
> > > > > > > >
> > > > > > > >
> > > > > > > > I am not an English native speaker, so I am not sure if the
> > > example
> > > > > is
> > > > > > > > clear enough. I tried to use Jessie as a neutral name and
> "she"
> > > as
> > > > > > > > disambiguation.
> > > > > > > >
> > > > > > > > With a corpus big enough maybe you could create a model that
> > > > outputs
> > > > > > both
> > > > > > > > classes, personMale and personFemale. To train a model you
> can
> > > > follow
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > > > > > >
> > > > > > > > Let's say your results are not good enough. You could
> increment
> > > > your
> > > > > > > model
> > > > > > > > using Custom Feature Generators (
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > > > > > and
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > > > > > ).
> > > > > > > >
> > > > > > > > One of the implemented featuregen can take a dictionary (
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > > > > > ).
> > > > > > > > You can also implement other convenient FeatureGenerator, for
> > > > > instance
> > > > > > > > regex.
> > > > > > > >
> > > > > > > > Again, it is just a wild guess of how to implement it. I
> don't
> > > know
> > > > > if
> > > > > > it
> > > > > > > > would perform well. I was only thinking how to implement a
> > gender
> > > > ML
> > > > > > > model
> > > > > > > > that uses the surrounding context.
> > > > > > > >
> > > > > > > > Hope I could clarify.
> > > > > > > >
> > > > > > > > William
> > > > > > > >
> > > > > > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <
> > damianoporta@gmail.com
> > > >:
> > > > > > > >
> > > > > > > > > Hi William,
> > > > > > > > > Ok, so you are talking about a kind of pipe where we
> execute:
> > > > > > > > >
> > > > > > > > > 1. NER (personM for example)
> > > > > > > > > 2. Regex (filter to reduce false positives)
> > > > > > > > > 3. Plain dictionary (filter as above) ?
> > > > > > > > >
> > > > > > > > > Yes we can split out model in two for M and F, it is not a
> > big
> > > > > > problem,
> > > > > > > > we
> > > > > > > > > have a database grouped by gender.
> > > > > > > > >
> > > > > > > > > I only have a doubt regarding the use of a dictionary.
> > Because
> > > if
> > > > > we
> > > > > > > use
> > > > > > > > a
> > > > > > > > > dictionary to create the model, we could only use it to
> > detect
> > > > > names
> > > > > > > > > without using NER. No?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2016-06-29 0:10 GMT+02:00 William Colen <
> > > william.colen@gmail.com
> > > > >:
> > > > > > > > >
> > > > > > > > > > Do you plan to use the surrounding context? If yes, maybe
> > you
> > > > > could
> > > > > > > try
> > > > > > > > > to
> > > > > > > > > > split NER in two categories: PersonM and PersonF. Just an
> > > idea,
> > > > > > never
> > > > > > > > > read
> > > > > > > > > > or tried anything like it. You would need a training
> corpus
> > > > with
> > > > > > > these
> > > > > > > > > > classes.
> > > > > > > > > >
> > > > > > > > > > You could add both the plain dictionary and the regex as
> > NER
> > > > > > features
> > > > > > > > as
> > > > > > > > > > well and check how it improves.
> > > > > > > > > >
> > > > > > > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <
> > > > damianoporta@gmail.com
> > > > > >:
> > > > > > > > > >
> > > > > > > > > > > Hello everybody,
> > > > > > > > > > >
> > > > > > > > > > > we built a NER model to find persons (name) inside our
> > > > > documents.
> > > > > > > > > > > We are looking for the best approach to understand if
> the
> > > > name
> > > > > is
> > > > > > > > > > > male/female.
> > > > > > > > > > >
> > > > > > > > > > > Possible solutions:
> > > > > > > > > > > - Plain dictionary?
> > > > > > > > > > > - Regex to check the initial and/letters of the name?
> > > > > > > > > > > - Classifier? (naive bayes? Maxent?)
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Model to detect the gender

Posted by Joern Kottmann <ko...@gmail.com>.
Hello,

there are also other interesting properties, e.g. person title (such as
professor or doctor), job title/position, and company legal form, and many
more for other entity types.

Maybe it would be worth it to build a dedicated component to extract
properties from entities.
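
One way to picture such a component is sketched below. Nothing here is an existing OpenNLP API: the interface EntityPropertyExtractor, the class PersonTitleExtractor, the "title" property key and the tiny title list are all hypothetical names chosen for illustration. The sketch assumes the text is already tokenized and that the name finder has returned a span [start, end) over those tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical contract: map an entity span [start, end) to named properties. */
interface EntityPropertyExtractor {
    Map<String, String> extract(String[] tokens, int start, int end);
}

/** Illustrative implementation: reads a person title from the token before the span. */
final class PersonTitleExtractor implements EntityPropertyExtractor {

    // Assumed, deliberately tiny title list; a real one would come from a dictionary.
    private static final Set<String> TITLES = new HashSet<>(
            Arrays.asList("professor", "prof.", "doctor", "dr."));

    @Override
    public Map<String, String> extract(String[] tokens, int start, int end) {
        Map<String, String> props = new LinkedHashMap<>();
        if (start > 0 && TITLES.contains(tokens[start - 1].toLowerCase(Locale.ROOT))) {
            props.put("title", tokens[start - 1]);
        }
        return props; // empty map when no title precedes the name
    }

    public static void main(String[] args) {
        String[] tokens = {"We", "met", "Professor", "Jane", "Doe", "yesterday"};
        System.out.println(new PersonTitleExtractor().extract(tokens, 3, 5));
    }
}
```

Other properties (job position, company legal form, gender) could be added as further implementations of the same interface, each with its own dictionary or classifier.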

Jörn


Re: Model to detect the gender

Posted by Mondher Bouazizi <mo...@gmail.com>.
Hi,

Sorry for my late reply. I didn't fully understand your last email, but here
is what I meant:

Given a simple dictionary with the following columns:

Name           Type           Gender
Agatha         First          F
John           First          M
Smith          Both           B

where:
- "First" refers to a first name, "Last" (not in the example) refers to a
last name, and "Both" means the name can be either.
- "F" refers to female, "M" refers to male, and "B" refers to both genders.

and given the following two sentences:

1. "It was nice meeting you John. I hope we meet again soon."

2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and felt
she knows something"

In the first example, a dictionary lookup shows that "John" is a male name,
so there is no need to go any further.
However, in the second example, the name "Smith", which is a family name in
our case, fits both males and females. Therefore, we need to extract
features from the surrounding context and perform a classification task.
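
The dictionary-first step can be sketched roughly as follows. This is only an illustration, not OpenNLP code: the class GenderDictionary and its methods are hypothetical names, the three entries are the ones from the table above, and treating a name that is missing from the dictionary like a "B" entry (so that it falls through to the classifier) is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch of the dictionary-first step; all names here are illustrative. */
final class GenderDictionary {

    /** Gender column of the dictionary: F, M, or B (both). */
    enum Gender { F, M, B }

    private final Map<String, Gender> entries = new HashMap<>();

    void add(String name, Gender gender) {
        entries.put(name.toLowerCase(Locale.ROOT), gender);
    }

    /** Returns F or M when the dictionary is decisive, B otherwise. */
    Gender lookup(String name) {
        // Assumption: unknown names are handled like ambiguous ones.
        return entries.getOrDefault(name.toLowerCase(Locale.ROOT), Gender.B);
    }

    /** True when the context classifier has to decide. */
    boolean needsClassifier(String name) {
        return lookup(name) == Gender.B;
    }

    /** The three example rows from the table. */
    static GenderDictionary example() {
        GenderDictionary d = new GenderDictionary();
        d.add("Agatha", Gender.F);
        d.add("John", Gender.M);
        d.add("Smith", Gender.B);
        return d;
    }

    public static void main(String[] args) {
        GenderDictionary d = example();
        for (String name : new String[] {"John", "Smith"}) {
            System.out.println(name + " -> " + d.lookup(name)
                    + ", classifier needed: " + d.needsClassifier(name));
        }
    }
}
```

With this split, the classifier only ever sees the names the dictionary cannot settle, which is the point of the hybrid approach.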
Here are some features that I think would be interesting to use:

. F1: Presence of a male title (e.g. "Mr.") before the name. Values={True, False}
. F2: Presence of a female title (e.g. "Mrs.") before the name. Values={True, False}

. F3: Gender of the first personal pronoun (subject or object form) to the
right of the name. Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F4: Distance between the name and the first personal pronoun to the right
(in words). Values=NUMERIC
. F5: Gender of the second personal pronoun to the right of the name.
Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F6: Distance between the name and the second personal pronoun to the right
(in words). Values=NUMERIC
. F7: Gender of the third personal pronoun to the right of the name.
Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F8: Distance between the name and the third personal pronoun to the right
(in words). Values=NUMERIC

. F9: Gender of the first personal pronoun (subject or object form) to the
left of the name. Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F10: Distance between the name and the first personal pronoun to the left
(in words). Values=NUMERIC
. F11: Gender of the second personal pronoun to the left of the name.
Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F12: Distance between the name and the second personal pronoun to the left
(in words). Values=NUMERIC
. F13: Gender of the third personal pronoun to the left of the name.
Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F14: Distance between the name and the third personal pronoun to the left
(in words). Values=NUMERIC

For the second example, the feature values are:

F1 = False
F2 = True
F3 = UNCERTAIN
F4 = 1
F5 = FEMALE
F6 = 3
F7 = FEMALE
F8 = 4
F9 = UNCERTAIN
F10 = 2
F11 = EMPTY
F12 = 0
F13 = EMPTY
F14 = 0
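
The values above can be reproduced with a small sketch like the one below. It is not OpenNLP code and the class ContextFeatures is a hypothetical name. Several choices are assumptions made so that the output matches the list: the sentence is already split into words with punctuation dropped (titles keep their dot), pronouns and titles come from tiny hard-coded word lists, a pronoun with no gender such as "I" is UNCERTAIN, a missing pronoun is EMPTY with distance 0, and a title counts as part of the name when measuring the distance to the left (which is why the "I" in "I met Mrs. Smith" is at distance 2).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch of the 14 context features; word lists and names are illustrative. */
final class ContextFeatures {

    private static final Set<String> MALE_TITLES = set("mr.", "mr", "sir");
    private static final Set<String> FEMALE_TITLES = set("mrs.", "mrs", "ms.", "ms", "miss");
    private static final Set<String> MALE_PRONOUNS = set("he", "him");
    private static final Set<String> FEMALE_PRONOUNS = set("she", "her");
    private static final Set<String> OTHER_PRONOUNS =
            set("i", "me", "you", "we", "us", "they", "them", "it");

    private static Set<String> set(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    /** Returns F1..F14, in order, for the name at tokens[nameIndex]. */
    static List<String> extract(String[] tokens, int nameIndex) {
        List<String> f = new ArrayList<>();
        String prev = nameIndex > 0 ? tokens[nameIndex - 1].toLowerCase(Locale.ROOT) : "";
        boolean maleTitle = MALE_TITLES.contains(prev);
        boolean femaleTitle = FEMALE_TITLES.contains(prev);
        f.add(maleTitle ? "True" : "False");    // F1
        f.add(femaleTitle ? "True" : "False");  // F2
        // Assumption: a title belongs to the name span when looking left.
        int spanStart = (maleTitle || femaleTitle) ? nameIndex - 1 : nameIndex;
        addPronouns(f, tokens, nameIndex, +1);  // F3..F8, to the right
        addPronouns(f, tokens, spanStart, -1);  // F9..F14, to the left
        return f;
    }

    /** Appends (gender, distance) for up to three pronouns in one direction. */
    private static void addPronouns(List<String> f, String[] tokens, int from, int step) {
        int found = 0;
        for (int i = from + step; i >= 0 && i < tokens.length && found < 3; i += step) {
            String gender = pronounGender(tokens[i]);
            if (gender != null) {
                f.add(gender);
                f.add(String.valueOf(Math.abs(i - from)));
                found++;
            }
        }
        for (; found < 3; found++) {  // pad the missing pronouns
            f.add("EMPTY");
            f.add("0");
        }
    }

    private static String pronounGender(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (MALE_PRONOUNS.contains(t)) return "MALE";
        if (FEMALE_PRONOUNS.contains(t)) return "FEMALE";
        if (OTHER_PRONOUNS.contains(t)) return "UNCERTAIN";
        return null;  // not a personal pronoun
    }

    public static void main(String[] args) {
        String[] tokens = {"Yes", "I", "met", "Mrs.", "Smith", "I", "asked", "her", "her",
                "opinion", "about", "the", "case", "and", "felt", "she", "knows", "something"};
        System.out.println(String.join(" ", extract(tokens, 4)));
    }
}
```

Each feature list, prefixed with its MALE or FEMALE label, would then be one training sample for whatever classifier is used.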

Of course, the choice of features depends on the type of data, and these
features might not work well for some texts, such as ones collected from
Twitter.

I hope this helps.

Best regards

Mondher


On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta <da...@gmail.com>
wrote:

> Hi Mondher,
> could you give me a raw example to understand how i should train the
> classifier model?
>
> Thank you in advance!
> Damiano
>
>
> 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <mo...@gmail.com>:
>
> > Hi,
> >
> > I would recommend a hybrid approach where, in a first step, you use a
> > plain dictionary and then perform the classification if needed.
> >
> > It's straightforward, but I think it would present better performances
> > than just performing a classification task.
> >
> > In the first step you use a dictionary of names along with an attribute
> > specifying whether the name fits for males, females or both. In case the
> > name fits for males or females exclusively, then no need to go any
> > further.
> >
> > If the name fits for both genders, or is a family name etc., a second
> > step is needed where you extract features from the context (surrounding
> > words, etc.) and perform a classification task using any machine
> > learning algorithm.
> >
> > Another way would be using the information itself (whether the name fits
> > for males, females or both) as a feature when you perform the
> > classification.
> >
> > Best regards,
> >
> > Mondher
> >
> > I am not sure
> >
> > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <da...@gmail.com>
> > wrote:
> >
> > > Awesome! Thank you so much WIlliam!
> > >
> > > 2016-06-29 13:36 GMT+02:00 William Colen <wi...@gmail.com>:
> > >
> > > > To create a NER model OpenNLP extracts features from the context,
> > > > things such as: word prefix and suffix, next word, previous word,
> > > > previous word prefix and suffix, next word prefix and suffix etc.
> > > > When you don't configure the feature generator it will apply the
> > > > default:
> > > >
> > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> > > >
> > > > Default feature generator:
> > > >
> > > > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> > > >          new AdaptiveFeatureGenerator[]{
> > > >            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> > > >            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> > > >            new OutcomePriorFeatureGenerator(),
> > > >            new PreviousMapFeatureGenerator(),
> > > >            new BigramNameFeatureGenerator(),
> > > >            new SentenceFeatureGenerator(true, false)
> > > >            });
> > > >
> > > >
> > > > These default features should work for most cases (specially
> English),
> > > but
> > > > they of course can be incremented. If you do so, your model will take
> > new
> > > > features in account. So yes, you are putting the features in your
> > model.
> > > >
> > > > To configure custom features is not easy. I would start with the
> > default
> > > > and use 10-fold cross-validation and take notes of its effectiveness.
> > > Than
> > > > change/add a feature, evaluate and take notes. Sometimes a feature
> that
> > > we
> > > > are sure would help can destroy the model effectiveness.
> > > >
> > > > Regards
> > > > William
> > > >
> > > >
> > > > 2016-06-29 7:00 GMT-03:00 Damiano Porta <da...@gmail.com>:
> > > >
> > > > > Thank you William! Really appreciated!
> > > > >
> > > > > I only do not get one point, when you said "You could increment
> your
> > > > > model using
> > > > > Custom Feature Generators" does it mean that i can "put" these
> > features
> > > > > inside ONE *.bin* file (model) that implement different things, or,
> > > name
> > > > > finder is one thing and those feature generators other?
> > > > >
> > > > > Thank you in advance for the clarification.
> > > > >
> > > > > 2016-06-29 1:23 GMT+02:00 William Colen <wi...@gmail.com>:
> > > > >
> > > > > > Not exactly. You would create a new NER model to replace yours.
> > > > > >
> > > > > > In this approach you would need a corpus like this:
> > > > > >
> > > > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join
> > the
> > > > > board
> > > > > > as a nonexecutive director Nov. 29 .
> > > > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier
> N.V. ,
> > > the
> > > > > > Dutch publishing group . <START:personFemale> Jessie Robson <END>
> > is
> > > > > > retiring , she was a board member for 5 years .
> > > > > >
> > > > > >
> > > > > > I am not an English native speaker, so I am not sure if the
> example
> > > is
> > > > > > clear enough. I tried to use Jessie as a neutral name and "she"
> as
> > > > > > disambiguation.
> > > > > >
> > > > > > With a corpus big enough maybe you could create a model that
> > outputs
> > > > both
> > > > > > classes, personMale and personFemale. To train a model you can
> > follow
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > > > >
> > > > > > Let's say your results are not good enough. You could increment
> > your
> > > > > model
> > > > > > using Custom Feature Generators (
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > > > and
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > > > ).
> > > > > >
> > > > > > One of the implemented featuregen can take a dictionary (
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > > > ).
> > > > > > You can also implement other convenient FeatureGenerator, for
> > > instance
> > > > > > regex.
> > > > > >
> > > > > > Again, it is just a wild guess of how to implement it. I don't
> know
> > > if
> > > > it
> > > > > > would perform well. I was only thinking how to implement a gender
> > ML
> > > > > model
> > > > > > that uses the surrounding context.
> > > > > >
> > > > > > Hope I could clarify.
> > > > > >
> > > > > > William
> > > > > >
> > > > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <damianoporta@gmail.com
> >:
> > > > > >
> > > > > > > Hi William,
> > > > > > > Ok, so you are talking about a kind of pipe where we execute:
> > > > > > >
> > > > > > > 1. NER (personM for example)
> > > > > > > 2. Regex (filter to reduce false positives)
> > > > > > > 3. Plain dictionary (filter as above) ?
> > > > > > >
> > > > > > > Yes we can split out model in two for M and F, it is not a big
> > > > problem,
> > > > > > we
> > > > > > > have a database grouped by gender.
> > > > > > >
> > > > > > > I only have a doubt regarding the use of a dictionary. Because
> if
> > > we
> > > > > use
> > > > > > a
> > > > > > > dictionary to create the model, we could only use it to detect
> > > names
> > > > > > > without using NER. No?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 2016-06-29 0:10 GMT+02:00 William Colen <
> william.colen@gmail.com
> > >:
> > > > > > >
> > > > > > > > Do you plan to use the surrounding context? If yes, maybe you
> > > could
> > > > > try
> > > > > > > to
> > > > > > > > split NER in two categories: PersonM and PersonF. Just an
> idea,
> > > > never
> > > > > > > read
> > > > > > > > or tried anything like it. You would need a training corpus
> > with
> > > > > these
> > > > > > > > classes.
> > > > > > > >
> > > > > > > > You could add both the plain dictionary and the regex as NER
> > > > features
> > > > > > as
> > > > > > > > well and check how it improves.
> > > > > > > >
> > > > > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <
> > damianoporta@gmail.com
> > > >:
> > > > > > > >
> > > > > > > > > Hello everybody,
> > > > > > > > >
> > > > > > > > > we built a NER model to find persons (name) inside our
> > > documents.
> > > > > > > > > We are looking for the best approach to understand if the
> > name
> > > is
> > > > > > > > > male/female.
> > > > > > > > >
> > > > > > > > > Possible solutions:
> > > > > > > > > - Plain dictionary?
> > > > > > > > > - Regex to check the initial and/letters of the name?
> > > > > > > > > - Classifier? (naive bayes? Maxent?)
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Model to detect the gender

Posted by Damiano Porta <da...@gmail.com>.
Hi Mondher,
could you give me a raw example so I can understand how I should train the
classifier model?

Thank you in advance!
Damiano


2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <mo...@gmail.com>:

> Hi,
>
> I would recommend a hybrid approach where, in a first step, you use a plain
> dictionary and then perform the classification if needed.
>
> It's straightforward, but I think it would present better performances than
> just performing a classification task.
>
> In the first step you use a dictionary of names along with an attribute
> specifying whether the name fits for males, females or both. In case the
> name fits for males or females exclusively, then no need to go any further.
>
> If the name fits for both genders, or is a family name etc., a second step
> is needed where you extract features from the context (surrounding words,
> etc.) and perform a classification task using any machine learning
> algorithm.
>
> Another way would be using the information itself (whether the name fits
> for males, females or both) as a feature when you perform the
> classification.
>
> Best regards,
>
> Mondher
>
> I am not sure
>

Re: Model to detect the gender

Posted by Mondher Bouazizi <mo...@gmail.com>.
Hi,

I would recommend a hybrid approach: in a first step you look the name up
in a plain dictionary, and you perform the classification only if needed.

It's straightforward, and I think it would perform better than
classification alone.

In the first step you use a dictionary of names, each with an attribute
specifying whether the name is male, female or both. If the name is
exclusively male or exclusively female, there is no need to go any further.

If the name fits both genders, or is a family name etc., a second step
is needed where you extract features from the context (surrounding words,
etc.) and perform a classification task using any machine learning
algorithm.
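
A minimal sketch of the two steps, assuming plain Java and no OpenNLP
classes. All names here (Gender, HybridGenderResolver, pronounHeuristic) are
hypothetical, and the pronoun check is only a toy stand-in for a trained
classifier:

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical types for illustration; none of this is OpenNLP API.
enum Gender { MALE, FEMALE, BOTH, UNKNOWN }

class HybridGenderResolver {
    // Lower-cased first name -> gender attribute from the name dictionary.
    private final Map<String, Gender> dictionary;
    // Fallback for ambiguous or unknown names; in practice this would wrap
    // a trained model that looks at the surrounding tokens.
    private final Function<List<String>, Gender> contextClassifier;

    HybridGenderResolver(Map<String, Gender> dictionary,
                         Function<List<String>, Gender> contextClassifier) {
        this.dictionary = dictionary;
        this.contextClassifier = contextClassifier;
    }

    Gender resolve(String firstName, List<String> contextTokens) {
        Gender fromDictionary = dictionary.getOrDefault(
                firstName.toLowerCase(Locale.ROOT), Gender.UNKNOWN);
        // Step 1: an unambiguous dictionary hit settles the question.
        if (fromDictionary == Gender.MALE || fromDictionary == Gender.FEMALE) {
            return fromDictionary;
        }
        // Step 2: ambiguous or unknown name, so defer to the classifier.
        return contextClassifier.apply(contextTokens);
    }

    // Toy stand-in for the classifier: it only looks for pronouns.
    static Gender pronounHeuristic(List<String> tokens) {
        for (String token : tokens) {
            String w = token.toLowerCase(Locale.ROOT);
            if (w.equals("she") || w.equals("her")) return Gender.FEMALE;
            if (w.equals("he") || w.equals("his")) return Gender.MALE;
        }
        return Gender.UNKNOWN;
    }
}
```

With "jessie" stored as BOTH, a context containing "she" resolves to FEMALE,
while "pierre" stored as MALE never reaches the classifier at all.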

Another way would be to use the dictionary information itself (whether the
name fits males, females or both) as a feature when you perform the
classification.
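
As a rough illustration of that second option, here is one way the
dictionary attribute could sit next to the context words as string features.
The class name GenderFeatures and the feature naming scheme ("dict=", "w-1=")
are my own assumptions, not an OpenNLP convention:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of OpenNLP.
class GenderFeatures {
    // Builds string features for one name mention: the dictionary attribute
    // of the name plus the tokens inside a +/- window around it.
    static List<String> extract(String[] tokens, int nameIndex, int window,
                                Map<String, String> dictionary) {
        List<String> features = new ArrayList<>();
        String name = tokens[nameIndex].toLowerCase(Locale.ROOT);
        // The dictionary lookup becomes just another feature.
        features.add("dict=" + dictionary.getOrDefault(name, "UNKNOWN"));
        for (int offset = -window; offset <= window; offset++) {
            int i = nameIndex + offset;
            // Skip the name itself and positions outside the sentence.
            if (offset == 0 || i < 0 || i >= tokens.length) continue;
            String sign = offset > 0 ? "+" : "";
            features.add("w" + sign + offset + "="
                    + tokens[i].toLowerCase(Locale.ROOT));
        }
        return features;
    }
}
```

The resulting strings could then be written after the MALE/FEMALE label on
each training line, so the classifier learns how much weight the dictionary
deserves instead of having it hard-coded.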

Best regards,

Mondher

I am not sure

On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <da...@gmail.com>
wrote:

> Awesome! Thank you so much William!
>

Re: Model to detect the gender

Posted by Damiano Porta <da...@gmail.com>.
Awesome! Thank you so much William!

2016-06-29 13:36 GMT+02:00 William Colen <wi...@gmail.com>:

> To create a NER model OpenNLP extracts features from the context, things
> such as: word prefix and suffix, next word, previous word, previous word
> prefix and suffix, next word prefix and suffix etc.
> When you don't configure the feature generator it will apply the default:
>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
>
> Default feature generator:
>
> AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
>          new AdaptiveFeatureGenerator[]{
>            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
>            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
>            new OutcomePriorFeatureGenerator(),
>            new PreviousMapFeatureGenerator(),
>            new BigramNameFeatureGenerator(),
>            new SentenceFeatureGenerator(true, false)
>            });
>
>
> These default features should work for most cases (especially English), but
> they can of course be extended. If you do so, your model will take the new
> features into account. So yes, you are putting the features in your model.
>
> Configuring custom features is not easy. I would start with the default
> and use 10-fold cross-validation and take notes of its effectiveness. Then
> change/add a feature, evaluate and take notes. Sometimes a feature that we
> are sure would help can destroy the model effectiveness.
>
> Regards
> William
>
>

Re: Model to detect the gender

Posted by William Colen <wi...@gmail.com>.
To create a NER model OpenNLP extracts features from the context, things
such as: word prefix and suffix, next word, previous word, previous word
prefix and suffix, next word prefix and suffix etc.
When you don't configure the feature generator it will apply the default:
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api

Default feature generator:

AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
         new AdaptiveFeatureGenerator[]{
           new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
           new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
           new OutcomePriorFeatureGenerator(),
           new PreviousMapFeatureGenerator(),
           new BigramNameFeatureGenerator(),
           new SentenceFeatureGenerator(true, false)
           });


These default features should work for most cases (especially English), but
they can of course be extended. If you do so, your model will take the new
features into account. So yes, you are putting the features in your model.

Configuring custom features is not easy. I would start with the default
and use 10-fold cross-validation and take notes of its effectiveness. Then
change/add a feature, evaluate and take notes. Sometimes a feature that we
are sure would help can destroy the model effectiveness.
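
To make the procedure concrete, here is a plain-Java sketch of k-fold
cross-validation. OpenNLP ships its own TokenNameFinderCrossValidator for
name finder models; the code below does not use it and only illustrates the
idea. KFold and TrainAndScore are hypothetical names, and the callback
stands for "train on these samples, evaluate on those, return a score such
as F1":

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of OpenNLP.
class KFold {
    // Callback that trains on trainIdx, evaluates on testIdx, returns a score.
    interface TrainAndScore {
        double run(List<Integer> trainIdx, List<Integer> testIdx);
    }

    // Splits sample indices 0..n-1 into k folds; index i goes to fold i % k.
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) result.add(new ArrayList<>());
        for (int i = 0; i < n; i++) result.get(i % k).add(i);
        return result;
    }

    // Each fold is held out once for testing while the others are used for
    // training; the returned value is the mean score over the k runs.
    static double crossValidate(int n, int k, TrainAndScore scorer) {
        List<List<Integer>> folds = folds(n, k);
        double total = 0.0;
        for (int f = 0; f < k; f++) {
            List<Integer> train = new ArrayList<>();
            for (int g = 0; g < k; g++) {
                if (g != f) train.addAll(folds.get(g));
            }
            total += scorer.run(train, folds.get(f));
        }
        return total / k;
    }
}
```

Run it once per feature configuration with k = 10 and the same samples, and
compare the averages; a new feature is only worth keeping if the mean score
improves.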

Regards
William


2016-06-29 7:00 GMT-03:00 Damiano Porta <da...@gmail.com>:

> Thank you William! Really appreciated!
>
> I only do not get one point, when you said "You could increment your
> model using
> Custom Feature Generators" does it mean that i can "put" these features
> inside ONE *.bin* file (model) that implement different things, or, name
> finder is one thing and those feature generators other?
>
> Thank you in advance for the clarification.
>

Re: Model to detect the gender

Posted by Damiano Porta <da...@gmail.com>.
Thank you William! Really appreciated!

There is only one point I do not get. When you said "You could increment your
model using Custom Feature Generators", does it mean that I can put these
features inside ONE *.bin* file (the model), or are the name finder and the
feature generators separate things?

Thank you in advance for the clarification.

2016-06-29 1:23 GMT+02:00 William Colen <wi...@gmail.com>:

> Not exactly. You would create a new NER model to replace yours.
>
> In this approach you would need a corpus like this:
>
> <START:personMale> Pierre Vinken <END> , 61 years old , will join the board
> as a nonexecutive director Nov. 29 .
> Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> retiring , she was a board member for 5 years .
>
>
> I am not an English native speaker, so I am not sure if the example is
> clear enough. I tried to use Jessie as a neutral name and "she" as
> disambiguation.
>
> With a corpus big enough maybe you could create a model that outputs both
> classes, personMale and personFemale. To train a model you can follow
>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
>
> Let's say your results are not good enough. You could increment your model
> using Custom Feature Generators (
>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> and
>
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> ).
>
> One of the implemented featuregen can take a dictionary (
>
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> ).
> You can also implement other convenient FeatureGenerator, for instance
> regex.
>
> Again, it is just a wild guess of how to implement it. I don't know if it
> would perform well. I was only thinking how to implement a gender ML model
> that uses the surrounding context.
>
> Hope I could clarify.
>
> William

Re: Model to detect the gender

Posted by William Colen <wi...@gmail.com>.
Not exactly. You would create a new NER model to replace yours.

In this approach you would need a corpus like this:

<START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .


I am not a native English speaker, so I am not sure the example is clear
enough. I tried to use Jessie as a gender-neutral name and "she" as the
disambiguating context.
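
As a rough illustration of the annotation format above, here is a minimal
Python sketch that pulls the typed spans out of annotated sentences and counts
how many examples each class has. It is not OpenNLP code; the function names
extract_names and count_types are hypothetical helpers made up for this
example, and it assumes one whitespace-tokenized sentence per line.

```python
import re
from collections import Counter

# Matches one annotated span: <START:type> token token ... <END>
SPAN = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")

def extract_names(sentence):
    """Return (type, name) pairs found in one annotated sentence."""
    return [(m.group(1), m.group(2)) for m in SPAN.finditer(sentence)]

def count_types(sentences):
    """Count how many annotated spans each type has across all sentences."""
    counts = Counter()
    for sentence in sentences:
        for span_type, _name in extract_names(sentence):
            counts[span_type] += 1
    return counts

corpus = [
    "<START:personMale> Pierre Vinken <END> , 61 years old , will join "
    "the board as a nonexecutive director Nov. 29 .",
    "Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , "
    "the Dutch publishing group .",
    "<START:personFemale> Jessie Robson <END> is retiring , she was a "
    "board member for 5 years .",
]

for span_type, total in sorted(count_types(corpus).items()):
    print(span_type, total)
```

A count like this is a quick way to check whether the two classes are
reasonably balanced before training.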

With a big enough corpus, maybe you could create a model that outputs both
classes, personMale and personFemale. To train a model you can follow
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training

Let's say your results are not good enough. You could increment your model
using Custom Feature Generators (
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
and
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
).

One of the implemented feature generators can take a dictionary (
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
).
You can also implement other convenient feature generators, for instance
one based on regex.
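
To make the idea concrete, here is a minimal Python sketch of what "dictionary
and regex as features" could look like. It does not use the OpenNLP API: the
name lists, the suffix patterns and the token_features function are all
hypothetical, and the suffix rule (names ending in "a" or "o") is only an
assumed heuristic for illustration. The point is that the dictionary and the
regex emit hints for the classifier, next to the surrounding tokens, instead of
making the final decision themselves.

```python
import re

# Hypothetical toy dictionaries; a real system would load much larger lists.
MALE_NAMES = {"pierre", "marco"}
FEMALE_NAMES = {"maria", "anna"}

# Assumed suffix heuristics, for illustration only.
FEMALE_SUFFIX = re.compile(r"a$", re.IGNORECASE)
MALE_SUFFIX = re.compile(r"o$", re.IGNORECASE)

def token_features(tokens, index):
    """Return string features for tokens[index], including its neighbours."""
    token = tokens[index]
    lower = token.lower()
    features = []
    # Dictionary hints: present only when the token is a known first name.
    if lower in MALE_NAMES:
        features.append("dict=male")
    if lower in FEMALE_NAMES:
        features.append("dict=female")
    # Regex hints based on the last letter of the token.
    if FEMALE_SUFFIX.search(token):
        features.append("suffix=female")
    if MALE_SUFFIX.search(token):
        features.append("suffix=male")
    # Surrounding context: previous and next token, when they exist.
    if index > 0:
        features.append("prev=" + tokens[index - 1].lower())
    if index + 1 < len(tokens):
        features.append("next=" + tokens[index + 1].lower())
    return features

print(token_features(["Mr", ".", "Marco", "Rossi"], 2))
print(token_features(["Jessie", "is", "retiring"], 0))
```

For a name that is in no dictionary and matches no pattern, such as Jessie,
only the context features remain, so the model has to rely on the surrounding
tokens.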

Again, this is just a wild guess at how to implement it. I don't know if it
would perform well. I was only thinking about how to implement a gender ML
model that uses the surrounding context.

Hope this clarifies.

William

2016-06-28 19:15 GMT-03:00 Damiano Porta <da...@gmail.com>:

> Hi William,
> Ok, so you are talking about a kind of pipe where we execute:
>
> 1. NER (personM for example)
> 2. Regex (filter to reduce false positives)
> 3. Plain dictionary (filter as above) ?
>
> Yes we can split out model in two for M and F, it is not a big problem, we
> have a database grouped by gender.
>
> I only have a doubt regarding the use of a dictionary. Because if we use a
> dictionary to create the model, we could only use it to detect names
> without using NER. No?

Re: Model to detect the gender

Posted by Damiano Porta <da...@gmail.com>.
Hi William,
Ok, so you are talking about a kind of pipeline where we execute:

1. NER (personM for example)
2. Regex (filter to reduce false positives)
3. Plain dictionary (filter as above) ?

Yes, we can split our model in two for M and F. It is not a big problem, since
we have a database grouped by gender.

My only doubt is about the use of a dictionary. If we use a dictionary to
create the model, couldn't we just use the dictionary to detect names,
without using NER?
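
The three steps listed above could be sketched roughly like this in Python.
Everything here is hypothetical and only illustrates the filtering idea: the
candidates list stands in for the output of the NER step, NAME_SHAPE is an
assumed pattern for "looks like a person name", and KNOWN_MALE is a toy
dictionary.

```python
import re

# Hypothetical stand-in for NER output: (candidate name, predicted label).
candidates = [
    ("Marco Rossi", "personM"),
    ("ACME s.r.l.", "personM"),
    ("Xyzzy Qwerty", "personM"),
]

# Assumed shape of a person name: capitalized words separated by spaces.
NAME_SHAPE = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)*$")

# Hypothetical toy dictionary of male first names.
KNOWN_MALE = {"marco", "pierre"}

def regex_filter(spans):
    """Step 2: drop candidates that do not look like a person name."""
    return [span for span in spans if NAME_SHAPE.match(span[0])]

def dictionary_filter(spans, first_names):
    """Step 3: keep candidates whose first token is a known first name."""
    return [span for span in spans
            if span[0].split()[0].lower() in first_names]

def pipeline(spans):
    """Apply the regex filter, then the dictionary filter."""
    return dictionary_filter(regex_filter(spans), KNOWN_MALE)

print(pipeline(candidates))
```

Note that the dictionary step throws away every name that is not in the list,
even if the NER step and the regex accepted it, which is the limitation the
doubt above is about.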



2016-06-29 0:10 GMT+02:00 William Colen <wi...@gmail.com>:

> Do you plan to use the surrounding context? If yes, maybe you could try to
> split NER in two categories: PersonM and PersonF. Just an idea, never read
> or tried anything like it. You would need a training corpus with these
> classes.
>
> You could add both the plain dictionary and the regex as NER features as
> well and check how it improves.

Re: Model to detect the gender

Posted by William Colen <wi...@gmail.com>.
Do you plan to use the surrounding context? If yes, maybe you could try to
split the NER into two categories: PersonM and PersonF. It is just an idea; I
have never read about or tried anything like it. You would need a training
corpus annotated with these classes.

You could also add both the plain dictionary and the regex as NER features
and check whether they improve the results.

2016-06-28 18:56 GMT-03:00 Damiano Porta <da...@gmail.com>:

> Hello everybody,
>
> we built a NER model to find persons (name) inside our documents.
> We are looking for the best approach to understand if the name is
> male/female.
>
> Possible solutions:
> - Plain dictionary?
> - Regex to check the initial and/letters of the name?
> - Classifier? (naive bayes? Maxent?)
>
> Thanks
>