Posted to user@mahout.apache.org by Patrick Collins <pa...@ready2sign.com> on 2011/04/29 08:17:54 UTC

Re: Trying to determine the best ML algorithm to use.

Hi all and Ted,

A month after you posted a potential solution for using SGD to
classify fields in contracts, I am still working on the feature
vectors for the contract fields classifier. It is a big job, so I have yet to
run my first classifier test :)

I have come up with my first guess at what will be the feature vectors and I
have some questions for you.

Firstly, see the table below, which outlines the feature vectors (it is
transposed for readability, so rows here are features and columns are training
rows). This table is a vectorized version of the text below. My
pre-processing code should be able to extract the vectors outlined in this
table. I have included 4 sample training rows.

 #  feature_name             Training Row 1     Training Row 2  Training Row 3  Training Row 4  etc.
 1  target                   <<Signature>>      <<Name>>        <<Title>>       <<Address>>
 2  possible_field_length    50                 120             118             115
 3  has_line                 1                  0               0               0
 4  aligned_text_above_left  COMPANY 1 Pty Ltd                  By:             Title:
 5  white_space_above        2                  0               0               0
 6  label_left                                  By:             Title:          address:
 7  label_below_left         By:                Title:          address:        COMPANY 2 LLC
 8  label_below_center
 9  label_below_distance     1                  1               1               2
10  above_field_2                                               <<Signature>>   <<Name>>
11  above_field_2_distance                                      2               2
12  above_field_1                               <<Signature>>   <<Name>>        <<Title>>
13  above_field_1_distance                                      1               1
14  below_field_1            <<Name>>           <<Title>>       <<Address>>     <<Signature>>
15  below_field_1_distance   1                  1               1               5
16  below_field_2            <<Title>>          <<Address>>     <<Signature>>   <<Name>>
17  below_field_2_distance   2                  2               5               6



I have a few questions before I start diving into testing the classifier and
finalizing my pre-processing code.

1) there are phrases which clearly contain company names. How do I get the
classifier to recognize them as company names so that it can classify
better?

2) How can I "weight" the fields given what I believe heuristically to be
more important? For instance, the label_left field should be the strongest
predictor of the field type. Should I just leave this to the classifier to
figure out?

3) Note all of the distances are "line distances" from the field. 3 lines
above, 1 line below. These should be able to be used as weights. The closer
the feature, the more important it should be. How should I handle that?

4) Finally, it is clear that a beam decoder or other backtracking algorithm
is needed, per your original suggestion. I have not come across any practical
implementations or sample code for this; I have only found highly
theoretical papers which I am unable to interpret. If you have seen any open
source implementations, I would appreciate you pointing me in the right
direction.


The original source text for the feature vectors extracted above:
=======================8<=========================

IN WITNESS WHEREOF, the parties have entered into this Agreement on the date
first above

written.


COMPANY 1 Pty Ltd



___________________________________________

By:

Title:

address:


COMPANY 2, LLC



___________________________________________

By:

Title:

address:



Acknowledged and accepted by:


COMPANY 3 Ventures Inc


___________________________________________

By:

Title:

address:
=======================8<=========================

Re: Trying to determine the best ML algorithm to use.

Posted by Ted Dunning <te...@gmail.com>.

On Fri, Apr 29, 2011 at 11:18 AM, Patrick Collins <
patrick.collins@ready2sign.com> wrote:

> > > 3) Note all of the distances are "line distances" from the field. 3 lines
> > > above, 1 line below. These should be able to be used as weights. The closer
> > > the feature, the more important it should be. How should I handle that?
> > >
> > Turn it into a continuous feature, using the distance as the variable.
> >
>
> OK. So should this continuous feature be inverted? Instead of 1 being
> closest and 3 being furthest, I would have 1 and .3333, so the closer the
> field, the higher the number. Or does it not matter whether the number is
> big or small?
>

Try both and see which one the classifier prefers.  It is also worth adding
the log of the distance as a feature.
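
As a minimal sketch of the three encodings discussed here (raw distance, inverse distance and log distance), the helper below computes all three so they can be fed to the classifier side by side. The class and method names are hypothetical and not part of Mahout; it also assumes distances are whole line counts of at least 1, as in the sample rows.

```java
import java.util.Arrays;

/** Hypothetical helper: three continuous encodings of one line distance. */
final class DistanceFeatures {
    private DistanceFeatures() {}

    /** Returns {raw, inverse, log} for a line distance of at least 1. */
    static double[] encode(int lineDistance) {
        if (lineDistance < 1) {
            throw new IllegalArgumentException("distance must be >= 1, got " + lineDistance);
        }
        double d = lineDistance;
        return new double[] {
            d,           // raw: larger means farther away
            1.0 / d,     // inverse: larger means closer (1 -> 1.0, 3 -> 0.333...)
            Math.log(d)  // log: compresses large distances (1 -> 0.0)
        };
    }

    public static void main(String[] args) {
        // Distances that appear in the sample training rows.
        for (int d : new int[] {1, 2, 5, 6}) {
            System.out.println(d + " -> " + Arrays.toString(encode(d)));
        }
    }
}
```

Missing distances (the empty cells in the table) are not handled here; how to encode them is a separate decision.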

>
> >
> > > 4) Finally it is clear that a beam decoder or other backtracking algorithm
> > > is needed per your original suggestion. I have not come across any
> > > practical implementations or sample code of this, I have only come across
> > > super theoretical papers which I am unable to interpret. If you have seen
> > > any open source implementations I would appreciate you pointing me in the
> > > right direction.
> > >
> > >
> > Didn't know the context, so I could not answer the question.
> >
>
> Probably a question for Ted since he originally suggested it to me.
>

The issue here is that you have dependencies of one classification on
another.  This is really important because you might have two fields that
could be, say, signature and title.  This leads to four possibilities.  But
your model should know that two titles or two signatures are implausible so
really there are only two possibilities (title+sig or sig+title).
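
The counting argument above can be checked with a small sketch: enumerate every way to label two fields with two labels, then drop the combinations that use a label twice. The class below is hypothetical, and the hard "no repeats" filter is a simplifying assumption; a trained model would more likely give such combinations a very low score than forbid them outright.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Hypothetical helper: enumerate joint label assignments for a few fields. */
final class JointAssignments {
    private JointAssignments() {}

    /** Every assignment of one label per field (labels^fields combinations). */
    static List<List<String>> enumerate(List<String> labels, int fields) {
        List<List<String>> out = new ArrayList<List<String>>();
        out.add(new ArrayList<String>());
        for (int i = 0; i < fields; i++) {
            List<List<String>> next = new ArrayList<List<String>>();
            for (List<String> prefix : out) {
                for (String label : labels) {
                    List<String> extended = new ArrayList<String>(prefix);
                    extended.add(label);
                    next.add(extended);
                }
            }
            out = next;
        }
        return out;
    }

    /** Keeps only assignments in which no label is used twice. */
    static List<List<String>> withoutRepeats(List<List<String>> assignments) {
        List<List<String>> out = new ArrayList<List<String>>();
        for (List<String> a : assignments) {
            if (new HashSet<String>(a).size() == a.size()) {
                out.add(a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> all = enumerate(Arrays.asList("title", "signature"), 2);
        System.out.println(all.size() + " raw combinations");
        System.out.println("plausible: " + withoutRepeats(all));
    }
}
```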

Normally it is customary to have fields only depend on previous assignments
in the reading order of the document.  That makes the evaluation of scores
at each point easier.  Unfortunately, that could lead to problems where the
best choice at one point leads to problems when we try to do the next
choice.  That would require that we back up and redo the earlier choice so
that there is a decent option for the later choice.  This requires some kind
of search over the space of all choices.

One way to do this is a simple best-first search.  In this method, a
priority queue is kept with scores and partial sequences of
assignments.  Scores are typically log likelihoods or something similar.
Search consists of taking the best element from the queue and inserting all
of its possible extensions with scores updated appropriately.  The queue's
size is normally bounded, so we lose the really lousy options at some point
and only look at the promising ones.  The problem with this approach is that
scores always look better for shorter sequences of choices
(probabilistic approaches tend to be optimistic about how well they can do
in the future).  This turns the search into a breadth-first search, which
gets really expensive.  If it works, great.  For lots of applications, such
as speech recognition, it fails miserably.
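
The following is a minimal sketch of the bounded best-first search described above, not an existing Mahout API. The `Scorer` interface is a hypothetical hook standing in for the trained classifier: it is assumed to return the log-probability of a label at a position given the labels already assigned above it, with negative infinity meaning "impossible".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical best-first decoder over partial label sequences. */
final class BestFirstDecoder {
    /** Assumed hook: log-probability of label at position, given earlier labels. */
    interface Scorer {
        double logProb(List<String> previous, int position, String label);
    }

    /** A partial sequence of assignments and its accumulated log score. */
    static final class Partial {
        final List<String> labels;
        final double score;

        Partial(List<String> labels, double score) {
            this.labels = labels;
            this.score = score;
        }
    }

    // Orders partial sequences so that the highest score comes first.
    private static final Comparator<Partial> BEST_FIRST =
        (a, b) -> Double.compare(b.score, a.score);

    static List<String> decode(List<String> labelSet, int fields, Scorer scorer, int maxQueue) {
        PriorityQueue<Partial> queue = new PriorityQueue<Partial>(BEST_FIRST);
        queue.add(new Partial(new ArrayList<String>(), 0.0));
        while (!queue.isEmpty()) {
            Partial best = queue.poll();
            if (best.labels.size() == fields) {
                // Log-probabilities are never positive, so the first complete
                // sequence taken from the queue is the best one not yet pruned.
                return best.labels;
            }
            for (String label : labelSet) {
                double step = scorer.logProb(best.labels, best.labels.size(), label);
                if (step == Double.NEGATIVE_INFINITY) {
                    continue; // impossible extension, e.g. a second signature
                }
                List<String> extended = new ArrayList<String>(best.labels);
                extended.add(label);
                queue.add(new Partial(extended, best.score + step));
            }
            if (queue.size() > maxQueue) {
                // Bound the queue by dropping the lowest-scoring entries.
                List<Partial> kept = new ArrayList<Partial>(queue);
                kept.sort(BEST_FIRST);
                queue.clear();
                queue.addAll(kept.subList(0, maxQueue));
            }
        }
        return Collections.emptyList(); // nothing survived
    }
}
```

Because shorter sequences have accumulated fewer negative terms, they tend to sit at the front of the queue, which is the breadth-first behaviour the paragraph above warns about.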

Another approach is to insert states which have scores that are adjusted to
be how good they are likely to be at the end of the process.  This turns the
process into something like a modified depth first search and is the basis
of beam search.
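
For comparison, here is a sketch of the common fixed-width form of beam search. It avoids the bias toward short sequences in a simpler way than the score adjustment described above: all partial sequences in the beam have the same length, and only the best `beamWidth` of them are kept at each step. This is an assumption about which variant is useful here, not necessarily the exact one meant above; the `Scorer` hook is hypothetical, as before.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical fixed-width beam decoder. */
final class BeamDecoder {
    /** Assumed hook: log-probability of label at position, given earlier labels. */
    interface Scorer {
        double logProb(List<String> previous, int position, String label);
    }

    static final class Partial {
        final List<String> labels;
        final double score;

        Partial(List<String> labels, double score) {
            this.labels = labels;
            this.score = score;
        }
    }

    private static final Comparator<Partial> BEST_FIRST =
        (a, b) -> Double.compare(b.score, a.score);

    static List<String> decode(List<String> labelSet, int fields, Scorer scorer, int beamWidth) {
        List<Partial> beam = new ArrayList<Partial>();
        beam.add(new Partial(new ArrayList<String>(), 0.0));
        for (int pos = 0; pos < fields; pos++) {
            // Extend every sequence in the beam by every possible label.
            List<Partial> candidates = new ArrayList<Partial>();
            for (Partial p : beam) {
                for (String label : labelSet) {
                    double step = scorer.logProb(p.labels, pos, label);
                    if (step == Double.NEGATIVE_INFINITY) {
                        continue;
                    }
                    List<String> extended = new ArrayList<String>(p.labels);
                    extended.add(label);
                    candidates.add(new Partial(extended, p.score + step));
                }
            }
            // Keep only the best beamWidth sequences of this length.
            candidates.sort(BEST_FIRST);
            int keep = Math.min(beamWidth, candidates.size());
            beam = new ArrayList<Partial>(candidates.subList(0, keep));
        }
        if (beam.isEmpty()) {
            return new ArrayList<String>();
        }
        return beam.get(0).labels;
    }
}
```

With `beamWidth` set to 1 this degenerates into a greedy left-to-right decoder, which cannot back up; wider beams keep alternatives alive so that an early second-best choice can still win overall.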

Both of these approaches assume causal dependencies.  Your feature set as
stated is non-causal because fields depend up and down the document.  You
will get a distinctly more precise model with this, but the decoding problem
is harder.  One way to decode with a non-causal model is to use an
evolutionary algorithm of some kind to search the space of assignments.  My
guess is that one of Metropolis-Hastings, simulated annealing or just a
simple meta-evolutionary algorithm would work well with this.  The
meta-evolution will be the simplest to implement.  Metropolis will probably
be the most accurate.
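
As one possible starting point for the non-causal case, the sketch below runs simulated annealing with the Metropolis acceptance rule over complete assignments. Everything in it is an assumption for illustration: the `JointScorer` hook (a log-score for a whole assignment, free to look at fields both above and below), the proposal that re-draws one field's label, and the geometric cooling schedule. Scores are assumed to be finite; use large penalties rather than negative infinity for implausible assignments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical simulated-annealing decoder over complete assignments. */
final class AnnealingDecoder {
    /** Assumed hook: finite log-score of a complete assignment. */
    interface JointScorer {
        double logScore(List<String> assignment);
    }

    static List<String> decode(List<String> labelSet, int fields, JointScorer scorer,
                               int iterations, double startTemp, double cooling, long seed) {
        List<String> current = new ArrayList<String>();
        if (fields == 0) {
            return current;
        }
        Random rnd = new Random(seed);
        // Start from a random assignment.
        for (int i = 0; i < fields; i++) {
            current.add(labelSet.get(rnd.nextInt(labelSet.size())));
        }
        double currentScore = scorer.logScore(current);
        List<String> best = new ArrayList<String>(current);
        double bestScore = currentScore;
        double temp = startTemp;

        for (int it = 0; it < iterations; it++) {
            // Symmetric proposal: re-draw the label of one randomly chosen field.
            List<String> proposal = new ArrayList<String>(current);
            proposal.set(rnd.nextInt(fields), labelSet.get(rnd.nextInt(labelSet.size())));
            double proposalScore = scorer.logScore(proposal);

            // Metropolis rule: always accept improvements, and accept a worse
            // proposal with probability exp(delta / temp).
            double delta = proposalScore - currentScore;
            if (delta >= 0 || rnd.nextDouble() < Math.exp(delta / temp)) {
                current = proposal;
                currentScore = proposalScore;
                if (currentScore > bestScore) {
                    best = new ArrayList<String>(current);
                    bestScore = currentScore;
                }
            }
            // Geometric cooling with a small floor to avoid dividing by zero.
            temp = Math.max(1e-3, temp * cooling);
        }
        return best;
    }
}
```

Holding the temperature fixed at 1 (`startTemp = 1.0`, `cooling = 1.0`) turns this into a plain Metropolis sampler with a symmetric proposal; cooling makes it progressively greedier.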

Re: Trying to determine the best ML algorithm to use.

Posted by Patrick Collins <pa...@ready2sign.com>.

Hi Stanley,
Thanks for your input; it is much appreciated. Some follow-up questions are
below.

On Fri, Apr 29, 2011 at 1:28 AM, Stanley Xu <we...@gmail.com> wrote:

> On Fri, Apr 29, 2011 at 2:17 PM, Patrick Collins <
> patrick.collins@ready2sign.com> wrote:
>
> > 1) there are phrases which clearly contain company names. How do I get the
> > classifier to recognize them as company names so that it can classify
> > better?
> >
> You could use a StaticWordValueEncoder to add the feature to the vector, and
> give the encoder the name "company". (I guess the company name should be a
> categorical feature, shouldn't it?)
>
> FeatureVectorEncoder encoder = new StaticWordValueEncoder("company");
> encoder.addToVector(companyName, vector);
>

Hmmm... interesting. I think I need to do two passes: one which determines
whether the nearby text is a company name and, if so, passes that into
the second learner marked as "company name". Cue words that might
indicate a company name include GmbH, Venture
Partners, LLC, Partners, Pty Ltd, etc. Or, again, should I leave this to the
primary classifier to figure out?
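
A first pass like the one described above could start as a plain cue-word check. The sketch below is only that, under stated assumptions: the class is hypothetical, the cue list is taken from the message above plus "Inc" from the sample contract text, and matching is on whole ASCII words after lower-casing and stripping punctuation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical first-pass check for company-name cue words. */
final class CompanyNameCues {
    private CompanyNameCues() {}

    // Cues from the message above, plus "inc" from the sample contract text.
    private static final List<String> CUES =
        Arrays.asList("gmbh", "venture partners", "llc", "partners", "pty ltd", "inc");

    static boolean looksLikeCompanyName(String text) {
        // Lower-case, turn every run of non-alphanumeric characters into one
        // space, and pad with spaces so cues only match whole words.
        String norm = " "
            + text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim()
            + " ";
        for (String cue : CUES) {
            if (norm.contains(" " + cue + " ")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"COMPANY 2, LLC", "By:", "COMPANY 3 Ventures Inc"}) {
            System.out.println(s + " -> " + looksLikeCompanyName(s));
        }
    }
}
```

The boolean result could then be passed to the main classifier as one extra feature, leaving the classifier to decide how much weight it deserves.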



>
> > 2) How can I "weight" the fields given what I believe heuristically to be
> > more important? For instance, the label_left field should be the strongest
> > predictor of the field type. Should I just leave this to the classifier to
> > figure out?
> >
> Normally, you could leave it to the classifier; it should be
> able to figure that out from the training data. It will have a large beta
> (weight) for the label_left feature.
>

OK. Agreed.


>
>
> > 3) Note all of the distances are "line distances" from the field. 3 lines
> > above, 1 line below. These should be able to be used as weights. The closer
> > the feature, the more important it should be. How should I handle that?
> >
> Turn it into a continuous feature, using the distance as the variable.
>

OK. So should this continuous feature be inverted? Instead of 1 being
closest and 3 being furthest, I would have 1 and .3333, so the closer the
field, the higher the number. Or does it not matter whether the number is
big or small?


>
> > 4) Finally it is clear that a beam decoder or other backtracking algorithm
> > is needed per your original suggestion. I have not come across any
> > practical implementations or sample code of this, I have only come across
> > super theoretical papers which I am unable to interpret. If you have seen
> > any open source implementations I would appreciate you pointing me in the
> > right direction.
> >
> >
> Didn't know the context, so I could not answer the question.
>

Probably a question for Ted since he originally suggested it to me.



Re: Trying to determine the best ML algorithm to use.

Posted by Stanley Xu <we...@gmail.com>.

On Fri, Apr 29, 2011 at 2:17 PM, Patrick Collins <
patrick.collins@ready2sign.com> wrote:

> 1) there are phrases which clearly contain company names. How do I get the
> classifier to recognize them as company names so that it can classify
> better?
>
You could use a StaticWordValueEncoder to add the feature to the vector, and
give the encoder the name "company". (I guess the company name should be a
categorical feature, shouldn't it?)

FeatureVectorEncoder encoder = new StaticWordValueEncoder("company");
encoder.addToVector(companyName, vector);
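
To show roughly what such an encoder does, here is a stand-alone sketch of the hashing idea: the encoder name and the word are hashed together to pick a slot in a fixed-size vector, and a weight is added there. This is an illustration only, not Mahout's implementation; the class name, the use of `String.hashCode`, and the plain `double[]` in place of a Mahout `Vector` are all assumptions.

```java
/** Hypothetical illustration of a hashed word encoder (not Mahout's code). */
final class HashedWordEncoderSketch {
    private final String name;

    HashedWordEncoderSketch(String name) {
        this.name = name;
    }

    /** Adds weight 1.0 at the slot chosen by hashing "name:word". */
    void addToVector(String word, double[] vector) {
        // floorMod keeps the slot non-negative even when hashCode is negative.
        int slot = Math.floorMod((name + ":" + word).hashCode(), vector.length);
        vector[slot] += 1.0;
    }

    public static void main(String[] args) {
        double[] vector = new double[100];
        HashedWordEncoderSketch encoder = new HashedWordEncoderSketch("company");
        encoder.addToVector("COMPANY 1 Pty Ltd", vector);
        for (int i = 0; i < vector.length; i++) {
            if (vector[i] != 0.0) {
                System.out.println("slot " + i + " = " + vector[i]);
            }
        }
    }
}
```

Including the encoder name in the hash is what lets the same string land in different slots when it appears in different fields, so "By:" as label_left and "By:" as label_below_left can get separate weights.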


> 2) How can I "weight" the fields given what I believe heuristically to be
> more important? For instance, the label_left field should be the strongest
> predictor of the field type. Should I just leave this to the classifier to
> figure out?
>
Normally, you could leave it to the classifier; it should be
able to figure that out from the training data. It will have a large beta
(weight) for the label_left feature.


> 3) Note all of the distances are "line distances" from the field. 3 lines
> above, 1 line below. These should be able to be used as weights. The closer
> the feature, the more important it should be. How should I handle that?
>
Turn it into a continuous feature, using the distance as the variable.


> 4) Finally it is clear that a beam decoder or other backtracking algorithm
> is needed per your original suggestion. I have not come across any
> practical implementations or sample code of this, I have only come across
> super theoretical papers which I am unable to interpret. If you have seen
> any open source implementations I would appreciate you pointing me in the
> right direction.
>
>
Didn't know the context, so I could not answer the question.


Re: Trying to determine the best ML algorithm to use.

Posted by Patrick Collins <pa...@ready2sign.com>.

Hmmm... it appears I cannot use rich text here, so the table below might
be unreadable. Here are the feature vectors in CSV format instead.
They'll need to be pasted into Excel or another spreadsheet to be readable.

"target","possible_field_length","has_line","aligned_text_above_left","white_space_above","label_left","label_below_left","label_below_center","label_below_distance","above_field_2","above_field_2_distance","above_field_1","above_field_1_distance","below_field_1","below_field_1_distance","below_field_2","below_field_2_distance"
"<<Signature>>",50,1,"COMPANY 1 Pty Ltd",2,,"By:",,1,,,,,"<<Name>>",1,"<<Title>>",2
"<<Name>>",120,0,,0,"By:","Title:",,1,,,"<<Signature>>",,"<<Title>>",1,"<<Address>>",2
"<<Title>>",118,0,"By:",0,"Title:","address:",,1,"<<Signature>>",2,"<<Name>>",1,"<<Address>>",1,"<<Signature>>",5
"<<Address>>",115,0,"Title:",0,"address:","COMPANY 2 LLC",,2,"<<Name>>",2,"<<Title>>",1,"<<Signature>>",5,"<<Name>>",6

