
Posted to dev@opennlp.apache.org by William Colen <wi...@gmail.com> on 2014/04/15 19:45:14 UTC

DocumentSample in Doccat

Hello,

I've been working with the Doccat module and I am wondering if we could
improve its data structure for the 1.6.0 release.

Today the DocumentSample has the following attributes:

- String category
- List<String> text

I would suggest adding an attribute to hold metadata, or additional
context information. What do you think?

Also, what do you think of including sentence and paragraph information? I
don't know if there is anything a feature generator can extract from it to
improve the classification.

Thank you,
William

Re: DocumentSample in Doccat

Posted by William Colen <wi...@gmail.com>.

Yes, it would be nice! Any other opinion?

Will you open a Jira for this improvement?

Thank you,
William

2014-04-27 21:59 GMT-03:00 Mark G <ma...@apache.org>:

> In my local copy I have these methods in the interface:
>  Map<String, Double> scoreMap(String text);
>  SortedMap<Double, Set<String>> sortedScoreMap(String text);
>

Re: DocumentSample in Doccat

Posted by Mark G <ma...@apache.org>.

In my local copy I have these methods in the interface:
 Map<String, Double> scoreMap(String text);
 SortedMap<Double, Set<String>> sortedScoreMap(String text);

and these impls of them in the ME impl


  public Map<String, Double> scoreMap(String text) {
    Map<String, Double> probDist = new HashMap<String, Double>();

    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      probDist.put(category, categorize[getIndex(category)]);
    }
    return probDist;

  }

  public SortedMap<Double, Set<String>> sortedScoreMap(String text) {
    SortedMap<Double, Set<String>> descendingMap =
        new TreeMap<Double, Set<String>>().descendingMap();
    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      double score = categorize[getIndex(category)];
      if (descendingMap.containsKey(score)) {
        descendingMap.get(score).add(category);
      } else {
        Set<String> newset = new HashSet<>();
        newset.add(category);
        descendingMap.put(score, newset);
      }
    }
    return descendingMap;
  }


They are pretty simple, but if everyone agrees I can commit them (with some
Javadoc).





On Sat, Apr 26, 2014 at 8:39 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On Thu, 2014-04-24 at 19:54 -0300, William Colen wrote:
> > Yes, it looks nice. Maybe we should redo all the DocumentCategorizer
> > interface. It is different from other tools, for example, we can't get
> > the best category of one document with only one call, we need to use two
> > methods.
>
> Yes that is right. +1 to change it. Can we deprecate the old methods and
> just add new ones to not break backward compatibility?
>
> Jörn
>
>

Re: DocumentSample in Doccat

Posted by Jörn Kottmann <ko...@gmail.com>.

On Thu, 2014-04-24 at 19:54 -0300, William Colen wrote:
> Yes, it looks nice. Maybe we should redo all the DocumentCategorizer
> interface. It is different from other tools, for example, we can't get the
> best category of one document with only one call, we need to use two
> methods.

Yes that is right. +1 to change it. Can we deprecate the old methods and
just add new ones to not break backward compatibility?

Jörn
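
One way to add new methods without breaking existing implementations, assuming Java 8 or later, is a default method on the interface. The sketch below is illustrative only: `Categorizer`, `FixedCategorizer`, and the exact signatures are hypothetical names chosen for this example, not the actual DocumentCategorizer API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface for illustration; not the real OpenNLP DocumentCategorizer.
interface Categorizer {

  double[] categorize(String text);

  String getCategory(int index);

  int getNumberOfCategories();

  /** Old second step: deprecated, but kept so existing callers still compile. */
  @Deprecated
  String getBestCategory(double[] outcome);

  /** New method: the default body means existing implementations need no change. */
  default Map<String, Double> scoreMap(String text) {
    double[] scores = categorize(text);
    Map<String, Double> result = new LinkedHashMap<>();
    for (int i = 0; i < getNumberOfCategories(); i++) {
      result.put(getCategory(i), scores[i]);
    }
    return result;
  }
}

// Toy implementation written only against the old methods.
final class FixedCategorizer implements Categorizer {
  private static final String[] CATEGORIES = {"spam", "ham"};

  @Override
  public double[] categorize(String text) {
    // Not a real model: the scores depend only on whether the text mentions "offer".
    return text.contains("offer") ? new double[] {0.9, 0.1} : new double[] {0.2, 0.8};
  }

  @Override
  public String getCategory(int index) {
    return CATEGORIES[index];
  }

  @Override
  public int getNumberOfCategories() {
    return CATEGORIES.length;
  }

  @Override
  public String getBestCategory(double[] outcome) {
    return outcome[0] >= outcome[1] ? CATEGORIES[0] : CATEGORIES[1];
  }
}
```

`FixedCategorizer` never mentions `scoreMap`, yet callers can use it, which is the backward-compatible behaviour being asked for.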

Re: DocumentSample in Doccat

Posted by Mark G <ma...@apache.org>.

William, here is another thought: we could include something like this to
return a map sorted in descending order, with the best score on top, so you
can call categoriesAsSortedMap(text).firstEntry() to get the best score
(which can be the same for more than one category, hence the Set as value).

  public NavigableMap<Double, Set<String>> categoriesAsSortedMap(String text) {
    NavigableMap<Double, Set<String>> descendingMap =
        new TreeMap<Double, Set<String>>().descendingMap();
    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      double score = categorize[getIndex(category)];
      if (descendingMap.containsKey(score)) {
        descendingMap.get(score).add(category);
      } else {
        Set<String> newset = new HashSet<>();
        newset.add(category);
        descendingMap.put(score, newset);
      }
    }
    return descendingMap;
  }
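
The same grouping can be tried outside the ME class. This is a minimal sketch under stated assumptions: `ScoreMaps` and `sortedScores` are hypothetical names, and the two arrays stand in for the category labels and the output of categorize(text).

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of OpenNLP: groups category names by score,
// highest score first, mirroring the categoriesAsSortedMap idea above.
final class ScoreMaps {

  static NavigableMap<Double, Set<String>> sortedScores(String[] categories, double[] scores) {
    // reverseOrder() makes firstEntry() the best-scoring group.
    NavigableMap<Double, Set<String>> byScore = new TreeMap<>(Collections.reverseOrder());
    for (int i = 0; i < categories.length; i++) {
      // Categories with an identical score share one entry.
      byScore.computeIfAbsent(scores[i], key -> new TreeSet<>()).add(categories[i]);
    }
    return byScore;
  }

  public static void main(String[] args) {
    String[] categories = {"sports", "politics", "tech"};
    double[] scores = {0.2, 0.6, 0.2};
    NavigableMap<Double, Set<String>> sorted = sortedScores(categories, scores);
    System.out.println("best: " + sorted.firstEntry());
    System.out.println("all:  " + sorted);
  }
}
```

Because the map is keyed by the raw double, two categories only share an entry when their scores are exactly equal; near-ties end up in separate entries.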


On Thu, Apr 24, 2014 at 7:04 PM, Tech mail <gi...@gmail.com> wrote:

> I think it might also be true that the featuregenerator interface in
> doccat is different than the others, also I don't think the tokennamefinder
> interface has a probs() method, which has always made me use the ME impl
> direct.
>
> Sent from my iPhone
>

Re: DocumentSample in Doccat

Posted by Tech mail <gi...@gmail.com>.

I think it might also be true that the FeatureGenerator interface in doccat is different from the others. Also, I don't think the TokenNameFinder interface has a probs() method, which has always made me use the ME impl directly.

Sent from my iPhone

> On Apr 24, 2014, at 6:54 PM, William Colen <wi...@gmail.com> wrote:
> 
> Yes, it looks nice. Maybe we should redo all the DocumentCategorizer
> interface. It is different from other tools, for example, we can't get the
> best category of one document with only one call, we need to use two
> methods.
> 
> 
> 

Re: DocumentSample in Doccat

Posted by William Colen <wi...@gmail.com>.

Yes, it looks nice. Maybe we should redo the whole DocumentCategorizer
interface. It differs from the other tools: for example, we can't get the
best category of a document with a single call; we need to use two
methods.
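
To make the two-call pattern concrete, here is a minimal sketch. `TwoStepCategorizer` is a hypothetical stand-in with a toy scoring rule; the method names `categorize` and `getBestCategory` are meant to mirror the two calls mentioned above, but the signatures shown are assumptions, and `bestCategory` is an invented name for the single-call convenience.

```java
// Hypothetical stand-in for a categorizer; only the call pattern matters here.
final class TwoStepCategorizer {
  private static final String[] CATEGORIES = {"business", "science"};

  /** Step 1: one score per category, in category-index order. */
  double[] categorize(String text) {
    // Toy scoring rule, not a trained model.
    return text.toLowerCase().contains("market")
        ? new double[] {0.7, 0.3}
        : new double[] {0.4, 0.6};
  }

  /** Step 2: map the score array back to the label with the highest score. */
  String getBestCategory(double[] outcome) {
    int best = 0;
    for (int i = 1; i < outcome.length; i++) {
      if (outcome[i] > outcome[best]) {
        best = i;
      }
    }
    return CATEGORIES[best];
  }

  /** Single-call convenience that combines both steps. */
  String bestCategory(String text) {
    return getBestCategory(categorize(text));
  }

  public static void main(String[] args) {
    TwoStepCategorizer categorizer = new TwoStepCategorizer();
    // Two calls: score first, then pick the label.
    double[] outcome = categorizer.categorize("Stock market rallies");
    System.out.println(categorizer.getBestCategory(outcome));
    // One call.
    System.out.println(categorizer.bestCategory("New particle observed"));
  }
}
```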



2014-04-24 18:43 GMT-03:00 Mark G <ma...@apache.org>:

> William, that map looks good to me.
> In my current project I find this method convenient for getting back the
> probs over the categories in the model as a Map....let me know if there's
> anything wrong with it :)
>
> public Map<String, Double> categoriesAsMap(String text) {
>     Map<String, Double> probDist = new HashMap<String, Double>();
>
>     double[] categorize = categorize(text);
>     int catSize = getNumberOfCategories();
>     for (int i = 0; i < catSize; i++) {
>       String category = getCategory(i);
>       probDist.put(category, categorize[getIndex(category)]);
>     }
>     return probDist;
>
>   }
>
> perhaps we should consider adding this method to abstract some
> details....just a thought
>
>
>
>
>

Re: DocumentSample in Doccat

Posted by Mark G <ma...@apache.org>.

William, that map looks good to me.
In my current project I find this method convenient for getting back the
probs over the categories in the model as a Map....let me know if there's
anything wrong with it :)

public Map<String, Double> categoriesAsMap(String text) {
  Map<String, Double> probDist = new HashMap<String, Double>();

  double[] categorize = categorize(text);
  int catSize = getNumberOfCategories();
  for (int i = 0; i < catSize; i++) {
    String category = getCategory(i);
    probDist.put(category, categorize[getIndex(category)]);
  }
  return probDist;
}

perhaps we should consider adding this method to abstract some
details....just a thought





On Thu, Apr 24, 2014 at 3:56 PM, William Colen <wi...@gmail.com> wrote:

> What do you think of adding the following field to the DocumentSample?
>
> Map<String, Object> extraInformation
>
>
> Also, we could add the following methods to the DocumentCategorizer
> interface:
>
> public double[] categorize(String text[], Map<String, Object> extraInformation);
> public double[] categorize(String documentText, Map<String, Object> extraInformation);
>
> Any opinion?
>
> Thank you,
> William
>
>

Re: DocumentSample in Doccat

Posted by William Colen <wi...@gmail.com>.

What do you think of adding the following field to the DocumentSample?

Map<String, Object> extraInformation


Also, we could add the following methods to the DocumentCategorizer
interface:

public double[] categorize(String text[], Map<String, Object> extraInformation);
public double[] categorize(String documentText, Map<String, Object> extraInformation);

Any opinion?

Thank you,
William
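
A rough sketch of what a sample carrying the proposed map could look like. `SampleWithExtras` is a hypothetical class written for this illustration, not the actual DocumentSample; the defensive copies and the null handling are assumptions rather than anything decided in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed shape; not the actual OpenNLP DocumentSample.
final class SampleWithExtras {
  private final String category;
  private final List<String> text;
  private final Map<String, Object> extraInformation;

  SampleWithExtras(String category, String[] text, Map<String, Object> extraInformation) {
    this.category = category;
    // Copy the tokens so later changes to the caller's array do not leak in.
    this.text = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(text)));
    // A null map is treated as "no extra information".
    Map<String, Object> copy = new HashMap<>();
    if (extraInformation != null) {
      copy.putAll(extraInformation);
    }
    this.extraInformation = Collections.unmodifiableMap(copy);
  }

  String getCategory() {
    return category;
  }

  List<String> getText() {
    return text;
  }

  Map<String, Object> getExtraInformation() {
    return extraInformation;
  }

  public static void main(String[] args) {
    Map<String, Object> extra = new HashMap<>();
    extra.put("Subject", "Re: launch window");
    SampleWithExtras sample =
        new SampleWithExtras("sci.space", new String[] {"the", "launch", "slipped"}, extra);
    System.out.println(sample.getCategory() + " " + sample.getExtraInformation());
  }
}
```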


2014-04-17 10:39 GMT-03:00 Mark G <gi...@gmail.com>:

> Another general doccat thought I had is this. in my projects that use
> Doccat, I created a class called a samplecollection, which simply wrapped a
> list<documentsample> but then provided  a method that returned the samples
> as a DoccatModel (using a properly formatted ByteArrayInputStream of the
> doccat training format of all the samples). This worked out well because I
> stored all the samples in a database, and users could CRUD samples for
> different categories. There was a map reduce job that at job startup read
> in the samples from the database into the samplecollection, dynamically
> generated the model, and then used the model to classify all the texts
> across the cluster; so every MR job ran the latest and greatest model based
> on current samples. Not sure if we're interested in something like that,
> but I see several questions on stack overflow asking about iterative model
> building, and a SampleCollection that returns a Model has worked for me.  I
> also created a SampleCRUD interface that abstracts storage and retrieval of
> the samples.... I had a Postgres and Accumulo impl for sample storage.
> just a thought, I know this can get very specific and complicated, thought
> we may be able to find a middle ground by providing a framework and some
> generic impls.
> MG
>
>
> On Thu, Apr 17, 2014 at 8:28 AM, William Colen <william.colen@gmail.com> wrote:
>
> > Yes, I don't see how to represent the sentences and paragraphs.
> >
> > +1 for the generic Map as suggested by Mark. We already have such things
> > in other sample classes, like NameSample and the POSSample.
> >
> > A use case: the 20news corpus is a collection of articles, and each
> > article contains fields like "From", "Subject", "Organization". Mahout,
> > which includes a formatter for this corpus, concatenates it all into the
> > text field, but I think we could improve accuracy by handling this
> > metadata in a separate feature generator.
> >
> >
> > 2014-04-17 8:37 GMT-03:00 Tech mail <gi...@gmail.com>:
> >
> > > I agree, this goes back to the concept of having a "document" model...
> > > I know in the prod systems where I've used doccat, storing sentences
> > > and paragraphs wouldn't make sense; people usually have their own
> > > domain model for that. I still feel like if we augment the
> > > DocumentSample object with a generic Map it would be helpful in some
> > > cases and not constraining.
> > >
> > > Sent from my iPhone
> > >
> > > > On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> > > >
> > > >> On 04/15/2014 07:45 PM, William Colen wrote:
> > > >> Hello,
> > > >>
> > > >> I've been working with the Doccat module and I am wondering if we
> > > >> could improve its data structure for the 1.6.0 release.
> > > >>
> > > >> Today the DocumentSample has the following attributes:
> > > >>
> > > >> - String category
> > > >> - List<String> text
> > > >>
> > > >> I would suggest adding an attribute to hold metadata, or additional
> > > >> context information. What do you think?
> > > >
> > > > Right now the training format contains these two fields per line.
> > > > Do you want to change the format as well?
> > > >
> > > >> Also, what do you think of including sentence and paragraph
> > > >> information? I don't know if there is anything a feature generator
> > > >> can extract from it to improve the classification.
> > > >
> > > > I guess we only want to do that if there is a use case for it. It
> > > > will make the processing for the clients more complex, since they
> > > > then would have to provide sentences and paragraphs compared to just
> > > > a piece of text.
> > > >
> > > > Jörn
> > >
> >
>

Re: DocumentSample in Doccat

Posted by Mark G <gi...@gmail.com>.

Another general doccat thought: in my projects that use Doccat, I created a
class called SampleCollection, which simply wraps a List<DocumentSample> but
also provides a method that returns the samples as a DoccatModel (using a
properly formatted ByteArrayInputStream of all the samples in the doccat
training format). This worked out well because I stored all the samples in a
database, and users could CRUD samples for different categories. A MapReduce
job read the samples from the database into the SampleCollection at job
startup, dynamically generated the model, and then used the model to classify
all the texts across the cluster, so every MR job ran the latest and greatest
model based on the current samples. Not sure if we're interested in something
like that, but I see several questions on Stack Overflow asking about
iterative model building, and a SampleCollection that returns a model has
worked for me. I also created a SampleCRUD interface that abstracts storage
and retrieval of the samples; I had Postgres and Accumulo impls for sample
storage. Just a thought. I know this can get very specific and complicated,
but we may be able to find a middle ground by providing a framework and some
generic impls.
MG
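
A rough Java sketch of the serialization half of such a SampleCollection follows. It assumes the doccat training format is one document per line, with the category first and the whitespace-separated text after it. The classes TrainingSample and SampleCollection here are illustrative stand-ins written for this example, not the code described above, and the step that hands the stream to the trainer is left out because it depends on the OpenNLP version in use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for a document sample: a category plus its tokens.
final class TrainingSample {
  final String category;
  final List<String> tokens;

  TrainingSample(String category, List<String> tokens) {
    this.category = category;
    this.tokens = new ArrayList<>(tokens);
  }
}

// Wraps a list of samples and renders them in the assumed training format.
final class SampleCollection {
  private final List<TrainingSample> samples = new ArrayList<>();

  void add(String category, String... tokens) {
    samples.add(new TrainingSample(category, Arrays.asList(tokens)));
  }

  // One line per sample: "<category> <token> <token> ...".
  String toTrainingText() {
    StringBuilder sb = new StringBuilder();
    for (TrainingSample sample : samples) {
      sb.append(sample.category);
      for (String token : sample.tokens) {
        sb.append(' ').append(token);
      }
      sb.append('\n');
    }
    return sb.toString();
  }

  // In-memory stream over the training text; a trainer would read from this.
  ByteArrayInputStream toTrainingStream() {
    return new ByteArrayInputStream(toTrainingText().getBytes(StandardCharsets.UTF_8));
  }
}
```

A storage layer in the spirit of SampleCRUD would only need to fill the collection through add() before the stream is built, which keeps the database details out of the training path.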

On Thu, Apr 17, 2014 at 8:28 AM, William Colen <wi...@gmail.com> wrote:

> Yes, I don't see how to represent the sentences and paragraphs.
>
> +1 for the generic Map as suggested by Mark. We already have such things in
> other sample classes, like NameSample and the POSSample.
>
> A use case: the 20news corpus is a collection of articles, and each article
> contains fields like "From", "Subject", "Organization". Mahout, which
> includes a formatter for this corpus, concatenates them all into the text
> field, but I think we could improve accuracy by handling this metadata in a
> separate feature generator.

Re: DocumentSample in Doccat

Posted by William Colen <wi...@gmail.com>.

Yes, I don't see how to represent the sentences and paragraphs.

+1 for the generic Map as suggested by Mark. We already have such things in
other sample classes, like NameSample and the POSSample.

A use case: the 20news corpus is a collection of articles, and each article
contains fields like "From", "Subject", "Organization". Mahout, which
includes a formatter for this corpus, concatenates them all into the text
field, but I think we could improve accuracy by handling this metadata in a
separate feature generator.
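
As a sketch of the idea, the Java below keeps header fields out of the text and turns them into their own features. The ContextFeatureGenerator interface, the class names, and the "bow=" and "meta:" prefixes are assumptions made for this example and are not taken from OpenNLP or Mahout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical generator contract: tokens plus the extra-information map.
interface ContextFeatureGenerator {
  List<String> extractFeatures(String[] text, Map<String, Object> extraInformation);
}

// Plain bag-of-words features from the body text; ignores the metadata.
final class TokenFeatures implements ContextFeatureGenerator {
  @Override
  public List<String> extractFeatures(String[] text, Map<String, Object> extraInformation) {
    List<String> features = new ArrayList<>();
    for (String token : text) {
      features.add("bow=" + token);
    }
    return features;
  }
}

// Features from selected header fields, kept apart from the body tokens.
final class HeaderFeatures implements ContextFeatureGenerator {
  private final String[] fields;

  HeaderFeatures(String... fields) {
    this.fields = fields.clone();
  }

  @Override
  public List<String> extractFeatures(String[] text, Map<String, Object> extraInformation) {
    List<String> features = new ArrayList<>();
    for (String field : fields) {
      Object value = extraInformation.get(field);
      if (value != null) {
        // Prefix with the field name so "Subject" and body tokens never collide.
        features.add("meta:" + field + "=" + value.toString().toLowerCase(Locale.ROOT));
      }
    }
    return features;
  }
}

// Runs every generator in order and concatenates their features.
final class FeatureCollector {
  static List<String> collect(String[] text, Map<String, Object> extraInformation,
                              ContextFeatureGenerator... generators) {
    List<String> all = new ArrayList<>();
    for (ContextFeatureGenerator generator : generators) {
      all.addAll(generator.extractFeatures(text, extraInformation));
    }
    return all;
  }
}
```

Because the header values carry a field-specific prefix, the classifier can weight a word in "Subject" differently from the same word in the body, which is the accuracy argument made above; whether it helps on 20news would still need to be measured.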


2014-04-17 8:37 GMT-03:00 Tech mail <gi...@gmail.com>:

> I agree, this goes back to the concept of having a "document" model...
> In the production systems where I've used doccat, storing sentences and
> paragraphs wouldn't make sense; people usually have their own domain model
> for that. I still feel that augmenting the DocumentSample object with a
> generic Map would be helpful in some cases and not constraining.

Re: DocumentSample in Doccat

Posted by Tech mail <gi...@gmail.com>.

I agree, this goes back to the concept of having a "document" model...
In the production systems where I've used doccat, storing sentences and paragraphs wouldn't make sense; people usually have their own domain model for that. I still feel that augmenting the DocumentSample object with a generic Map would be helpful in some cases and not constraining.

Sent from my iPhone

> On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> 
>> On 04/15/2014 07:45 PM, William Colen wrote:
>> Hello,
>> 
>> I've been working with the Doccat module and I am wondering if we could
>> improve its data structure for the 1.6.0 release.
>> 
>> Today the DocumentSample has the following attributes:
>> 
>> - String category
>> - List<String> text
>> 
>> I would suggest adding an attribute to hold metadata, or additional
>> contexts information. What do you think?
> 
> Right now the training format contains these two fields per line.
> Do you want to change the format as well?
> 
>> Also, what do you think of including sentences and paragraph information? I
>> don't know if there is anything a feature generator can extract from it to
>> improve the classification.
> 
> I guess we only want to do that if there is a use case for it. It will make the processing for the clients
> more complex, since they then would have to provide sentences and paragraphs compared to just
> a piece of text.
> 
> Jörn

Re: DocumentSample in Doccat

Posted by Jörn Kottmann <ko...@gmail.com>.

On 04/15/2014 07:45 PM, William Colen wrote:
> Hello,
>
> I've been working with the Doccat module and I am wondering if we could
> improve its data structure for the 1.6.0 release.
>
> Today the DocumentSample has the following attributes:
>
> - String category
> - List<String> text
>
> I would suggest adding an attribute to hold metadata, or additional
> contexts information. What do you think?

Right now the training format contains these two fields per line.
Do you want to change the format as well?

> Also, what do you think of including sentences and paragraph information? I
> don't know if there is anything a feature generator can extract from it to
> improve the classification.

I guess we only want to do that if there is a use case for it. It will make
the processing for the clients more complex, since they then would have to
provide sentences and paragraphs compared to just a piece of text.

Jörn

Re: DocumentSample in Doccat

Posted by Tech mail <gi...@gmail.com>.

William, in the last project where I used doccat, I extended DocumentSample and just added a generic Map to hold additional key/value pairs. Adding that to the baseline might be a natural fit.
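
For illustration, a sample class along those lines might look like the Java sketch below. MetadataDocumentSample and its accessor names are hypothetical and written only for this example; it is a standalone class rather than a subclass so that it compiles without OpenNLP on the classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sample: the two existing attributes plus a generic map.
final class MetadataDocumentSample {
  private final String category;
  private final List<String> text;
  private final Map<String, Object> extraInformation;

  // Existing callers that have no metadata keep using two arguments.
  MetadataDocumentSample(String category, List<String> text) {
    this(category, text, Collections.<String, Object>emptyMap());
  }

  MetadataDocumentSample(String category, List<String> text,
                         Map<String, Object> extraInformation) {
    this.category = category;
    // Defensive copies so the sample cannot change after construction.
    this.text = Collections.unmodifiableList(new ArrayList<>(text));
    this.extraInformation =
        Collections.unmodifiableMap(new HashMap<>(extraInformation));
  }

  String getCategory() {
    return category;
  }

  List<String> getText() {
    return text;
  }

  Map<String, Object> getExtraInformation() {
    return extraInformation;
  }
}
```

Keeping the map optional means the two-field training format can stay as it is, with metadata supplied only by formats or callers that have it.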

Sent from my iPhone

> On Apr 15, 2014, at 11:45 AM, William Colen <wi...@gmail.com> wrote:
> 
> Hello,
> 
> I've been working with the Doccat module and I am wondering if we could
> improve its data structure for the 1.6.0 release.
> 
> Today the DocumentSample has the following attributes:
> 
> - String category
> - List<String> text
> 
> I would suggest adding an attribute to hold metadata, or additional
> contexts information. What do you think?
> 
> Also, what do you think of including sentences and paragraph information? I
> don't know if there is anything a feature generator can extract from it to
> improve the classification.
> 
> Thank you,
> William