Posted to user@uima.apache.org by Christopher Baechle <cb...@my.fau.edu> on 2015/11/06 16:12:56 UTC

How to annotate based on document collection

I am working with an existing project that is built with UIMA. I am trying
to create a tf-idf style score that looks at the set of documents as a
whole.

Since the rest of the project uses UIMA heavily, I would like to implement
this as an annotator if possible, rather than a separate program. Is it
possible within UIMA to do this?

Re: How to annotate based on document collection

Posted by Christopher Baechle <cb...@my.fau.edu>.
Thanks. That answered my question.

Re: How to annotate based on document collection

Posted by buddha <bu...@yahoo.com.INVALID>.
UIMA works best when you are investigating one document at a time. My suggestion would be to run the initial pipeline to get the correct annotations, which I assume are tokens in your case, then save those off into some relational table.

From there, you can run the documents through a second time, load your document frequency (df) values as an external resource, and compute the term frequency (tf) on that second pass.

There are ways to estimate the tf/idf values, but, frankly, the whole notion of “document frequency” means you’ve looked at the whole corpus at least once.
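
The two-pass idea in this reply can be sketched in plain Java. This is only an illustration under stated assumptions, not UIMA code and not the poster's implementation: the class and method names (TwoPassTfIdf, documentFrequencies, tfIdf) are hypothetical, an in-memory map stands in for the relational table and the external resource, documents are assumed to be already tokenized, and idf is taken as ln(N / df), which is one common variant among several.

```java
import java.util.*;

/** Hypothetical sketch: pass 1 collects document frequencies, pass 2 scores each document. */
final class TwoPassTfIdf {

    /** Pass 1: for every term, count how many documents contain it. */
    static Map<String, Integer> documentFrequencies(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            // Use a set so a term repeated inside one document is counted once.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Pass 2: score one document against the df table built in pass 1. */
    static Map<String, Double> tfIdf(List<String> doc, Map<String, Integer> df, int numDocs) {
        // Raw term counts for this document only.
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            if (docFreq == 0) {
                continue; // term was not seen in pass 1, so no idf is available
            }
            // Assumed idf variant: natural log of (corpus size / document frequency).
            double idf = Math.log((double) numDocs / docFreq);
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("uima", "annotator", "uima"),
            Arrays.asList("uima", "pipeline"),
            Arrays.asList("corpus", "pipeline"));

        // First pass over the whole collection.
        Map<String, Integer> df = documentFrequencies(corpus);

        // Second pass: each document is scored on its own, using only the df table.
        for (List<String> doc : corpus) {
            System.out.println(tfIdf(doc, df, corpus.size()));
        }
    }
}
```

In a real pipeline the first method would roughly correspond to the initial run that stores counts, and the second to an annotator that reads those counts back and works on one document at a time; how the table is persisted and loaded is left open here.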
