Posted to user@uima.apache.org by Petr Baudis <pa...@ucw.cz> on 2014/04/22 04:20:17 UTC

Deduplicating Annotations With Same coveredText

  Hi!

  I'm facing a task of deduplicating annotations that have the same
getCoveredText() value (possibly at different sofa locations) - I'd
like to keep just a single one of each; for example, if I were to make
a bag-of-words with only a single annotation per word and the number of
occurrences as a feature.  (Or, in my case, the annotations are scored
candidate answers in a QA system that I'd like to merge if they are
textually the same.)

  Is there a better way than simply loading all annotations of the type
into a Java map, mass-dropping them from the indexes, then re-adding some
of them?

  My idea was to simply index them by coveredText; then, iterating
sequentially, it's enough to compare getCoveredText() of the current and
previous annotation to decide whether to merge them. However, it appears
that coveredText is not supported as a key feature, so I'd have to make
an explicit copy of it as a separate feature. Is there any other option?
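
  The sort-then-compare idea above can be sketched roughly as follows.
This is plain Java and does not use the UIMA index API: Candidate is
a hypothetical stand-in for an annotation carrying its covered text and
a score, and "keep the higher score" is only an assumed merge policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortedDedup {
    /** Hypothetical stand-in for a scored annotation (not a UIMA class). */
    static final class Candidate {
        final String coveredText;
        final double score;

        Candidate(String coveredText, double score) {
            this.coveredText = coveredText;
            this.score = score;
        }
    }

    /**
     * Sorts a copy of the input by covered text, then walks it once,
     * comparing each element only with the previously kept one.
     */
    static List<Candidate> dedup(List<Candidate> input) {
        List<Candidate> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Candidate c) -> c.coveredText));

        List<Candidate> result = new ArrayList<>();
        for (Candidate current : sorted) {
            int last = result.size() - 1;
            if (last >= 0 && result.get(last).coveredText.equals(current.coveredText)) {
                // Same text as the previous entry: merge instead of adding.
                // Assumed policy: keep whichever has the higher score.
                if (current.score > result.get(last).score) {
                    result.set(last, current);
                }
            } else {
                result.add(current);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Candidate> merged = dedup(Arrays.asList(
                new Candidate("Prague", 0.4),
                new Candidate("Brno", 0.7),
                new Candidate("Prague", 0.9)));
        for (Candidate c : merged) {
            System.out.println(c.coveredText + " " + c.score);
        }
    }
}
```

  In a real pipeline the sorted order would have to come from an index
keyed on a string feature holding a copy of the covered text, as noted
above.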

  Thanks,

				Petr "Pasky" Baudis

Re: Deduplicating Annotations With Same coveredText

Posted by Petr Baudis <pa...@ucw.cz>.

  Hi!

On Tue, Apr 22, 2014 at 05:10:56PM -0400, Marshall Schor wrote:
> If you plan on running your pipeline in one JVM (rather than having it scaled
> out over multiple JVMs), you can consider using an external resource which would
> be a plain Java Set<String> of the unique covered text so far found.  Then, in
> the annotator (or annotators) that are adding new FeatureStructures representing
> the possibly duplicate annotation, you can first check the shared resource to
> see if it's already been annotated, and if so, skip both creating the additional
> FeatureStructure, and adding it to the indexes.
> 
> Would that work for your use case?

  That's an interesting approach, thanks for the suggestion.  While I
could do it this way now, I plan to scale out my setup to multiple
machines in the future and this solution would become inconvenient
then.  For the time being, I have simply loaded all the FSes into a map
keyed by coveredText and then removed the duplicates.
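
  A minimal sketch of such a map-based pass, in plain Java rather than
against the CAS: Word is a hypothetical stand-in for the feature
structure, and counting occurrences on the kept instance is an assumed
merge policy (borrowed from the bag-of-words example in the original
question).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapDedup {
    /** Hypothetical stand-in for an annotation (not a UIMA class). */
    static final class Word {
        final String coveredText;
        final int begin;
        int occurrences = 1;

        Word(String coveredText, int begin) {
            this.coveredText = coveredText;
            this.begin = begin;
        }
    }

    /**
     * Keeps the first Word seen for each covered text and bumps its counter
     * for every later duplicate. Duplicates are appended to 'dropped'; in a
     * real annotator those would be the ones removed from the indexes.
     */
    static List<Word> dedup(List<Word> all, List<Word> dropped) {
        // LinkedHashMap preserves first-seen order of the kept entries.
        Map<String, Word> byText = new LinkedHashMap<>();
        for (Word w : all) {
            Word kept = byText.putIfAbsent(w.coveredText, w);
            if (kept != null) {
                kept.occurrences++;
                dropped.add(w);
            }
        }
        return new ArrayList<>(byText.values());
    }

    public static void main(String[] args) {
        List<Word> all = new ArrayList<>();
        all.add(new Word("the", 0));
        all.add(new Word("cat", 4));
        all.add(new Word("the", 8));

        List<Word> dropped = new ArrayList<>();
        for (Word w : dedup(all, dropped)) {
            System.out.println(w.coveredText + " x" + w.occurrences);
        }
        System.out.println("dropped: " + dropped.size());
    }
}
```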

				Petr "Pasky" Baudis

Re: Deduplicating Annotations With Same coveredText

Posted by Marshall Schor <ms...@schor.com>.

If you plan on running your pipeline in one JVM (rather than having it scaled
out over multiple JVMs), you can consider using an external resource which would
be a plain Java Set<String> of the unique covered texts found so far.  Then, in
the annotator (or annotators) that are adding new FeatureStructures representing
the possibly duplicate annotation, you can first check the shared resource to
see if it's already been annotated, and if so, skip both creating the additional
FeatureStructure and adding it to the indexes.
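
As a rough illustration of the check-before-create pattern, here is a
sketch in plain Java.  SeenTexts and markIfNew are hypothetical names; in
a real pipeline the object would be bound to the annotators as an external
resource, which this sketch does not show.  A concurrent set is used on
the assumption that several annotator instances may share it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SeenTexts {
    // Thread-safe set of covered texts that have already been annotated.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true only the first time a given covered text is offered. */
    public boolean markIfNew(String coveredText) {
        // Set.add returns false when the element was already present.
        return seen.add(coveredText);
    }

    public static void main(String[] args) {
        SeenTexts shared = new SeenTexts();
        String[] candidates = {"Prague", "Brno", "Prague"};
        for (String text : candidates) {
            if (shared.markIfNew(text)) {
                // Here the annotator would create the FeatureStructure
                // and add it to the indexes.
                System.out.println("create annotation for " + text);
            } else {
                System.out.println("skip duplicate " + text);
            }
        }
    }
}
```

Because the set lives in one JVM's memory, this only deduplicates within
that JVM.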

Would that work for your use case?

-Marshall
