Posted to user@uima.apache.org by Eric Riebling <er...@cs.cmu.edu> on 2010/06/09 18:47:38 UTC

XMI Collection Reader vs. CAS Consumer

It seems to me that the default file extension written by the XMI CAS
Consumer and the one expected by the XMI Collection Reader should be the
same.  As things stand, the CAS Consumer writes files named "doc0", "doc1",
"doc2", ... but the Collection Reader ignores them because they have no
extension; it expects "doc0.xmi", "doc1.xmi", "doc2.xmi", ...

I'm not sure whether this is by design or an oversight.  It's very useful
to be able to add a new annotation to a set of already-annotated documents
saved in XMI format, especially if the annotations they contain took a lot
of time to produce, but at the moment that requires the extra step of
renaming the files.
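
For what it's worth, the renaming step can be scripted.  Here is a minimal
sketch in Python, not part of UIMA.  It assumes the CAS Consumer's output
sits in a single directory and that every extensionless, non-hidden file
there is an XMI document.  The function name add_xmi_extension is made up
for illustration.

```python
from pathlib import Path


def add_xmi_extension(directory, extension=".xmi"):
    """Append `extension` to every extensionless regular file in `directory`.

    Returns the new paths in the order they were renamed. Files that already
    have a suffix are left alone, and nothing is overwritten.
    """
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories, hidden files, and files that already have a suffix.
        if not path.is_file() or path.name.startswith(".") or path.suffix:
            continue
        target = path.with_name(path.name + extension)
        if target.exists():
            # Assumption: never clobber an existing file; skip it instead.
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling add_xmi_extension on the output directory would turn "doc0" into
"doc0.xmi" and leave files such as "doc2.xmi" or "notes.txt" untouched.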

-- 
Eric Riebling  GHC 6713,  LTI,   SCS,  CMU
412.268.9872   http://www.cs.cmu.edu/~er1k

Re: XMI Collection Reader vs. CAS Consumer

Posted by Eric Riebling <er...@cs.cmu.edu>.
It might even make sense (I haven't tried this) to configure a CPE to
read from and write back to the same folder of XMI documents, overwriting
the input files with their enriched versions.  That would be especially
useful on a system where a very large corpus leaves little free space.  So
long as existing annotations are preserved, and each CAS resides in memory
during processing, I can't see any reason why this shouldn't work.
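
One thing to watch with overwriting inputs is a failure halfway through a
write, which could leave a damaged document with no copy of the original.
Below is a rough sketch in Python of the usual "write to a temporary file,
then rename" pattern.  It is not UIMA code: rewrite_in_place is a made-up
name, and the enrich argument is a hypothetical stand-in for whatever the
pipeline does to a document.

```python
import os
import tempfile


def rewrite_in_place(path, enrich):
    """Replace the file at `path` with enrich(original_bytes)."""
    with open(path, "rb") as f:
        original = f.read()  # the whole document is held in memory
    updated = enrich(original)  # hypothetical stand-in for the pipeline
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(updated)
        # The original file is untouched until this call succeeds.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If enrich raises, or the write fails, the input file is left exactly as it
was.  The cost is temporary space for one extra document at a time, rather
than for a second copy of the whole corpus.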

-- 
Eric Riebling  GHC 6713,  LTI,   SCS,  CMU
412.268.9872   http://www.cs.cmu.edu/~er1k

Re: XMI Collection Reader vs. CAS Consumer

Posted by Marshall Schor <ms...@schor.com>.
Hi Eric,

I thought the behavior of the CAS Consumer had been "fixed" to add the
".xmi" extension; see Jira https://issues.apache.org/jira/browse/UIMA-629,
which should have been picked up in release 2.3.0.

Are you running with 2.2.x?

-Marshall
