Posted to dev@uima.apache.org by Thilo Goetz <tw...@gmx.de> on 2009/07/24 10:22:16 UTC

Document collections [was: Re: Building the eclipse update site]

Jörn Kottmann wrote:
> Jörn Kottmann wrote:
>>
>>> A collection of text documents that you can run
>>> analysis on.  If I understand correctly, the Cas
>>> Editor currently requires XCAS/XmiCAS files.  It
>>> would be nice if users could just add their text
>>> files and then either create annotations manually
>>> with the Cas Editor, or automatically by running
>>> some analysis and then view the results using the
>>> Cas Editor.  Then we could add results comparison
>>> etc.  See
>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>
>>> for an (outdated) description of what we have
>>> in-house.  It's geared more towards a business user
>>> than a developer, but the ideas of document collections
>>> and the development cycle are equally applicable.
>>> If there was enough interest here, I think that
>>> would be a good direction to go in.
>>>   
>> Yes for me it sounds like the right way.
>> We could also use it for debugging an AE, then
>> a user defines a debug configuration and adds
>> the collection as document source.
> How would you define the format of a document collection ?
> 
> To open a CAS document the document itself and a type system
> for the document is needed.
> 
> In the Cas Editor right now, an Input Collection is a Corpus folder that
> contains xmi/xcas files in one directory. Together with the project type
> system, the files can be loaded by UIMA. It has been criticized, though,
> for not allowing subdirectories for structuring its documents.
> 
> Jörn

That's perfectly fine, we do this in a similar way.
What would be good though is to distinguish between
text documents and "CAS documents" (be they XCAS, XMI
or some other format).  So you could start your work
by importing some text documents, then annotate them
in various ways (manually, or with coded annotators).
The CASes would reside in a different folder, and you
could derive any number of CAS collections from the
same set of source text documents.  We find that way
of working very convenient.

--Thilo
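
A minimal sketch of the layout described above, in which source text documents live in one folder and any number of derived CAS collections live in sibling folders. The class name, the folder roots, the `.txt`/`.xmi` extensions, and the collection names are all assumptions made for illustration; nothing here is taken from the Cas Editor code base. Subdirectories are mirrored, since the flat Corpus folder was the point of criticism.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical layout helper; none of these names come from the Cas Editor. */
class DocumentCollectionLayout {

    private final Path textRoot; // folder holding the imported source text documents
    private final Path casRoot;  // parent folder of all derived CAS collections

    DocumentCollectionLayout(Path textRoot, Path casRoot) {
        this.textRoot = textRoot;
        this.casRoot = casRoot;
    }

    /** Lists all .txt files under the text folder, including subdirectories. */
    List<Path> listTextDocuments() throws IOException {
        try (Stream<Path> paths = Files.walk(textRoot)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".txt"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Maps a text document to its CAS file in the named collection, mirroring subdirectories. */
    Path casFileFor(Path textDocument, String collectionName) {
        Path relative = textRoot.relativize(textDocument);
        String name = relative.getFileName().toString();
        String base = name.endsWith(".txt") ? name.substring(0, name.length() - 4) : name;
        return casRoot.resolve(collectionName)
                      .resolve(relative)
                      .resolveSibling(base + ".xmi");
    }
}
```

With this mapping, a manually annotated collection and an automatically annotated one can both be derived from the same text folder simply by using two different collection names.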

Re: Document collections [was: Re: Building the eclipse update site]

Posted by Marshall Schor <ms...@schor.com>.

Jörn Kottmann wrote:
> Jörn Kottmann wrote:
>> The proposed changes would move the project model out
>> of the Cas Editor into a new project with the advantage that
>> it can be used by other tooling too. Like an analysis engine launcher
>> which needs a document collection to run the AE or a PEAR
>> runner.
>
> Should we split the Cas Editor for the 2.3.0 release or should we
> wait for the release after 2.3.0 ?
>
> In my opinion we should wait and rebuild the project model
> based on the Cas Editor project model. This would have the advantage
> that we can consider new requirements we may get
> from other tooling like an AE debug launcher or  multi view support
> for the Cas Editor.
>
> It also has the disadvantage that the Cas Editor users have
> to convert their projects one day.
>

This could wait for the next release, IMHO, because I think it might
take more than the time left :-) given all the other things on our plate
to get done...  And, I agree that the understanding of use case /
requirements might improve with a bit more time and discussion.

-Marshall

> Jörn

Re: Document collections [was: Re: Building the eclipse update site]

Posted by Thilo Goetz <tw...@gmx.de>.
Jörn Kottmann wrote:
> Jörn Kottmann wrote:
>> The proposed changes would move the project model out
>> of the Cas Editor into a new project with the advantage that
>> it can be used by other tooling too. Like an analysis engine launcher
>> which needs a document collection to run the AE or a PEAR
>> runner.
> 
> Should we split the Cas Editor for the 2.3.0 release or should we
> wait for the release after 2.3.0 ?

Waiting is fine with me, I'm in no rush atm.

--Thilo

> 
> In my opinion we should wait and rebuild the project model
> based on the Cas Editor project model. This would have the advantage
> that we can consider new requirements we may get
> from other tooling like an AE debug launcher or  multi view support
> for the Cas Editor.
> 
> It also has the disadvantage that the Cas Editor users have
> to convert their projects one day.
> 
> Jörn

Re: Document collections [was: Re: Building the eclipse update site]

Posted by Jörn Kottmann <ko...@gmail.com>.
Jörn Kottmann wrote:
> The proposed changes would move the project model out
> of the Cas Editor into a new project with the advantage that
> it can be used by other tooling too. Like an analysis engine launcher
> which needs a document collection to run the AE or a PEAR
> runner.

Should we split the Cas Editor for the 2.3.0 release or should we
wait for the release after 2.3.0 ?

In my opinion we should wait and rebuild the project model
based on the Cas Editor project model. This would have the advantage
that we can consider new requirements we may get
from other tooling, like an AE debug launcher or multi-view support
for the Cas Editor.

It also has the disadvantage that the Cas Editor users have
to convert their projects one day.

Jörn


Re: Document collections [was: Re: Building the eclipse update site]

Posted by Jörn Kottmann <ko...@gmail.com>.
Thilo Goetz wrote:
> Jörn Kottmann wrote:
>   
>> Thilo Goetz wrote:
>>     
>>> Jörn Kottmann wrote:
>>>  
>>>       
>>>> Thilo Goetz wrote:
>>>>    
>>>>         
>>>>> Jörn Kottmann wrote:
>>>>>  
>>>>>      
>>>>>           
>>>>>> Jörn Kottmann wrote:
>>>>>>           
>>>>>>             
>>>>>>>> A collection of text documents that you can run
>>>>>>>> analysis on.  If I understand correctly, the Cas
>>>>>>>> Editor currently requires XCAS/XmiCAS files.  It
>>>>>>>> would be nice if users could just add their text
>>>>>>>> files and then either create annotations manually
>>>>>>>> with the Cas Editor, or automatically by running
>>>>>>>> some analysis and then view the results using the
>>>>>>>> Cas Editor.  Then we could add results comparison
>>>>>>>> etc.  See
>>>>>>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> for a (outdated) description of what we have
>>>>>>>> in-house.  It's geared more towards a business user
>>>>>>>> than a developer, but the ideas of document collections
>>>>>>>> and the development cycle are equally applicable.
>>>>>>>> If there was enough interest here, I think that
>>>>>>>> would be a good direction to go in.
>>>>>>>>                       
>>>>>>>>                 
>>>>>>> Yes for me it sounds like the right way.
>>>>>>> We could also use it for debugging an AE, then
>>>>>>> a user defines a debug configuration and adds
>>>>>>> the collection as document source.
>>>>>>>                 
>>>>>>>               
>>>>>> How would you define the format of a document collection ?
>>>>>>
>>>>>> To open a CAS document the document itself and a type system
>>>>>> for the document is needed.
>>>>>>
>>>>>> In the Cas Editor right now an Input Collection is a Corpus folder
>>>>>> which
>>>>>> contains xmi/xcas files
>>>>>> in one directory together with the project type system the files
>>>>>> can be
>>>>>> loaded by UIMA. Though
>>>>>> it has be criticized for not allowing sub directories for structuring
>>>>>> its documents.
>>>>>>
>>>>>> Jörn
>>>>>>             
>>>>>>             
>>>>> That's perfectly fine, we do this in a similar way.
>>>>> What would be good though is to distinguish between
>>>>> text documents and "CAS documents" (be they XCAS, XMI
>>>>> or some other format).  So you could start your work
>>>>> by importing some text documents, then annotate them
>>>>> in various ways (manually, or with coded annotators).
>>>>> The CASes would reside in a different folder, and you
>>>>> could derive any number of CAS collections from the
>>>>> same set of source text documents.  We find that way
>>>>> of working very convenient.
>>>>>       
>>>>>           
>>>> We could reuse the code which is in the Cas Editor right
>>>> now and move it into a new plugin which provides the document
>>>> collections and type system to other plugins.
>>>>
>>>> The Cas Editor should be independent of the project model because
>>>> people who use the Cas Editor do not necessarily want to it.
>>>>     
>>>>         
>>> +1, couldn't agree more.  In fact, I would like to integrate
>>> the CAS editor into our tooling, that would be a good test
>>> case how independent it is.  I don't know when I'll get around
>>> to playing with that, but it's definitely on my to do list.
>>>   
>>>       
>> Ok, then lets split the Cas Editor into the editing part and project
>> model. For the project model part we have to create a new eclipse
>> project, e.g. uimaj-ep-base. The remaining Cas Editor should be independent
>> of the project model which means uimaj-ep-base depends on the Cas Editor
>> (to add the
>> document provider extension to it).
>>
>> After we are done with that, we can look into uimaj-ep-base and see how
>> it fits our needs
>> and how it can be used by the other eclipse based tooling.
>>
>> Jörn
>>     
>
> +1, that would be great.  I would like to be able to use
> the CAS Editor sometimes not to edit CASes, but to simply
> display them.  For example, you could imagine some Eclipse
> tooling that runs UIMA analysis and displays the results
> in the CAS Editor, without first materializing the CAS on
> disk.  So with your proposed changes, I should be able to
> do that, right?
>   
That is already possible. You must implement your own
CAS provider which knows how to get the CAS object
and config settings from your tooling.

The proposed changes would move the project model out
of the Cas Editor into a new project, with the advantage that
it can be used by other tooling too, such as an analysis engine
launcher, which needs a document collection to run the AE, or a
PEAR runner.

Jörn
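
To illustrate the idea of a tooling-supplied CAS provider, here is a rough sketch in which the hosting tooling hands analysis results to an editor from memory, without writing them to disk first. The `CasProvider` interface, the `InMemoryCas` stand-in, and the `publish` method are hypothetical names invented for this example; they are not the Cas Editor's actual extension point, and a real integration would hold a UIMA CAS object and the editor's configuration instead of a plain string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Stand-in for an analysis result; a real integration would hold a UIMA CAS here. */
final class InMemoryCas {
    final String documentText;

    InMemoryCas(String documentText) {
        this.documentText = documentText;
    }
}

/** Hypothetical contract the editor would call; NOT the Cas Editor's real extension point. */
interface CasProvider {
    Optional<InMemoryCas> getCas(String documentId);
}

/** Provider owned by the hosting tooling: serves CASes straight from memory. */
final class ToolingCasProvider implements CasProvider {
    private final Map<String, InMemoryCas> results = new HashMap<>();

    /** Called by the tooling after an analysis run finishes. */
    void publish(String documentId, InMemoryCas cas) {
        results.put(documentId, cas);
    }

    @Override
    public Optional<InMemoryCas> getCas(String documentId) {
        return Optional.ofNullable(results.get(documentId));
    }
}
```

The design point is the direction of the dependency: the editor only knows the provider contract, while the tooling (or a separate project-model plugin) supplies the implementation.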

Re: Document collections [was: Re: Building the eclipse update site]

Posted by Thilo Goetz <tw...@gmx.de>.
Jörn Kottmann wrote:
> Thilo Goetz wrote:
>> Jörn Kottmann wrote:
>>  
>>> Thilo Goetz wrote:
>>>    
>>>> Jörn Kottmann wrote:
>>>>  
>>>>      
>>>>> Jörn Kottmann wrote:
>>>>>           
>>>>>>> A collection of text documents that you can run
>>>>>>> analysis on.  If I understand correctly, the Cas
>>>>>>> Editor currently requires XCAS/XmiCAS files.  It
>>>>>>> would be nice if users could just add their text
>>>>>>> files and then either create annotations manually
>>>>>>> with the Cas Editor, or automatically by running
>>>>>>> some analysis and then view the results using the
>>>>>>> Cas Editor.  Then we could add results comparison
>>>>>>> etc.  See
>>>>>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> for a (outdated) description of what we have
>>>>>>> in-house.  It's geared more towards a business user
>>>>>>> than a developer, but the ideas of document collections
>>>>>>> and the development cycle are equally applicable.
>>>>>>> If there was enough interest here, I think that
>>>>>>> would be a good direction to go in.
>>>>>>>                       
>>>>>> Yes for me it sounds like the right way.
>>>>>> We could also use it for debugging an AE, then
>>>>>> a user defines a debug configuration and adds
>>>>>> the collection as document source.
>>>>>>                 
>>>>> How would you define the format of a document collection ?
>>>>>
>>>>> To open a CAS document the document itself and a type system
>>>>> for the document is needed.
>>>>>
>>>>> In the Cas Editor right now an Input Collection is a Corpus folder
>>>>> which
>>>>> contains xmi/xcas files
>>>>> in one directory together with the project type system the files
>>>>> can be
>>>>> loaded by UIMA. Though
>>>>> it has be criticized for not allowing sub directories for structuring
>>>>> its documents.
>>>>>
>>>>> Jörn
>>>>>             
>>>> That's perfectly fine, we do this in a similar way.
>>>> What would be good though is to distinguish between
>>>> text documents and "CAS documents" (be they XCAS, XMI
>>>> or some other format).  So you could start your work
>>>> by importing some text documents, then annotate them
>>>> in various ways (manually, or with coded annotators).
>>>> The CASes would reside in a different folder, and you
>>>> could derive any number of CAS collections from the
>>>> same set of source text documents.  We find that way
>>>> of working very convenient.
>>>>       
>>> We could reuse the code which is in the Cas Editor right
>>> now and move it into a new plugin which provides the document
>>> collections and type system to other plugins.
>>>
>>> The Cas Editor should be independent of the project model because
>>> people who use the Cas Editor do not necessarily want to it.
>>>     
>>
>> +1, couldn't agree more.  In fact, I would like to integrate
>> the CAS editor into our tooling, that would be a good test
>> case how independent it is.  I don't know when I'll get around
>> to playing with that, but it's definitely on my to do list.
>>   
> Ok, then lets split the Cas Editor into the editing part and project
> model. For the project model part we have to create a new eclipse
> project, e.g. uimaj-ep-base. The remaining Cas Editor should be independent
> of the project model which means uimaj-ep-base depends on the Cas Editor
> (to add the
> document provider extension to it).
> 
> After we are done with that, we can look into uimaj-ep-base and see how
> it fits our needs
> and how it can be used by the other eclipse based tooling.
> 
> Jörn

+1, that would be great.  I would like to be able to use
the CAS Editor sometimes not to edit CASes, but to simply
display them.  For example, you could imagine some Eclipse
tooling that runs UIMA analysis and displays the results
in the CAS Editor, without first materializing the CAS on
disk.  So with your proposed changes, I should be able to
do that, right?

--Thilo

Re: Document collections [was: Re: Building the eclipse update site]

Posted by Jörn Kottmann <ko...@gmail.com>.
Thilo Goetz wrote:
> Jörn Kottmann wrote:
>   
>> Thilo Goetz wrote:
>>     
>>> Jörn Kottmann wrote:
>>>  
>>>       
>>>> Jörn Kottmann wrote:
>>>>    
>>>>         
>>>>>> A collection of text documents that you can run
>>>>>> analysis on.  If I understand correctly, the Cas
>>>>>> Editor currently requires XCAS/XmiCAS files.  It
>>>>>> would be nice if users could just add their text
>>>>>> files and then either create annotations manually
>>>>>> with the Cas Editor, or automatically by running
>>>>>> some analysis and then view the results using the
>>>>>> Cas Editor.  Then we could add results comparison
>>>>>> etc.  See
>>>>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>>>>
>>>>>>
>>>>>> for a (outdated) description of what we have
>>>>>> in-house.  It's geared more towards a business user
>>>>>> than a developer, but the ideas of document collections
>>>>>> and the development cycle are equally applicable.
>>>>>> If there was enough interest here, I think that
>>>>>> would be a good direction to go in.
>>>>>>           
>>>>>>             
>>>>> Yes for me it sounds like the right way.
>>>>> We could also use it for debugging an AE, then
>>>>> a user defines a debug configuration and adds
>>>>> the collection as document source.
>>>>>       
>>>>>           
>>>> How would you define the format of a document collection ?
>>>>
>>>> To open a CAS document the document itself and a type system
>>>> for the document is needed.
>>>>
>>>> In the Cas Editor right now an Input Collection is a Corpus folder which
>>>> contains xmi/xcas files
>>>> in one directory together with the project type system the files can be
>>>> loaded by UIMA. Though
>>>> it has be criticized for not allowing sub directories for structuring
>>>> its documents.
>>>>
>>>> Jörn
>>>>     
>>>>         
>>> That's perfectly fine, we do this in a similar way.
>>> What would be good though is to distinguish between
>>> text documents and "CAS documents" (be they XCAS, XMI
>>> or some other format).  So you could start your work
>>> by importing some text documents, then annotate them
>>> in various ways (manually, or with coded annotators).
>>> The CASes would reside in a different folder, and you
>>> could derive any number of CAS collections from the
>>> same set of source text documents.  We find that way
>>> of working very convenient.
>>>       
>> We could reuse the code which is in the Cas Editor right
>> now and move it into a new plugin which provides the document
>> collections and type system to other plugins.
>>
>> The Cas Editor should be independent of the project model because
>> people who use the Cas Editor do not necessarily want to it.
>>     
>
> +1, couldn't agree more.  In fact, I would like to integrate
> the CAS editor into our tooling, that would be a good test
> case how independent it is.  I don't know when I'll get around
> to playing with that, but it's definitely on my to do list.
>   
OK, then let's split the Cas Editor into the editing part and the project
model. For the project model part we have to create a new Eclipse
project, e.g. uimaj-ep-base. The remaining Cas Editor should be independent
of the project model, which means uimaj-ep-base depends on the Cas Editor
(to add the document provider extension to it).

After we are done with that, we can look into uimaj-ep-base and see how
it fits our needs and how it can be used by the other Eclipse-based tooling.

Jörn

Re: Document collections [was: Re: Building the eclipse update site]

Posted by Thilo Goetz <tw...@gmx.de>.
Jörn Kottmann wrote:
> Thilo Goetz wrote:
>> Jörn Kottmann wrote:
>>  
>>> Jörn Kottmann wrote:
>>>    
>>>>> A collection of text documents that you can run
>>>>> analysis on.  If I understand correctly, the Cas
>>>>> Editor currently requires XCAS/XmiCAS files.  It
>>>>> would be nice if users could just add their text
>>>>> files and then either create annotations manually
>>>>> with the Cas Editor, or automatically by running
>>>>> some analysis and then view the results using the
>>>>> Cas Editor.  Then we could add results comparison
>>>>> etc.  See
>>>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>>>
>>>>>
>>>>> for a (outdated) description of what we have
>>>>> in-house.  It's geared more towards a business user
>>>>> than a developer, but the ideas of document collections
>>>>> and the development cycle are equally applicable.
>>>>> If there was enough interest here, I think that
>>>>> would be a good direction to go in.
>>>>>           
>>>> Yes for me it sounds like the right way.
>>>> We could also use it for debugging an AE, then
>>>> a user defines a debug configuration and adds
>>>> the collection as document source.
>>>>       
>>> How would you define the format of a document collection ?
>>>
>>> To open a CAS document the document itself and a type system
>>> for the document is needed.
>>>
>>> In the Cas Editor right now an Input Collection is a Corpus folder which
>>> contains xmi/xcas files
>>> in one directory together with the project type system the files can be
>>> loaded by UIMA. Though
>>> it has be criticized for not allowing sub directories for structuring
>>> its documents.
>>>
>>> Jörn
>>>     
>>
>> That's perfectly fine, we do this in a similar way.
>> What would be good though is to distinguish between
>> text documents and "CAS documents" (be they XCAS, XMI
>> or some other format).  So you could start your work
>> by importing some text documents, then annotate them
>> in various ways (manually, or with coded annotators).
>> The CASes would reside in a different folder, and you
>> could derive any number of CAS collections from the
>> same set of source text documents.  We find that way
>> of working very convenient.
> 
> We could reuse the code which is in the Cas Editor right
> now and move it into a new plugin which provides the document
> collections and type system to other plugins.
> 
> The Cas Editor should be independent of the project model because
> people who use the Cas Editor do not necessarily want to it.

+1, couldn't agree more.  In fact, I would like to integrate
the CAS editor into our tooling, that would be a good test
case how independent it is.  I don't know when I'll get around
to playing with that, but it's definitely on my to do list.

--Thilo

> 
> Jörn


Re: Document collections [was: Re: Building the eclipse update site]

Posted by Jörn Kottmann <ko...@gmail.com>.
Thilo Goetz wrote:
> Jörn Kottmann wrote:
>   
>> Jörn Kottmann wrote:
>>     
>>>> A collection of text documents that you can run
>>>> analysis on.  If I understand correctly, the Cas
>>>> Editor currently requires XCAS/XmiCAS files.  It
>>>> would be nice if users could just add their text
>>>> files and then either create annotations manually
>>>> with the Cas Editor, or automatically by running
>>>> some analysis and then view the results using the
>>>> Cas Editor.  Then we could add results comparison
>>>> etc.  See
>>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>>
>>>> for a (outdated) description of what we have
>>>> in-house.  It's geared more towards a business user
>>>> than a developer, but the ideas of document collections
>>>> and the development cycle are equally applicable.
>>>> If there was enough interest here, I think that
>>>> would be a good direction to go in.
>>>>   
>>>>         
>>> Yes for me it sounds like the right way.
>>> We could also use it for debugging an AE, then
>>> a user defines a debug configuration and adds
>>> the collection as document source.
>>>       
>> How would you define the format of a document collection ?
>>
>> To open a CAS document the document itself and a type system
>> for the document is needed.
>>
>> In the Cas Editor right now an Input Collection is a Corpus folder which
>> contains xmi/xcas files
>> in one directory together with the project type system the files can be
>> loaded by UIMA. Though
>> it has be criticized for not allowing sub directories for structuring
>> its documents.
>>
>> Jörn
>>     
>
> That's perfectly fine, we do this in a similar way.
> What would be good though is to distinguish between
> text documents and "CAS documents" (be they XCAS, XMI
> or some other format).  So you could start your work
> by importing some text documents, then annotate them
> in various ways (manually, or with coded annotators).
> The CASes would reside in a different folder, and you
> could derive any number of CAS collections from the
> same set of source text documents.  We find that way
> of working very convenient.

We could reuse the code which is in the Cas Editor right
now and move it into a new plugin which provides the document
collections and type system to other plugins.

The Cas Editor should be independent of the project model because
people who use the Cas Editor do not necessarily want to use it.

Jörn