Posted to user@uima.apache.org by Greg Holmberg <ho...@comcast.net> on 2009/05/19 19:02:41 UTC

Re: document structure (was: Discussion of next UIMA release)

Indeed, the structure is important to linguistic analysis.  For example,  
imagine you have a table with three cells, containing the text "1996",  
"Honda", and "Camry".  If the cells are properly treated as sentence or  
paragraph boundaries, then entity extraction would produce a year, a  
company, and a vehicle.  If the structure is stripped and just the plain  
text is analyzed, then you get one entity, a vehicle, "1996 Honda Camry".   
Which is not exactly the same thing.

I feel that the lack of any standard in UIMA regarding the structure of  
the document being analyzed (that is, beyond simply plain text) makes it  
pretty much impossible to combine annotators from different sources--one  
of the primary justifications of UIMA, in my opinion.

I sketched a possible solution to this on the wiki  
(http://cwiki.apache.org/UIMA/uima-sandbox-components.html, see "Document  
model") back in 2007, but it didn't generate much interest.  There's also  
a proposal for document properties, beyond the simple  
SourceDocumentInformation class.

Greg Holmberg

On Tue, 19 May 2009 09:34:14 -0700, Manuel Fiorelli  
<ma...@gmail.com> wrote:
> I would like to see a well-established way to analyze semi-structured
> documents, such as (X)HTML pages. UIMA shouldn't provide its own
> parser, but at least a type system (like uima.cas) to represent a DOM
> Document within a CAS instance (the simplest solution is to represent
> element nodes as feature structures and text nodes as annotations on
> the plain text, but I suspect there are more convenient solutions).
>
> When the analysis function doesn't rely upon the document structure,
> there should be a way to skip most of the markup and iterate on the
> blocks. I think that we cannot work directly on the plain text, since
> the loss of information could lead to misinterpretations. For example,
> in the following fragment
>
> <p>First paragrapher</p><p>Second paragrapher</p>
>
> the plain text would be
>
> First paragrapherSecond paragrapher
>
> where "paragrapherSecond" is an error in the interpretation of the  
> document.
>
> Manuel Fiorelli
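
The fragment quoted above can be reproduced with a few lines of Java. This is only a sketch under stated assumptions: the class name BlockAwareText, the short list of block-level elements, and the choice of a newline as separator are all hypothetical and are not part of UIMA; the input must be well-formed XML (hence the wrapping <body> element); the parser is the StAX reader that ships with the JDK.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not part of UIMA): extracts text from well-formed XHTML. */
final class BlockAwareText {
    // Assumed, deliberately short list of block-level elements.
    private static final Set<String> BLOCKS = Set.of("p", "div", "td", "th", "li");

    /**
     * Returns the text content of the document. When keepBlockBoundaries is
     * true, a newline is appended after every block-level element so that
     * adjacent blocks do not run together.
     */
    static String extract(String xhtml, boolean keepBlockBoundaries) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xhtml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && keepBlockBoundaries
                        && BLOCKS.contains(reader.getLocalName())) {
                    text.append('\n');
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return text.toString();
    }
}
```

Calling extract("<body><p>First paragrapher</p><p>Second paragrapher</p></body>", false) yields the run-together text from the message above, while passing true keeps the two paragraphs on separate lines. The same treatment applies to table cells, since "td" is in the assumed block list.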

Re: document structure (was: Discussion of next UIMA release)

Posted by Greg Holmberg <ho...@comcast.net>.

On Tue, 19 May 2009 11:04:49 -0700, Manuel Fiorelli  
<ma...@gmail.com> wrote:
> I'm happy to see I am not the only one who feels this feature would be
> useful. I saw that in your model, every node is an annotation, which
> makes it easy to implement the property "textContent", which returns
> the text contained in an Element.
>
> Also, support for PDF (and other document formats) would be an
> important addition...
>
> Manuel Fiorelli

For PDF filtering, check out this open-source project:  
http://aperture.sourceforge.net

This handles PDF, HTML, XML, RTF, Office, OpenOffice, Corel, email, ical.  
It also provides crawlers.  It's built on other open-source libraries,  
such as POI and PDFBox, but adds the ability to produce XML with RDF  
elements.  The RDF could be represented in the document model I proposed.

Greg

Re: document structure (was: Discussion of next UIMA release)

Posted by Manuel Fiorelli <ma...@gmail.com>.

2009/5/19 Greg Holmberg <ho...@comcast.net>:
> I sketched a possible solution to this on the wiki
> (http://cwiki.apache.org/UIMA/uima-sandbox-components.html, see "Document
> model") back in 2007, but it didn't generate much interest.  There's also a
> proposal for document properties, beyond the simple
> SourceDocumentInformation class.

I'm happy to see I am not the only one who feels this feature would be
useful. I saw that in your model, every node is an annotation, which
makes it easy to implement the property "textContent", which returns
the text contained in an Element.

Also, support for PDF (and other document formats) would be an
important addition...

Manuel Fiorelli

Re: document structure (was: Discussion of next UIMA release)

Posted by Kameron Cole <ka...@us.ibm.com>.

How would DITA play into this?  It seems to me that whether the community
adopts it or not, DITA is a de facto standard for document structure.
Further, I clearly see millions of applications for text analytics and
DITA.


** ** ** **
Kameron Arthur Cole
Senior IT Specialist, Managing Consultant
IBM Information Management Lab Services.
kameroncole@us.ibm.com


home office: 305-831-4058 / mobile office: 305.905.4112 / fax: 845.491.4052


                                                                           
From: Eddie Epstein <eaepstein@gmail.com>
To: uima-user@incubator.apache.org
Date: 05/19/2009 06:04 PM
Subject: Re: document structure (was: Discussion of next UIMA release)
Please respond to: uima-user@incubator.apache.org

Hi Greg,

Since your original proposal back in 2007 there has been a growing
effort to add annotators to the project. Do you have any components
that use the proposed document model type system, say a collection
reader, that you would be willing to submit?

Regards,
Eddie

On Tue, May 19, 2009 at 1:02 PM, Greg Holmberg <ho...@comcast.net> wrote:
> I feel that the lack of any standard in UIMA regarding the structure of the
> document being analyzed (that is, beyond simply plain text) makes it pretty
> much impossible to combine annotators from different sources--one of the
> primary justifications of UIMA, in my opinion.
>
> I sketched a possible solution to this on the wiki
> (http://cwiki.apache.org/UIMA/uima-sandbox-components.html, see "Document
> model") back in 2007, but it didn't generate much interest.  There's also a
> proposal for document properties, beyond the simple
> SourceDocumentInformation class.
>
>

Re: TikaAnnotator (was: document structure)

Posted by Tong Fin <to...@gmail.com>.

Since we have some users using this project, it may be a good candidate for
graduation from sandbox.

Opinions ?

-- Tong

On Fri, May 22, 2009 at 3:58 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> Julien Nioche wrote:
>
>> Hi,
>>
>> I contributed an annotator to the sandbox some time ago which uses Tika to
>> convert original markup into UIMA annotations. It does not seem to be listed
>> on the website but it should be in the SVN repository of the sandbox.
>>
>> Tika supports numerous formats such as PDF, XML, HTML
>>
> I checked in the code 4 months ago. Please have a look at it to make
> sure everything is as intended.
>
> Here is the svn link:
> http://svn.apache.org/viewvc/incubator/uima/sandbox/trunk/TikaAnnotator/
>
> Jörn
>

Re: TikaAnnotator (was: document structure)

Posted by Jörn Kottmann <ko...@gmail.com>.

Julien Nioche wrote:
> Hi,
>
> I contributed an annotator to the sandbox some time ago which uses Tika to
> convert original markup into UIMA annotations. It does not seem to be listed
> on the website but it should be in the SVN repository of the sandbox.
>
> Tika supports numerous formats such as PDF, XML, HTML

I checked in the code 4 months ago. Please have a look at it to make
sure everything is as intended.

Here is the svn link:
http://svn.apache.org/viewvc/incubator/uima/sandbox/trunk/TikaAnnotator/

Jörn

Re: document structure

Posted by Marshall Schor <ms...@schor.com>.

I updated the UIMA website's sandbox page with this information.

-Marshall

Julien Nioche wrote:
> Hi Marshall,
>
> There is a description in the README.txt file from the TikaAnnotator
> repository, which I have slightly rewritten into the text below.
>
>
> *Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries. The TikaAnnotator uses Tika to generate annotations representing
> the original markup of a document and to extract its text and metadata. It
> consists of three resources:
>
> - FileSystemCollectionReader: similar to the one in the UIMA examples, but
> uses Tika to extract the text from binary documents and generates
> annotations to represent the markup
>
> - MarkupAnnotator: takes the original content from a view and generates a
> new view containing the extracted text with markup annotations
>
> - TikaWrapper: utility class for populating a CAS from a binary
> document; used by the FileSystemCollectionReader*
>
>
> Best,
>
> J.
>
>

Re: document structure

Posted by Julien Nioche <li...@gmail.com>.

Hi Marshall,

There is a description in the README.txt file from the TikaAnnotator
repository, which I have slightly rewritten into the text below.

*Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. The TikaAnnotator uses Tika to generate annotations representing
the original markup of a document and to extract its text and metadata. It
consists of three resources:

- FileSystemCollectionReader: similar to the one in the UIMA examples, but
uses Tika to extract the text from binary documents and generates
annotations to represent the markup

- MarkupAnnotator: takes the original content from a view and generates a
new view containing the extracted text with markup annotations

- TikaWrapper: utility class for populating a CAS from a binary
document; used by the FileSystemCollectionReader*

Best,

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/5/22 Marshall Schor <ms...@schor.com>

> Hi Julien,
>
> Can you write up a little something and submit a patch to the website?
>
> -Marshall
>
> Julien Nioche wrote:
> > Hi,
> >
> > I contributed an annotator to the sandbox some time ago which uses Tika to
> > convert original markup into UIMA annotations. It does not seem to be listed
> > on the website but it should be in the SVN repository of the sandbox.
> >
> > Tika supports numerous formats such as PDF, XML, HTML etc...
> >
> > Julien
> >
> >
>

Re: document structure

Posted by Marshall Schor <ms...@schor.com>.

Hi Julien,

Can you write up a little something and submit a patch to the website?

-Marshall

Julien Nioche wrote:
> Hi,
>
> I contributed an annotator to the sandbox some time ago which uses Tika to
> convert original markup into UIMA annotations. It does not seem to be listed
> on the website but it should be in the SVN repository of the sandbox.
>
> Tika supports numerous formats such as PDF, XML, HTML etc...
>
> Julien
>
>

Re: document structure (was: Discussion of next UIMA release)

Posted by Julien Nioche <li...@gmail.com>.

Hi,

I contributed an annotator to the sandbox some time ago which uses Tika to
convert original markup into UIMA annotations. It does not seem to be listed
on the website but it should be in the SVN repository of the sandbox.

Tika supports numerous formats such as PDF, XML, HTML etc...

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/5/21 Greg Holmberg <ho...@comcast.net>

> On Tue, 19 May 2009 15:04:28 -0700, Eddie Epstein <ea...@gmail.com>
> wrote:
>
>> Since your original proposal back in 2007 there has been a growing
>> effort to add annotators to the project. Do you have any components
>> that use the proposed document model type system, say a collection
>> reader, that you would be willing to submit?
>>
>
> I did use this schema in a prototype.  I used the StAX parser to convert
> XML to this annotation structure over plain text.  Since the proposed schema
> loses no XML information, the XML can be reproduced from the CAS, if
> desired. Not byte-for-byte, since carriage returns may come out differently,
> but certainly functionally equivalent XML.
>
> HTML was first cleaned up with HTMLCleaner, converted to XML (XHTML), and
> then sent through the StAX parser and into the CAS.
>
> For other formats, I used a commercial filtering product to convert PDF,
> Office, etc. to HTML, and then through the above process.
>
> An open-source solution to filtering binary formats could use Aperture to
> produce XML+RDF, and then through the above process.
>
> The annotators I used didn't understand the CAS, only HTML, so I had to
> keep that in addition to the CAS to feed to those annotators.  The offsets
> these annotators returned were then relative to the HTML, so I kept a map of
> offset ranges between the HTML and the plain-text in the CAS.  This let me
> translate the offsets returned from the annotators against the HTML into
> offsets against the CAS, so when I created annotations they pointed to the
> right place.
>
> So I can't contribute the commercial filter code (we don't have source code
> anyway).  I may be able to contribute the XML and HTML converters, since
> that code was never shipped as a product. However, it will require approval
> from some EVP three levels above me.  I will look into it, but don't hold
> your breath.
>
>
>
> Greg Holmberg
>

Re: document structure (was: Discussion of next UIMA release)

Posted by Greg Holmberg <ho...@comcast.net>.

On Tue, 19 May 2009 15:04:28 -0700, Eddie Epstein <ea...@gmail.com>  
wrote:
> Since your original proposal back in 2007 there has been a growing
> effort to add annotators to the project. Do you have any components
> that use the proposed document model type system, say a collection
> reader, that you would be willing to submit?

I did use this schema in a prototype.  I used the StAX parser to convert  
XML to this annotation structure over plain text.  Since the proposed  
schema loses no XML information, the XML can be reproduced from the CAS,  
if desired. Not byte-for-byte, since carriage returns may come out  
differently, but certainly functionally equivalent XML.
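
To illustrate the general idea of turning XML into annotations over plain text, here is a minimal sketch using the StAX reader from the JDK. It is not the prototype's code and not the schema proposed on the wiki: the class StandoffDocument and its ElementSpan type are hypothetical stand-ins for a CAS and its annotations, and the sketch keeps only element names and offsets (attributes, namespaces, comments and processing instructions are dropped, so unlike the proposed schema it does lose XML information).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical stand-in for a CAS: plain text plus element spans over it. */
final class StandoffDocument {
    /** One XML element, recorded as a [begin, end) span over the plain text. */
    static final class ElementSpan {
        final String name;
        final int begin;
        int end;
        ElementSpan(String name, int begin) {
            this.name = name;
            this.begin = begin;
        }
    }

    final StringBuilder text = new StringBuilder();
    final List<ElementSpan> elements = new ArrayList<>();

    /** Parses well-formed XML, accumulating text and one span per element. */
    static StandoffDocument parse(String xml) {
        StandoffDocument doc = new StandoffDocument();
        Deque<ElementSpan> open = new ArrayDeque<>(); // elements not yet closed
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        ElementSpan span =
                            new ElementSpan(reader.getLocalName(), doc.text.length());
                        doc.elements.add(span); // document order
                        open.push(span);
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        doc.text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        open.pop().end = doc.text.length();
                        break;
                    default:
                        break; // comments, PIs etc. are ignored in this sketch
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return doc;
    }

    /** DOM-like textContent: the plain text covered by an element's span. */
    String textContent(ElementSpan span) {
        return text.substring(span.begin, span.end);
    }
}
```

Because every element is a span over the text, the "textContent" property mentioned earlier in the thread reduces to a substring call, and nesting can be recovered from span containment plus document order.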

HTML was first cleaned up with HTMLCleaner, converted to XML (XHTML), and  
then sent through the StAX parser and into the CAS.

For other formats, I used a commercial filtering product to convert PDF,  
Office, etc. to HTML, and then through the above process.

An open-source solution to filtering binary formats could use Aperture to  
produce XML+RDF, and then through the above process.

The annotators I used didn't understand the CAS, only HTML, so I had to  
keep that in addition to the CAS to feed to those annotators.  The offsets  
these annotators returned were then relative to the HTML, so I kept a map  
of offset ranges between the HTML and the plain-text in the CAS.  This let  
me translate the offsets returned from the annotators against the HTML  
into offsets against the CAS, so when I created annotations they pointed  
to the right place.
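
The offset translation described above could look roughly like the following. This is a sketch under loud assumptions, not the prototype's code: the class OffsetMap and its method names are hypothetical, and the tag stripper is deliberately naive (every '<' is assumed to open a tag that ends at the next '>', with no entities, comments or scripts to worry about).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: maps offsets in the HTML to offsets in the stripped text. */
final class OffsetMap {
    /** A run of characters that appears unchanged in both HTML and plain text. */
    private static final class Run {
        final int htmlBegin;
        final int plainBegin;
        final int length;
        Run(int htmlBegin, int plainBegin, int length) {
            this.htmlBegin = htmlBegin;
            this.plainBegin = plainBegin;
            this.length = length;
        }
    }

    private final List<Run> runs = new ArrayList<>();
    private final StringBuilder plain = new StringBuilder();

    /** Strips tags from the HTML and records where each text run came from. */
    OffsetMap(String html) {
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                // Skip the tag (naive: up to the next '>', or to the end).
                int close = html.indexOf('>', i);
                i = (close < 0) ? html.length() : close + 1;
            } else {
                // Copy the text run up to the next tag and remember its origin.
                int next = html.indexOf('<', i);
                int end = (next < 0) ? html.length() : next;
                runs.add(new Run(i, plain.length(), end - i));
                plain.append(html, i, end);
                i = end;
            }
        }
    }

    String plainText() {
        return plain.toString();
    }

    /**
     * Translates an offset against the HTML into an offset against the plain
     * text. Both begin offsets and exclusive end offsets of a run are accepted;
     * offsets that fall inside a tag return -1.
     */
    int toPlain(int htmlOffset) {
        for (Run run : runs) {
            int delta = htmlOffset - run.htmlBegin;
            if (delta >= 0 && delta <= run.length) {
                return run.plainBegin + delta;
            }
        }
        return -1;
    }
}
```

For example, with new OffsetMap("<p>First</p><p>Second</p>"), an annotator that reports "Second" at HTML offsets 15 to 21 is translated to plain-text offsets 5 to 11. A linear scan over the runs is enough for a sketch; a binary search over htmlBegin would be the obvious refinement for long documents.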

So I can't contribute the commercial filter code (we don't have source  
code anyway).  I may be able to contribute the XML and HTML converters,  
since that code was never shipped as a product. However, it will require  
approval from some EVP three levels above me.  I will look into it, but  
don't hold your breath.

Greg Holmberg

Re: document structure (was: Discussion of next UIMA release)

Posted by Eddie Epstein <ea...@gmail.com>.

Hi Greg,

Since your original proposal back in 2007 there has been a growing
effort to add annotators to the project. Do you have any components
that use the proposed document model type system, say a collection
reader, that you would be willing to submit?

Regards,
Eddie

On Tue, May 19, 2009 at 1:02 PM, Greg Holmberg <ho...@comcast.net> wrote:
> I feel that the lack of any standard in UIMA regarding the structure of the
> document being analyzed (that is, beyond simply plain text) makes it pretty
> much impossible to combine annotators from different sources--one of the
> primary justifications of UIMA, in my opinion.
>
> I sketched a possible solution to this on the wiki
> (http://cwiki.apache.org/UIMA/uima-sandbox-components.html, see "Document
> model") back in 2007, but it didn't generate much interest.  There's also a
> proposal for document properties, beyond the simple
> SourceDocumentInformation class.
>
>