Posted to user@uima.apache.org by Ben Morgan <be...@gmail.com> on 2010/12/07 23:18:44 UTC

SimpleServer configuration with Sofas

Hey folks,

I've got a problem with the UIMA SimpleServer[1][5] not being able to
correctly run an aggregate analysis engine[6]. The aggregate AE does work
as expected, however, when I test it with the "UIMA CAS Visual Debugger",
the "UIMA Run AE" and the "UIMA Document Analyzer".

The analysis engine[2] is relatively simple (as of yet). It is composed
of the following components:

    AE PDF Text Extractor[3]
        :: gets a URL as the "initial view" and downloads
           the file, extracts the text and puts it in a new
           view by the name of "extractedText".
        -> Input Sofa: urlString
        -> Output Sofa: extractedText
    AE Email Annotator[4]
        :: simple annotator, just annotates email addresses.

When I run the aggregate analysis engine, it terminates before giving
any results with an error (taken from the Tomcat log file):

    SEVERE: Exception occurred
    org.apache.uima.analysis_engine.AnalysisEngineProcessException:
        Annotator processing failed.
    ...
    Caused by: org.apache.uima.cas.CASRuntimeException:
        No sofaFS with name plainText found.
    ...

"plainText" is the Sofa in the aggregate analysis engine which is linked
to the output of the PDF Text Extractor "extractedText".
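
To make the naming concrete, here is a small stand-alone Java model of the
Sofa mapping described above. It is only a sketch and does not use the UIMA
API: the class SofaMappingModel, its methods and the delegate keys
"pdfExtractor" and "emailAnnotator" are made up for illustration. The Email
Annotator entry assumes the layout given later in this thread, where its
_InitialView is fed from plainText.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of aggregate Sofa mapping. NOT the UIMA API: all names here are
 * hypothetical and exist only to illustrate the name lookup.
 */
final class SofaMappingModel {
    // Key: "<delegate key>/<component sofa name>", value: aggregate sofa name.
    private final Map<String, String> mappings = new HashMap<>();

    void map(String delegateKey, String componentSofaName, String aggregateSofaName) {
        mappings.put(delegateKey + "/" + componentSofaName, aggregateSofaName);
    }

    /** Name of the view in the CAS that the delegate actually reads or writes. */
    String resolve(String delegateKey, String componentSofaName) {
        // Names without a mapping pass through unchanged.
        return mappings.getOrDefault(delegateKey + "/" + componentSofaName, componentSofaName);
    }

    /** The mapping as described in this thread; the delegate keys are assumed. */
    static SofaMappingModel threadExample() {
        SofaMappingModel m = new SofaMappingModel();
        m.map("pdfExtractor", "urlString", "_InitialView");
        m.map("pdfExtractor", "extractedText", "plainText");
        m.map("emailAnnotator", "_InitialView", "plainText");
        return m;
    }
}
```

In this model, resolve("pdfExtractor", "extractedText") yields "plainText",
which is the name the error message reports as missing from the CAS.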

I took the aggregate analysis engine apart piece by piece, and I started
with the Email Annotator AE. That worked fine with the SimpleServer.

Then I tested the PDF Text Extractor (I changed the input view to
_InitialView). When I tested a URL, it came through as XML, but only
with the initial view and not with the extracted text. In fact, when
testing the text extractor otherwise, it would take around 3 seconds to
download the PDF file, while the SimpleServer sent back its results
immediately (so what is that all about? Does it not even run the code in
the function process()?).

That's my problem, and I wonder if there is something special you need
to do when there are views or different output sofas. I cannot for the
life of me figure out what is wrong and why it does not work.

Thanks for your help,
Ben Morgan

_______________________________________________________________________________

1:
http://uima.apache.org/downloads/sandbox/simpleServerUserGuide/simpleServerUserGuide.html

2: Aggregate AE Descriptor:
https://github.com/cassava/bibrefext/blob/uima/UIMA/workspace/ReferenceAnnotator/desc/referenceAnnotatorDescriptor.xml

3: PDF Extractor descriptor:
https://github.com/cassava/bibrefext/blob/uima/UIMA/workspace/ReferenceAnnotator/desc/PDFTextExtractorDescriptor.xml
PDF Extractor java source:
https://github.com/cassava/bibrefext/blob/uima/UIMA/workspace/ReferenceAnnotator/src/de/uniwue/informatik/bibrefext/pdf/TextExtractor.java

4: Email Annotator descriptor:
https://github.com/cassava/bibrefext/blob/uima/UIMA/workspace/ReferenceAnnotator/desc/EmailAnnotatorDescriptor.xml

5: SimpleServer web.xml:
https://github.com/cassava/bibrefext/blob/uima/UIMA/workspace/ReferenceWebService/WebContent/WEB-INF/web.xml

6: Complete WAR file:
https://github.com/downloads/cassava/bibrefext/bibrefext.war

Re: SimpleServer configuration with Sofas

Posted by Ben Morgan <be...@gmail.com>.

> 
> When running the aggregate from CVD, is it running from the pear file
> directly as it is from simpleserver?
> 

When running the aggregate from the CVD, I just loaded the descriptor file, but
I also managed to do that with the SimpleServer, and I got the same error, so I
figured that that wasn't the problem. (See link 5 for example).

In short: no, but I tried running it without the Pear on the SimpleServer.

Ben

Re: SimpleServer configuration with Sofas

Posted by Eddie Epstein <ea...@gmail.com>.

When running the aggregate from CVD, is it running from the pear file
directly as it is from simpleserver?

Eddie

Re: SimpleServer configuration with Sofas

Posted by Ben Morgan <be...@gmail.com>.

Eddie Epstein <ea...@...> writes:
> With sofamapping, the toplevel <aggregateSofaName> is the actual name
> of the view in the CAS. Here one should find the views named _InitialView
> and extractedText.

Didn't seem to make a difference though...

> Dumping the CAS to stdout is basically a one liner:
> 
>     try {
>       XmiCasSerializer.serialize(aCAS, System.out);
>     } catch (SAXException e1) {
>       e1.printStackTrace();
>     }

Ok, in the process of doing this, I found out that I made a mistake in another
part of the code, and the statements were not being logged by Tomcat (I suppose
that is obvious, if the logger outputs to stdout or stderr). So in my process()
function, not all the code was being executed, which caused the problem with the
Sofa "plainText" not being found.
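
A minimal sketch of one way to make such a silent early exit visible, using
only java.util.logging from the JDK. The class LoggedStep and its run()
method are hypothetical helpers, not part of UIMA or of the code in this
thread; in a real annotator the same pattern would sit inside process().
Where the records end up depends on how logging is configured in the
container.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: logs the start, the end and any failure of a step. */
final class LoggedStep {
    private static final Logger LOG = Logger.getLogger(LoggedStep.class.getName());

    static void run(String stepName, Runnable step) {
        LOG.info("start: " + stepName);
        try {
            step.run();
        } catch (RuntimeException e) {
            // Passing the exception keeps its stack trace in the log record.
            LOG.log(Level.SEVERE, "failed: " + stepName, e);
            // Rethrow so that the caller still sees the failure.
            throw e;
        }
        LOG.info("done: " + stepName);
    }
}
```

A "start" record without a matching "done" record then points at the step
that did not run to completion.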

I'm still having a problem with only the default View/Sofa being shown after the
processing is done, but there is an error in the Tomcat log files:

    Caused by: org.apache.uima.cas.CASRuntimeException: The JCAS cover class
        "de.uniwue.informatik.bibrefext.email.Email_Type" could not be loaded.

But I need to look into that myself a little bit further.
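
The thread does not say why the cover class could not be loaded. One thing
that can be checked, assuming the cause is class visibility inside the web
application, is whether the class can be resolved by the class loader in
use. The helper below is a hypothetical diagnostic using only the JDK; it is
not part of UIMA or SimpleServer.

```java
/**
 * Hypothetical diagnostic: reports whether a class name can be resolved by
 * a given class loader, without initializing the class.
 */
final class ClassVisibility {
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            // initialize=false: only locate and load the class, run no static code.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // In a web application the context class loader is usually the one
        // that sees the application's own classes and libraries.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        String name = "de.uniwue.informatik.bibrefext.email.Email_Type";
        System.out.println(name + " loadable: " + isLoadable(name, loader));
    }
}
```

Calling isLoadable() from inside the deployed application (for example from
the annotator's initialization code) shows what that environment can see,
which may differ from what CVD sees when started from the workspace.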

My main discovery is that I may have been looking in the wrong spot for a
long while, like a dog barking up the wrong tree. Sorry 'bout that.

-Ben

Re: SimpleServer configuration with Sofas

Posted by Eddie Epstein <ea...@gmail.com>.

On Thu, Dec 9, 2010 at 5:12 AM, Ben Morgan <be...@gmail.com> wrote:
> Eddie wrote:
>> As I understand it, your aggregate has two delegates: the first is
>> sofa aware, reads the input data from the _InitialView and creates a
>> new view called plainText; the second delegate is sofa unaware and
>> expects to get the plainText view passed to it.
>
> The aggregate is not quite like that:
>    _InitialView
>        -> urlString (input sofa of pdfextractor)
>    plainText
>        <- "extractedText" output sofa/view from pdfextractor
>        -> goes to "_InitialView" input of email annotator

With sofamapping, the toplevel <aggregateSofaName> is the actual name
of the view in the CAS. Here one should find the views named _InitialView
and extractedText.


>> This all works fine when run from CVD. When run from SimpleServer,
>> processing crashes trying to access plainText. Presumably this is
>> happening when the second delegate is being called. This could happen
>> if for some reason sofamapping is broken when running under
>> SimpleServer.
>>
>> It would be useful to:
>> 1. know that the 1st delegate is running ok, and the problem is on the 2nd
>
> When I tested the 1st delegate alone with SimpleServer, it did not output
> what I expected. In the SimpleServer UI, you can choose what kind of output
> you want, and I chose "Inline XML". I got as a Result then:
>
>    <Result> http://input-text/url-to-the.pdf </Result>
>
> But this result came back faster than the component could have processed the
> input (it should have taken 3 seconds at least, instead of a few milliseconds).

I'm assuming that SimpleServer is only looking in _InitialView to
extract results.


> So the 1st delegate independently did not work, while the 2nd delegate did.
>
>> 2. dumping the Cas contents after calling the first delegate to see
>> exactly what is in there. This can be done by putting XmiSerialization
>> code at the end of the 1st delegate.
>
> I assume that I would do this in my own code. (As in, I wouldn't touch the
> code of the SimpleServer). Alright, I'll try to do that and see what comes out.

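Dumping the CAS to stdout is basically a one liner:

    // needs org.apache.uima.cas.impl.XmiCasSerializer and org.xml.sax.SAXException
    try {
      // writes the whole CAS, including all views, as XMI
      XmiCasSerializer.serialize(aCAS, System.out);
    } catch (SAXException e1) {
      e1.printStackTrace();
    }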

Re: SimpleServer configuration with Sofas

Posted by Ben Morgan <be...@gmail.com>.

Eddie wrote:
> As I understand it, your aggregate has two delegates: the first is
> sofa aware, reads the input data from the _InitialView and creates a
> new view called plainText; the second delegate is sofa unaware and
> expects to get the plainText view passed to it.

The aggregate is not quite like that:
    _InitialView
        -> urlString (input sofa of pdfextractor)
    plainText
        <- "extractedText" output sofa/view from pdfextractor
        -> goes to "_InitialView" input of email annotator

> This all works fine when run from CVD. When run from SimpleServer,
> processing crashes trying to access plainText. Presumably this is
> happening when the second delegate is being called. This could happen
> if for some reason sofamapping is broken when running under
> SimpleServer.
> 
> It would be useful to:
> 1. know that the 1st delegate is running ok, and the problem is on the 2nd

When I tested the 1st delegate alone with SimpleServer, it did not output
what I expected. In the SimpleServer UI, you can choose what kind of output
you want, and I chose "Inline XML". I got as a Result then:

    <Result> http://input-text/url-to-the.pdf </Result>

But this result came back faster than the component could have processed the
input (it should have taken 3 seconds at least, instead of a few milliseconds).
So the 1st delegate independently did not work, while the 2nd delegate did.

> 2. dumping the Cas contents after calling the first delegate to see
> exactly what is in there. This can be done by putting XmiSerialization
> code at the end of the 1st delegate.

I assume that I would do this in my own code. (As in, I wouldn't touch the
code of the SimpleServer). Alright, I'll try to do that and see what comes out.

Thanks a lot for your help!

- Ben

Re: SimpleServer configuration with Sofas

Posted by Eddie Epstein <ea...@gmail.com>.

>> To get the PDF extractor working, can't you
>> change the default view somehow, so that
>> CAS.getDocumentText() will retrieve the extracted
>> text?  I thought that was possible to make
>> non-Sofa aware annotators work with Sofa-aware
>> ones.  However, not sure.
>
> How would that work? Would I change the default view in the PDF Extractor java
> source code? (Can you even do that?) But wouldn't that be bad coding practice?

As I understand it, your aggregate has two delegates: the first is
sofa aware, reads the input data from the _InitialView and creates a
new view called plainText; the second delegate is sofa unaware and
expects to get the plainText view passed to it.

This all works fine when run from CVD. When run from SimpleServer,
processing crashes trying to access plainText. Presumably this is
happening when the second delegate is being called. This could happen
if for some reason sofamapping is broken when running under
SimpleServer.

It would be useful to:
1. know that the 1st delegate is running ok, and the problem is on the 2nd
2. dumping the Cas contents after calling the first delegate to see
exactly what is in there. This can be done by putting XmiSerialization
code at the end of the 1st delegate.

Eddie

Re: SimpleServer configuration with Sofas

Posted by Ben Morgan <be...@gmail.com>.

Thilo Götz <tw...@...> writes:
> the SimpleServer is not Sofa-aware, and neither
> am I ;-).  I don't think there should be an
> exception, though.  Can you please post the full
> stack trace, maybe that will help.

I didn't want to paste the whole error log here, so you can see it at GitHub:
http://tinyurl.com/catalina-error-log-txt

(-> https://github.com/cassava/bibrefext/blob/uima/UIMA/scratch/catalina_error_log.txt)

> To get the PDF extractor working, can't you
> change the default view somehow, so that
> CAS.getDocumentText() will retrieve the extracted
> text?  I thought that was possible to make
> non-Sofa aware annotators work with Sofa-aware
> ones.  However, not sure.

How would that work? Would I change the default view in the PDF Extractor java
source code? (Can you even do that?) But wouldn't that be bad coding practice?

- Ben

Re: SimpleServer configuration with Sofas

Posted by Thilo Götz <tw...@gmx.de>.

Hi Ben,

the SimpleServer is not Sofa-aware, and neither
am I ;-).  I don't think there should be an
exception, though.  Can you please post the full
stack trace, maybe that will help.

To get the PDF extractor working, can't you
change the default view somehow, so that
CAS.getDocumentText() will retrieve the extracted
text?  I thought that was possible to make
non-Sofa aware annotators work with Sofa-aware
ones.  However, not sure.

--Thilo
