
Posted to user@uima.apache.org by Bob Sizemore <si...@us.ibm.com> on 2011/03/16 21:33:32 UTC

Not seeing the document names in the Document Analyzer

I am working through the samples in the tutorial and ran the Document Analyzer
on the first sample.  There are 8 txt files in the data directory that are
processed.  What I see in the Document Analyzer is the following:
doc0.xmi
doc1.xmi
doc2.xmi
doc3.xmi
doc4.xmi
doc5.xmi
doc6.xmi

When I double-click an entry, the data appears correct, but the label in the
Document Analyzer is wrong; it should be the name of the file:

IBM_LifeSciences.txt
New_IBM_Fellows.txt
SeminarChallengesInSpeechRecognition.txt
.
.
.

Is there some setting I am missing, or is there a way to debug this?

Re: Not seeing the document names in the Document Analyzer

Posted by Marshall Schor <ms...@schor.com>.

Now posted as a Jira: https://issues.apache.org/jira/browse/UIMA-2097

-Marshall


Re: Not seeing the document names in the Document Analyzer

Posted by Marshall Schor <ms...@schor.com>.

OK - found the problem.

The Document analyzer uses a component "FileSystemCollectionReader" to read the
files. This component inserts into the CAS the name of the file being read,
using the code:

      // Also store location of source document in CAS. This information is critical
      // if CAS Consumers will need to know where the original document contents
      // are located.
      // For example, the Semantic Search CAS Indexer writes this information into the
      // search index that it creates, which allows applications that use the search
      // index to locate the documents that satisfy their semantic queries.
      SourceDocumentInformation srcDocInfo = new SourceDocumentInformation(jcas);
      srcDocInfo.setUri(file.getAbsoluteFile().toURL().toString());

This last line gets the source file name, which in your case is under

C:\Watson\UIMA sdk\apache-uima\examples\data

and the toURL converts the blank to "%20".

That then causes the serialization code to fail when it attempts to create the
file name, and as a result, the default file name is used.

I could reproduce this by making the source directory have a blank in it.

You can avoid this issue by pointing the Document Analyzer at a source directory
whose path name contains no blanks.
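
To make the "%20" effect concrete, here is a minimal sketch in plain Java. It is
not the UIMA code and not the fix tracked in the Jira; the class and method names
are made up for illustration, and it uses File.toURI() and java.net.URI rather
than the toURL() call quoted above. It only shows that a blank in a path shows up
as "%20" in the encoded form, that the raw path keeps the escape, and that the
decoded forms give the original name back.

```java
import java.io.File;
import java.net.URI;

// Hypothetical demo class for illustration only; not part of UIMA.
class BlankInPathDemo {

  // Encode a file location as a URI string. File.toURI() escapes a blank as %20.
  static String toUriString(File file) {
    return file.getAbsoluteFile().toURI().toString();
  }

  // The raw path is returned without decoding, so "%20" is still in it.
  static String rawPath(String uriString) {
    return URI.create(uriString).getRawPath();
  }

  // The decoded path has the blank restored.
  static String decodedPath(String uriString) {
    return URI.create(uriString).getPath();
  }

  // new File(URI) also decodes the escapes, so getName() is the original name.
  static File decodedFile(String uriString) {
    return new File(URI.create(uriString));
  }

  public static void main(String[] args) {
    // Relative path with a blank in the directory name; it need not exist.
    File source = new File("UIMA sdk", "IBM_LifeSciences.txt");
    String uri = toUriString(source);
    System.out.println("encoded : " + uri);
    System.out.println("raw path: " + rawPath(uri));
    System.out.println("decoded : " + decodedPath(uri));
    System.out.println("name    : " + decodedFile(uri).getName());
  }
}
```

Code that builds a File directly from the raw, still-escaped path ends up looking
for a directory literally named "UIMA%20sdk", which is the kind of mismatch that
can surface when a file name is derived from a stored location string.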

Cheers. -Marshall



Re: Not seeing the document names in the Document Analyzer

Posted by Marshall Schor <ms...@schor.com>.


On 3/22/2011 12:25 PM, Marshall Schor wrote:
> Here's an idea:
>
> The suffix doc1.xmi doc2.xmi, etc are produced when the XMI Cas Serializer is
> called with a null file name:
>
> uimaj-examples/src/main/java/org/apache/uima/examples/xmi/XmiWriterCasConsumer.java
>
> line 108-110:
>     if (outFile == null) {
>       outFile = new File(mOutputDir, "doc" + mDocNum++ + ".xmi");    
>     }
>
> The code above that has a try block that might be getting tripped up by the fact
> that your install point is in a path with a blank in it.
>
> Can you try installing into a path without a blank?

I tried this, and it also worked (with blanks in the file path) - so that's not
it...

I'll contact you off-list to debug this mystery. -Marshall

Re: Not seeing the document names in the Document Analyzer

Posted by Marshall Schor <ms...@schor.com>.

Here's an idea:

The names doc1.xmi, doc2.xmi, etc. are produced when the XMI Cas Serializer is
called with a null file name:

uimaj-examples/src/main/java/org/apache/uima/examples/xmi/XmiWriterCasConsumer.java

lines 108-110:
    if (outFile == null) {
      outFile = new File(mOutputDir, "doc" + mDocNum++ + ".xmi");
    }

The code just before those lines has a try block that might be getting tripped up
by the fact that your install point is in a path with a blank in it.
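
As a rough sketch of that "try, then fall back to docN.xmi" pattern, here is a
small self-contained Java class. It is not the XmiWriterCasConsumer source: the
class name OutputNamer, the method outputFileFor, and the choice to append ".xmi"
to the source file name are all assumptions made for illustration. It only shows
how a location string that cannot be parsed leaves outFile null, so the numbered
default name gets used.

```java
import java.io.File;
import java.net.URI;

// Hypothetical sketch; the class and method names are not UIMA's.
class OutputNamer {
  private final File outputDir;
  private int docNum = 0;

  OutputNamer(File outputDir) {
    this.outputDir = outputDir;
  }

  // Returns <source name>.xmi when the location can be parsed,
  // otherwise the numbered default doc0.xmi, doc1.xmi, ...
  File outputFileFor(String sourceUri) {
    File outFile = null;
    if (sourceUri != null) {
      try {
        // URI.create rejects an unescaped blank with IllegalArgumentException.
        String name = new File(URI.create(sourceUri)).getName();
        if (!name.isEmpty()) {
          outFile = new File(outputDir, name + ".xmi");
        }
      } catch (IllegalArgumentException e) {
        // Unusable location: fall through to the numbered default below.
      }
    }
    if (outFile == null) {
      outFile = new File(outputDir, "doc" + docNum++ + ".xmi");
    }
    return outFile;
  }

  public static void main(String[] args) {
    OutputNamer namer = new OutputNamer(new File("out"));
    System.out.println(namer.outputFileFor("file:/data/IBM_LifeSciences.txt").getName());
    System.out.println(namer.outputFileFor("file:/data/UIMA sdk/New_IBM_Fellows.txt").getName());
  }
}
```

Because the failure is swallowed in the catch block, nothing is reported to the
user; the only visible symptom is the generic name.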

Can you try installing into a path without a blank?

-Marshall


Re: Not seeing the document names in the Document Analyzer

Posted by Bob Sizemore <si...@us.ibm.com>.

Anybody have any ideas for me to try to get the doc analyzer showing the right
document names?

Re: Not seeing the document names in the Document Analyzer

Posted by Bob Sizemore <si...@us.ibm.com>.

I am running on Windows XP.  I have downloaded and installed the 2.3.1 UIMA SDK to
C:\Watson\UIMA sdk\apache-uima

I am using Eclipse SDK Version: 3.6.2 Build id: M20110210-1200

I used the update site and installed the UIMA 2.3.1 runtime and tools and
UIMA-AS Deployment Descriptor Editor 2.3.0.incubating

I added UIMA_HOME (C:/Watson/UIMA sdk/apache-uima/) to the classpath variables
under Java -> Build Path.

I am using the JRE in C:\Program Files\IBM\Java50, downloaded from IBM, and added
that as the JRE for Eclipse.

I followed the instructions in the UIMA setup to install the SDK and samples.

I set the JAVA_HOME and UIMA_HOME environment variables.

I imported the examples into Eclipse and ran the Document Analyzer from the
Run -> Run menu.  I also tried running it using the bat file and got the same
results.

Re: Not seeing the document names in the Document Analyzer

Posted by Marshall Schor <ms...@schor.com>.

I tried this on the current trunk, and it produced output which had the names of
the files, not doc0, etc.

I was running on Windows XP. 

I ran against the data in the examples dir: src/main/data

I ran using the descriptor in the examples dir:
src/main/descriptors/tutorial/ex1/RoomNumberAnnotator.xml

Can you provide more information so we can try to reproduce this issue?

-Marshall
