Posted to user@uima.apache.org by Bob Sizemore <si...@us.ibm.com> on 2011/03/16 21:33:32 UTC
Not seeing the document names in the Document Analyzer
I am working through the samples in the tutorial, and I ran the Document
Analyzer on the first sample. There are 8 txt files in the data directory that
are processed. What I see in the Document Analyzer is the following:
doc0.xmi
doc1.xmi
doc2.xmi
doc3.xmi
doc4.xmi
doc5.xmi
doc6.xmi
When I double-click an entry, the data appears correct, but the label in the
Document Analyzer is wrong; it should be the name of the file:
IBM_LifeSciences.txt
New_IBM_Fellows.txt
SeminarChallengesInSpeechRecognition.txt
.
.
.
Is there some setting I am missing, or is there a way to debug this?
Re: Not seeing the document names in the Document Analyzer
Posted by Marshall Schor <ms...@schor.com>.
Now posted as a Jira: https://issues.apache.org/jira/browse/UIMA-2097
-Marshall
Re: Not seeing the document names in the Document Analyzer
Posted by Marshall Schor <ms...@schor.com>.
OK - found the problem.
The Document analyzer uses a component "FileSystemCollectionReader" to read the
files. This component inserts into the CAS the name of the file being read,
using the code:
// Also store location of source document in CAS. This information is critical
// if CAS Consumers will need to know where the original document contents
// are located.
// For example, the Semantic Search CAS Indexer writes this information
// into the search index that it creates, which allows applications that use
// the search index to locate the documents that satisfy their semantic queries.
SourceDocumentInformation srcDocInfo = new SourceDocumentInformation(jcas);
srcDocInfo.setUri(file.getAbsoluteFile().toURL().toString());
This last line gets the source file location, in your case
C:\Watson\UIMA sdk\apache-uima\examples\data
and toURL converts the blank to "%20". That causes the serialization code to
fail when it attempts to create the output file name, and as a result the
default file name is used.
I could reproduce this by making the source directory have a blank in it.
You can avoid this issue by pointing the Document Analyzer at a source
directory that has no blanks in its path name.
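As an illustration of the encoding step described above, here is a minimal Java sketch. It is not the UIMA code: the class name SourceLocation and its two methods are hypothetical, and it uses only java.io.File and java.net.URI from the JDK. File.toURI() percent-encodes a blank as %20, URI.getRawPath() returns that encoded form, and URI.getPath() or new File(URI) decodes it again. Code that builds a file name from the encoded string without decoding it would typically end up with a path that does not exist.

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper for illustration only; not part of UIMA.
final class SourceLocation {

    // Encodes a file location as a URI string; a blank in the path becomes %20.
    static String toUriString(File file) {
        return file.getAbsoluteFile().toURI().toString();
    }

    // Recovers the file from a URI string produced by toUriString.
    // new File(URI) decodes %20 back into a blank.
    static File toFile(String uriString) {
        return new File(URI.create(uriString));
    }

    public static void main(String[] args) {
        // Relative path with a blank in the directory name (assumed example).
        File source = new File("UIMA sdk", "IBM_LifeSciences.txt");
        String uri = toUriString(source);

        System.out.println("URI string:   " + uri);
        System.out.println("raw path:     " + URI.create(uri).getRawPath());
        System.out.println("decoded path: " + URI.create(uri).getPath());
        System.out.println("file name:    " + toFile(uri).getName());
    }
}
```

The round trip through toUriString and toFile returns the same absolute file, blank included, which is the property a consumer needs in order to derive the output name from the source location.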
Cheers. -Marshall
Re: Not seeing the document names in the Document Analyzer
Posted by Marshall Schor <ms...@schor.com>.
On 3/22/2011 12:25 PM, Marshall Schor wrote:
> Can you try installing into a path without a blank?
I tried this, and it also worked (with blanks in the file path) - so that's not
it...
I'll contact you off-list to debug this mystery. -Marshall
Re: Not seeing the document names in the Document Analyzer
Posted by Marshall Schor <ms...@schor.com>.
Here's an idea:
The names doc1.xmi, doc2.xmi, etc. are produced when the XMI CAS serializer is
called with a null file name:
uimaj-examples/src/main/java/org/apache/uima/examples/xmi/XmiWriterCasConsumer.java
lines 108-110:
if (outFile == null) {
  outFile = new File(mOutputDir, "doc" + mDocNum++ + ".xmi");
}
The code just above those lines has a try block that might be getting tripped
up because your install point is in a path with a blank in it.
Can you try installing into a path without a blank?
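To make the fallback pattern above concrete, here is a small Java sketch under stated assumptions. It is not the XmiWriterCasConsumer source: the class OutputNamer and the method outputFor are hypothetical names, and the parsing uses java.net.URI rather than whatever the real try block does. It only shows the general shape: try to derive the output name from the source location, and use "doc" + N + ".xmi" when that fails.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration of "derive the name, else fall back"; not UIMA code.
final class OutputNamer {
    private final File outputDir;
    private int docNum = 0; // counter for default names

    OutputNamer(File outputDir) {
        this.outputDir = outputDir;
    }

    // Returns <source file name>.xmi when the location parses as a URI,
    // otherwise doc<N>.xmi.
    File outputFor(String sourceUri) {
        File outFile = null;
        if (sourceUri != null) {
            try {
                // getPath() returns the decoded path, or null for opaque URIs.
                String path = new URI(sourceUri).getPath();
                String name = (path == null) ? "" : new File(path).getName();
                if (!name.isEmpty()) {
                    outFile = new File(outputDir, name + ".xmi");
                }
            } catch (URISyntaxException e) {
                // Unparseable location: fall through to the default name below.
            }
        }
        if (outFile == null) {
            outFile = new File(outputDir, "doc" + docNum++ + ".xmi");
        }
        return outFile;
    }
}
```

In this sketch a properly encoded location keeps its name, while a location string containing a raw blank is rejected by the URI parser and silently gets a default name. That silent fallback is why the only visible symptom is a list of doc0.xmi, doc1.xmi, and so on.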
-Marshall
Re: Not seeing the document names in the Document Analyzer
Posted by Bob Sizemore <si...@us.ibm.com>.
Anybody have any ideas for me to try to get the doc analyzer showing the right
document names?
Re: Not seeing the document names in the Document Analyzer
Posted by Bob Sizemore <si...@us.ibm.com>.
I am running on Windows XP. I have downloaded and installed the 2.3.1 UIMA SDK to
C:\Watson\UIMA sdk\apache-uima
I am using Eclipse SDK Version: 3.6.2 Build id: M20110210-1200
I used the update site and installed the UIMA 2.3.1 runtime and tools and
UIMA-AS Deployment Descriptor Editor 2.3.0.incubating
I added UIMA_HOME (C:/Watson/UIMA sdk/apache-uima/) to the classpath variables
under Java -> Build Path.
I am using the JRE in C:\Program Files\IBM\Java50, downloaded from IBM, and
added it as the JRE for Eclipse.
I followed the instructions in the UIMA setup to install the SDK and samples.
I set the JAVA_HOME and UIMA_HOME environment variables,
imported the examples into Eclipse, and ran the Document Analyzer from the
Run -> Run menu. I also tried running it using the bat file and got the same results.
Re: Not seeing the document names in the Document Analyzer
Posted by Marshall Schor <ms...@schor.com>.
I tried this on the current trunk, and it produced output which had the names of
the files, not doc0, etc.
I was running on Windows XP.
I ran against the data in the examples dir: src/main/data
I ran using the descriptor in the examples dir:
src/main/descriptors/tutorial/ex1/RoomNumberAnnotator.xml
Can you provide more information so we can try to reproduce this issue?
-Marshall