Posted to dev@nutch.apache.org by tryma <tr...@creuna.no> on 2006/10/09 09:02:49 UTC

Problem parsing some MS Excel & other formats (Office 2003)

Hi,

I initially thought there was an issue with POI so I posted my initial
question on the POI-user list.
Actually, now I see this is happening in the Nutch classes for the MS parse
plugin, not POI, so I'm giving this list a go.

Here's a trace I get when I catch any exception occurring as I attempt to
call the MSExcelParser's getParse(Content). It seems I get an NPE in
MSBaseParser.getParse().

[#|2006-10-04T09:13:15.102+0200|WARNING|sun-appserver-ee9.1|javax.enterprise.system.stream.err|_ThreadID=16;_ThreadName=httpWorkerThread-8080-1;_RequestID=0b18e2ae-0f79-4241-9e29-a322c8ae2bc6;|
java.lang.NullPointerException
	at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:94)
	at org.apache.nutch.parse.msexcel.MSExcelParser.getParse(MSExcelParser.java:40)
	at <my_package>.DocumentParser.parseDocument(DocumentParser.java:154)
	...

Looking at the source (MSBaseParser.java) at this line, it goes:

****SNIP****
      extractor.extract(new ByteArrayInputStream(raw));
      text = extractor.getText();
      properties = extractor.getProperties();
      outlinks = OutlinkExtractor.getOutlinks(text, content.getUrl(), getConf());

    } catch (Exception e) {
      return new ParseStatus(ParseStatus.FAILED,
                             "Can't be handled as micrsosoft document. " + e)
                             .getEmptyParse(this.conf);
    }

    // collect meta data
    Metadata metadata = new Metadata();
    title = properties.getProperty(DublinCore.TITLE);      <========== This is line 94 as indicated in the trace
    properties.remove(DublinCore.TITLE);
****SNIP****

So I can only gather that my properties object is null. As seen in the
snippet from the MSBaseParser source above, properties is initially null and
is then assigned from the ExcelExtractor (properties =
extractor.getProperties();), which I assume is returning null.
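
For illustration, a minimal sketch of the kind of null guard that would avoid
an NPE at this point. The Extractor interface, the safeTitle helper and the
"title" key are hypothetical stand-ins and not the actual Nutch classes:

```java
import java.util.Properties;

public class PropertiesGuard {

    /** Hypothetical stand-in for the extractor used in the snippet above. */
    interface Extractor {
        Properties getProperties(); // assumed to be able to return null
    }

    /** Returns the title property, or an empty string when none is available. */
    static String safeTitle(Extractor extractor) {
        Properties properties = extractor.getProperties();
        if (properties == null) {
            // Fall back to an empty set instead of dereferencing null.
            properties = new Properties();
        }
        return properties.getProperty("title", "");
    }

    public static void main(String[] args) {
        Extractor broken = () -> null; // simulates an extractor with no properties
        System.out.println("title=[" + safeTitle(broken) + "]");
    }
}
```

With a guard like this the parse would continue with empty metadata rather
than fail, which may or may not be the behaviour one wants.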

Any ideas how I can get around this, or whether I'm failing to set some
required properties?

Btw, I've noticed a spelling mistake in the ParseStatus message that is
returned in the above lines of code: "micrsosoft"


Thanks,
Trym
-- 
View this message in context: http://www.nabble.com/Problem-parsing-some-MS-Excel---other-formats-%28Office-2003%29-tf2408217.html#a6712543
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Problem parsing some MS Excel & other formats (Office 2003)

Posted by Aisha <ai...@yahoo.com>.
Hi Andrzej,

Thank you for your reply.

I am getting a lot of exceptions. Could you please have a look at them and
tell me whether there is a way to solve them:

  -  Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be
handled as micrsosoft document.
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record
instance, the following exception occured: null

  - Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't
be handled as micrsosoft document. java.util.NoSuchElementException
 
  - Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't
be handled as micrsosoft document. java.io.IOException: Invalid header
signature; read 7015536635646467195, expected -2226271756974174256

  - 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: dsp
	at java.net.URL.<init>(URL.java:574)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at
org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
	at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
	at
org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
	at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84)
	at
org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)


In the last error, the string after "unknown protocol: " is not always dsp;
it seems to be different in each case, and I don't understand what this
string means.
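
For reference, a small sketch of where that message comes from.
java.net.URL treats the text before the first colon as the protocol and
rejects it when no handler is registered for it. A plausible reading of the
trace, though only an assumption about the documents, is that the outlink
extractor picked up ordinary text of the form "word:something" as a URL
candidate. The class name and the sample strings are made up:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UnknownProtocolDemo {

    /** Returns null if the candidate parses as a URL, else the exception message. */
    static String tryParse(String candidate) {
        try {
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A scheme the JDK knows about parses without complaint.
        System.out.println(tryParse("http://example.com/"));
        // A made-up scheme is rejected; the message names the text before the colon.
        System.out.println(tryParse("dsp:some-text"));
    }
}
```

That would explain why the string changes from case to case: it is whatever
word happens to precede the colon in the extracted text.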

Thank you very much.

Best regards,
Aïcha 



Re: Problem parsing some MS Excel & other formats (Office 2003)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Aisha wrote:
> Hi,
>
> I tried the latest releases, nutch-2006-10-13.tar.gz and
> nutch-2006-10-19.tar.gz, but the NPE doesn't seem to be fixed. I still get
> the same exception message for a lot of documents in a lot of formats:
> Excel, but also Word and PowerPoint:
>
> 2006-10-19 16:41:09,265 WARN  parse.ParseUtil - Unable to successfully parse
> content file://C:/docs_a_indexer/test.doc of type application/msword
> 2006-10-19 16:41:09,265 WARN  fetcher.Fetcher - Error parsing:
> file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as Microsoft
> document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved files
> are unsupported at this time
>
> Could you please help me? The volume of rejected documents is large.
>   

The failure reason means that you can't parse these files using the
lib-parsems plugins, because they use a "fast save" format, which is not
supported.

Your only option is to use some other, external parser through the parse-ext
plugin.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Problem parsing some MS Excel & other formats (Office 2003)

Posted by Aisha <ai...@yahoo.com>.
Hi,

I tried the latest releases, nutch-2006-10-13.tar.gz and
nutch-2006-10-19.tar.gz, but the NPE doesn't seem to be fixed. I still get
the same exception message for a lot of documents in a lot of formats:
Excel, but also Word and PowerPoint:

2006-10-19 16:41:09,265 WARN  parse.ParseUtil - Unable to successfully parse
content file://C:/docs_a_indexer/test.doc of type application/msword
2006-10-19 16:41:09,265 WARN  fetcher.Fetcher - Error parsing:
file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as Microsoft
document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved files
are unsupported at this time

Could you please help me? The volume of rejected documents is large.

Thanks in advance,
best regards,
Aïcha





Re: Problem parsing some MS Excel & other formats (Office 2003)

Posted by Andrzej Bialecki <ab...@getopt.org>.
tryma wrote:
> Hi Andrzej,
>
> Great that you've fixed the NPE, thanks! No prob with the spelling mistake,
> just wasn't sure what you'd fixed when you quoted my last message. ;)
>
> How do I get hold of this change, get the nightly build and use that?
>
>   

Yes, or use 'svn update' if you checked out your sources from SVN.

-- 
Best regards,
Andrzej Bialecki     <><



Re: Problem parsing some MS Excel & other formats (Office 2003)

Posted by tryma <tr...@creuna.no>.
Hi Andrzej,

Great that you've fixed the NPE, thanks! No prob with the spelling mistake,
just wasn't sure what you'd fixed when you quoted my last message. ;)

How do I get hold of this change, get the nightly build and use that?


Cheers,
Trym





Re: Problem parsing some MS Excel & other formats (Office 2003)

Posted by Andrzej Bialecki <ab...@getopt.org>.
tryma wrote:
> Fixed the NPE issue too, or just the spelling mistake?
>
>   

NPE in the first place .. I can live with spelling mistakes in info 
messages. ;)

BTW, this NPE is an interesting thing. From my understanding of the code,
it can occur if the SummaryInformation stream cannot be read quickly enough
(within 2 sec) from the input document. This may happen if the parser is
stuck (at which point nothing will help), or if the file is large and the
information stream is not at the document's beginning; in that case,
increasing the TIMEOUT value may help.
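
To illustrate the mechanism described above, here is a minimal sketch, not
the actual Nutch code, of how a read bounded by a timeout can hand back null
once the deadline passes. The class TimedRead, the readWithTimeout helper and
the simulated slow reader are all hypothetical; only the idea of a time limit
that can be raised comes from the explanation above:

```java
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedRead {

    /** Runs reader on a worker thread; returns null if it does not finish in time. */
    static Properties readWithTimeout(Callable<Properties> reader, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Properties> future = pool.submit(reader);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the slow reader
            return null;                         // caller must be ready for null
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt status
            return null;
        } catch (ExecutionException e) {
            return null;                         // the reader itself failed
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Callable<Properties> slow = () -> {
            Thread.sleep(500);                   // simulates a large document
            Properties p = new Properties();
            p.setProperty("title", "Report");
            return p;
        };
        System.out.println(readWithTimeout(slow, 100));   // limit too short
        System.out.println(readWithTimeout(slow, 2000));  // limit long enough
    }
}
```

In this sketch the first call gives up before the reader finishes and the
second one does not, which mirrors the suggestion that a larger TIMEOUT can
help for big files but not for a parser that never returns.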

-- 
Best regards,
Andrzej Bialecki     <><



Re: Problem parsing some MS Excel & other formats (Office 2003)

Posted by tryma <tr...@creuna.no>.

Fixed the NPE issue too, or just the spelling mistake?


Best,

Trym




Re: Problem parsing some MS Excel & other formats (Office 2003)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Fixed - thanks for reporting it.

-- 
Best regards,
Andrzej Bialecki     <><