
Posted to user@nutch.apache.org by tryma <tr...@creuna.no> on 2006/10/04 09:38:12 UTC

Problem parsing some MS Excel documents (Office 2003)

Hi,

I first thought this was a POI issue, so I posted my original question on
the POI-user list.
I now see that the failure occurs in the Nutch classes for the MS parse
plugin, not in POI, so I'm giving this list a go.

Here's a trace I get when I catch any exception occurring as I attempt to
call the MSExcelParser's getParse(Content). It seems I get an NPE in
MSBaseParser.getParse().

[#|2006-10-04T09:13:15.102+0200|WARNING|sun-appserver-ee9.1|javax.enterprise.system.stream.err|_ThreadID=16;_ThreadName=httpWorkerThread-8080-1;_RequestID=0b18e2ae-0f79-4241-9e29-a322c8ae2bc6;|
java.lang.NullPointerException
	at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:94)
	at org.apache.nutch.parse.msexcel.MSExcelParser.getParse(MSExcelParser.java:40)
	at <my_package>.DocumentParser.parseDocument(DocumentParser.java:154)
	...

Looking at the source (MSBaseParser.java) at this line, it goes:

****SNIP****
      extractor.extract(new ByteArrayInputStream(raw));
      text = extractor.getText();
      properties = extractor.getProperties();
      outlinks = OutlinkExtractor.getOutlinks(text, content.getUrl(), getConf());

    } catch (Exception e) {
      return new ParseStatus(ParseStatus.FAILED,
                             "Can't be handled as micrsosoft document. " + e)
                             .getEmptyParse(this.conf);
    }

    // collect meta data
    Metadata metadata = new Metadata();
    title = properties.getProperty(DublinCore.TITLE);  // <== line 94, as indicated in the trace
    properties.remove(DublinCore.TITLE);
****SNIP****

So I can only conclude that my properties object is null. As the snippet
from the MSBaseParser source shows, properties starts out null and is then
assigned from the ExcelExtractor (properties = extractor.getProperties();),
which I assume is returning null for these documents.
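
To illustrate the reasoning above, here is a minimal, self-contained sketch of
how a null returned inside a try block can slip past the catch and only fail
at the first dereference afterwards. The Extractor interface, the describe
method and the "title" key are hypothetical stand-ins, not the Nutch or POI
classes.

```java
import java.util.Properties;

public class NullPropertiesDemo {

    // Hypothetical stand-in for the extractor; NOT the Nutch ExcelExtractor.
    interface Extractor {
        Properties getProperties();
    }

    // Same shape as the quoted snippet: the assignment sits inside the try
    // block, but the first dereference happens after it.
    static String describe(Extractor extractor) {
        Properties properties = null;
        try {
            // Assumed here to be able to return null without throwing.
            properties = extractor.getProperties();
        } catch (Exception e) {
            return "FAILED: " + e;
        }
        try {
            // "title" is a placeholder key, not necessarily DublinCore.TITLE.
            return "title=" + properties.getProperty("title");
        } catch (NullPointerException npe) {
            // Caught only to make the failure visible in this demo.
            return "NPE outside the original try block";
        }
    }

    public static void main(String[] args) {
        // An extractor that returns null: no exception is thrown in the
        // first try block, so the catch there never runs.
        System.out.println(describe(() -> null));

        // An extractor that returns real properties works as expected.
        Properties p = new Properties();
        p.setProperty("title", "Budget 2006");
        System.out.println(describe(() -> p));
    }
}
```

No exception is thrown while the value is assigned, so the first catch never
runs, which matches a trace that points at the line after the try/catch.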

Any ideas how I can get around this, or whether I'm failing to set some
required properties?

Btw, I've noticed a spelling mistake in the ParseStatus message returned in
the above lines of code: "micrsosoft".


Thanks,
Trym
-- 
View this message in context: http://www.nabble.com/Problem-parsing-some-MS-Excel-documents-%28Office-2003%29-tf2380851.html#a6635140
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Problem parsing some MS Excel documents (Office 2003)

Posted by tryma <tr...@creuna.no>.

Any suggestions, or should I maybe post this on the Nutch-dev list too?

It seems a bit strange to me that MSBaseParser.java leaves open the
possibility that the properties object is null, which can then give rise
to an NPE at the call:

    title = properties.getProperty(DublinCore.TITLE);
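
One possible way to guard such a call, as a rough sketch only: fall back to an
empty Properties object when none was supplied. The SafeProperties class, its
methods and the TITLE_KEY constant are hypothetical; this is not the actual
MSBaseParser code or a proposed patch.

```java
import java.util.Properties;

public class SafeProperties {

    // Hypothetical placeholder key; the real DublinCore.TITLE value is not assumed.
    static final String TITLE_KEY = "title";

    // Returns the given properties, or an empty instance if it is null.
    static Properties orEmpty(Properties maybeNull) {
        return maybeNull != null ? maybeNull : new Properties();
    }

    // Reads the title and removes it, like the two quoted lines, but
    // tolerates a null argument by treating it as "no properties".
    static String takeTitle(Properties maybeNull) {
        Properties props = orEmpty(maybeNull);
        String title = props.getProperty(TITLE_KEY); // null when absent
        props.remove(TITLE_KEY);
        return title;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty(TITLE_KEY, "Budget 2006");
        System.out.println(takeTitle(p));    // title is returned and removed
        System.out.println(takeTitle(null)); // no NPE, just a missing title
    }
}
```

With a guard like this, a document without properties would yield a missing
title rather than an exception.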

Comments?


Thanks,
Trym

