Posted to user@nutch.apache.org by Joe Bell <jo...@prodeasystems.com> on 2009/12/06 19:24:14 UTC

Nutch 1.0 ms-powerpoint plugin

Hi - this is my first post to the Nutch mailing list, so please let me know
if I break any list etiquette.

I'm currently using Nutch 1.0 with the PowerPoint plugin enabled, and I can
verify that Nutch is pulling in the entire file to pass to the parser
(I've set the content limit to -1 to get the full file).  However, most
PowerPoint files of any complexity (those that use a template or contain
tables, images, etc.) do not get indexed at all.  In one test, I created a
new file with a single "title" slide: the title text was recognized, but
the subtitle text directly underneath it was not.
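
For illustration, a setup like the one described might look roughly like
the fragment below in conf/nutch-site.xml. This is only a sketch under
assumptions: the property names are the stock Nutch 1.0 ones as far as I
know, the plugin id parse-mspowerpoint is assumed to be the PowerPoint
parser in question, and the plugin.includes value is a shortened example
rather than the configuration actually used in this crawl.

```xml
<configuration>
  <!-- Assumed: -1 disables truncation of fetched content over HTTP.
       A crawl of local files would use file.content.limit instead. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

  <!-- Assumed example value: the important part is that the PowerPoint
       parser (parse-mspowerpoint) appears in the parse-(...) group. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|mspowerpoint)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

With settings like these the parser should receive the whole file, so
text that still goes missing would point at the parsing step rather than
at truncation during the fetch.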

 

My question is whether I'm missing something that has already been
covered (for example, http://issues.apache.org/jira/browse/NUTCH-463,
though I don't see anything in my crawl logs indicating a problem), or
whether this is a known defect in the existing PowerPoint plugin.  I'd
very much like to be able to index PowerPoint slides completely, as this
is going to be the most common document type on my site.

 

Thanks,

Joe

 



