Posted to user@nutch.apache.org by Lucifersam <ro...@tagish.co.uk> on 2007/03/12 14:56:42 UTC

nutch-0.8.1 - PDF Fragment problem

Hi,

I am having a problem with certain PDF files and the fragment that is
returned when a search is run. This seems to happen when the PDF has
little or no text (just images).

For example, the following was the result of a search for "Insulation":

 ... Map 8 Noise Exclusion & Insulation Zones - DP47
C78C111C105C115C101C32C69C120C99C108C117C115C105C111C110C32C97C110C100C32C73C110C115C117C108C97C116C105C111C110C32C90C111C110C101C32C45C32C82C65C70C32C76C101C101C109C105C110C103
8 3 ... 

The long character string is causing layout issues on my site, and I would
like to simply remove this. Is there an easy way to do this via XSL, or a
way to prevent it being indexed in the first place?
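
To illustrate the kind of clean-up I have in mind, here is a minimal sketch
in Java. The run looks like decimal character codes, each prefixed with "C"
(C78 C111 C105 C115 C101 would be "Noise"), so a regular expression can spot
it. The class name FragmentCleaner, the method clean, and the threshold of
eight consecutive codes are all my own assumptions and not part of Nutch or
PDFBox; where such a filter would hook in (search page or parse step) is
also left open.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or PDFBox.
public class FragmentCleaner {

    // Matches a run of at least eight "C<1-3 digits>" tokens in a row.
    // The threshold of eight is an assumed value, chosen so that short
    // legitimate tokens such as "C3" are left alone.
    private static final Pattern ENCODED_RUN =
            Pattern.compile("(?:C\\d{1,3}){8,}");

    // Removes encoded runs from a fragment and tidies the whitespace
    // left behind.
    public static String clean(String fragment) {
        String stripped = ENCODED_RUN.matcher(fragment).replaceAll(" ");
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String sample = "Zones - DP47 C78C111C105C115C101C32C69C120C99 8 3";
        System.out.println(clean(sample));
    }
}
```

Something like this would only hide the symptom in the displayed fragment;
the encoded text would still be in the index.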

Many thanks,

Ross

FYI - I am using nutch-0.8.1 and have updated the code to use PDFBox-0.7.3
in the hope that it would fix this, but I get the same results.
-- 
View this message in context: http://www.nabble.com/nutch-0.8.1---PDF-Fragment-problem-tf3389595.html#a9434973
Sent from the Nutch - User mailing list archive at Nabble.com.