Posted to user@nutch.apache.org by dpverma <pa...@gmail.com> on 2012/09/11 01:20:26 UTC

Re: nutch crawling file system SOLVED

Can you please let me know how you solved your problem?
I am getting the same error that you had:
the index contains the PDF file names but not the content of those files.




--
View this message in context: http://lucene.472066.n3.nabble.com/Re-nutch-crawling-file-system-SOLVED-tp3815336p4006754.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: nutch crawling file system SOLVED

Posted by dpverma <pa...@gmail.com>.

The following are the steps I have done so far:
1. Download the two missing libraries from:
  http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

I downloaded the additional JARs from the URL in step 1, but instead of
putting them in the "src/plugin/parse-pdf/lib" folder, I put them in the
"plugins" folder. I modified the plugin.xml in the same folder as per the
instructions in it. Then I enabled the 'parse-pdf' plugin in the
'nutch-site.xml' file as shown below (I just added 'parse-pdf' at the end).
I did not think I needed to rebuild Nutch.
 
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
</property>

Then I got the following error:
WARN  parse.Parser - Error parsing:
http://viterbi.usc.edu/aviation/assets/002/74092.pdf: failed(2,202): Content
truncated at 66251 bytes. Parser can't handle incomplete pdf file.

Solution: I changed file.content.limit and http.content.limit to -1
in both nutch-default.xml and nutch-site.xml.
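
For reference, an override like the one described above might look as follows
in nutch-site.xml. This is only a sketch: the two property names are taken from
the message, and a negative value is assumed to disable truncation, as the
property descriptions in nutch-default.xml state. Values set in nutch-site.xml
normally take precedence, so nutch-default.xml itself can usually be left
untouched.

```xml
<!-- Sketch only: these go inside the <configuration> element of nutch-site.xml -->
<property>
  <name>file.content.limit</name>
  <!-- negative value: assumed to mean "do not truncate" -->
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```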

Then I fixed the following error:
The parsing plugins: [org.apache.nutch.parse.pdf.PdfParser] are enabled via
the plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it  in the
parse-plugins.xml file

Solution: in parse-plugins.xml under the Nutch config, I uncommented the PDF
section.
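
As a rough illustration, the uncommented mapping in parse-plugins.xml probably
resembles the fragment below. The content type and the parser class name come
from the error message above; the element layout and the "parse-pdf" alias are
assumptions based on the Nutch 1.x file format and may differ between versions,
so compare against the file shipped with your release.

```xml
<!-- Sketch only: map the PDF content type to the parse-pdf plugin -->
<mimeType name="application/pdf">
  <plugin id="parse-pdf" />
</mimeType>

<!-- Assumed alias entry (inside the <aliases> element) tying the plugin id to the parser class -->
<alias name="parse-pdf"
       extension-id="org.apache.nutch.parse.pdf.PdfParser" />
```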

After doing this, I see no errors or warnings in the log file.
But there is still no text in the content section.

I have given a direct link to the PDF file in regex-urlfilter.txt:
+^http://([a-z0-9\-A-Z]*\.)*viterbi.usc.edu/aviation/assets/002/79884.pdf([a-z0-9\-A-Z]*\/)*

The only thing I have not done is rebuild Nutch. Is that the reason no
text is getting extracted from the PDF?
If rebuilding Nutch is a crucial step, can you please guide me as to how to do
it?

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/Re-nutch-crawling-file-system-SOLVED-tp3815336p4007024.html

Re: nutch crawling file system SOLVED

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi,

Take a look at the ftp.content.limit property in nutch-default.xml and set
it accordingly in nutch-site.xml.
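
A sketch of what such an override might look like; the value shown is an
assumption (negative meaning no truncation, per the property description in
nutch-default.xml), and the property only matters when content is fetched
over FTP:

```xml
<!-- Sketch only: place inside the <configuration> element of nutch-site.xml -->
<property>
  <name>ftp.content.limit</name>
  <!-- assumed value; choose a byte limit or a negative number for no limit -->
  <value>-1</value>
</property>
```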

Thanks

Lewis

On Tue, Sep 11, 2012 at 12:20 AM, dpverma <pa...@gmail.com> wrote:
> Can you pls let me know how you solved your problem?
> I am also getting the same error which you had.
> Getting the index with pdf's file name but not the content in those



-- 
Lewis