Posted to user@nutch.apache.org by dpverma <pa...@gmail.com> on 2012/09/11 01:20:26 UTC
Re: nutch crawling file system SOLVED
Can you please let me know how you solved your problem?
I am getting the same error you had:
the index contains the PDF file names but not the content of those files.
--
View this message in context: http://lucene.472066.n3.nabble.com/Re-nutch-crawling-file-system-SOLVED-tp3815336p4006754.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: nutch crawling file system SOLVED
Posted by dpverma <pa...@gmail.com>.
These are the steps I have taken so far:
1. Downloaded the two missing libraries from:
http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/
I downloaded the additional JARs from the URL in step 1, but instead of
putting them in the "src/plugin/parse-pdf/lib" folder, I put them in the
"plugins" folder. I modified the plugin.xml in the same folder as per the
instructions in it. Then I enabled the 'parse-pdf' plugin in the
'nutch-site.xml' file as shown below (I just added 'parse-pdf' at the end).
I did not think I needed to rebuild Nutch.
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
</property>
Then I got the following error:
WARN parse.Parser - Error parsing:
http://viterbi.usc.edu/aviation/assets/002/74092.pdf: failed(2,202): Content
truncated at 66251 bytes. Parser can't handle incomplete pdf file.
Solution: I changed file.content.limit and http.content.limit to -1
in both nutch-default.xml and nutch-site.xml.
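A minimal sketch of that override, assuming the stock property names http.content.limit and file.content.limit from nutch-default.xml. Values in nutch-site.xml take precedence over nutch-default.xml, so it should be enough to set them there and leave nutch-default.xml untouched. A negative value disables truncation, so very large files will be fetched in full.

```xml
<!-- conf/nutch-site.xml: goes inside the existing <configuration> element -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
```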
Then I fixed the following error:
The parsing plugins: [org.apache.nutch.parse.pdf.PdfParser] are enabled via
the plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it in the
parse-plugins.xml file
Solution: in parse-plugins.xml, in the Nutch conf directory, I uncommented
the PDF section.
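For reference, the mapping that this error asks for usually looks something like the fragment below. This is a sketch based on the layout of the stock parse-plugins.xml; the exact entries may differ between Nutch versions, so compare it against the commented-out section in your own copy. The extension-id is the class name reported in the error message above.

```xml
<!-- conf/parse-plugins.xml: inside the <parse-plugins> element -->
<mimeType name="application/pdf">
  <plugin id="parse-pdf" />
</mimeType>

<!-- inside the <aliases> element of the same file -->
<alias name="parse-pdf"
       extension-id="org.apache.nutch.parse.pdf.PdfParser" />
```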
After doing this, I see no errors or warnings in the log file.
But there is still no text in the content section.
I have given a direct link to the PDF file in regex-urlfilter.txt:
+^http://([a-z0-9\-A-Z]*\.)*viterbi.usc.edu/aviation/assets/002/79884.pdf([a-z0-9\-A-Z]*\/)*
The only thing I have not done is rebuild Nutch. Is that the reason no
text is getting extracted from the PDF?
If rebuilding Nutch is a crucial step, can you please guide me on how to do
it?
Thanks
Re: nutch crawling file system SOLVED
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
Take a look at the ftp.content.limit property in nutch-default.xml and set
it accordingly in nutch-site.xml.
Thanks
Lewis
On Tue, Sep 11, 2012 at 12:20 AM, dpverma <pa...@gmail.com> wrote:
> Can you pls let me know how you solved your problem?
> I am also getting the same error which you had.
> Getting the index with pdf's file name but not the content in those
--
Lewis