Posted to user@nutch.apache.org by reddibabu <re...@gmail.com> on 2014/03/21 09:52:25 UTC
Unable to crawl and index pdf metadata into Solr from Nutch
Hi,
I am using Nutch 1.7 and Solr 4.5
I am able to crawl a PDF with Nutch, and it displays some metadata
on the terminal when I run "bin/nutch indexchecker
http://www.master.netseven.it/files/262-Nutch.pdf". But I am not able to
index the same PDF's details into Solr.
I get the message "INFO:parse.ParseSegment -
http://master.netseven.it/files/262-Nutch.pdf skipped. Content of size
371452 was truncated to 62630" on the terminal.
Is there a size limit for PDF content, and how can I set it to unlimited
(-1)?
Could anyone please help me with this?
Thanks in advance.
--
View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-crawl-and-index-pdf-metadata-into-Solr-from-Nutch-tp4125941.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Unable to crawl and index pdf metadata into Solr from Nutch
Posted by remi tassing <ta...@gmail.com>.
Hi,
Modify the default value of http.content.limit and/or ftp.content.limit
accordingly.
This problem has nothing to do with the format, only with the content size.
Remi
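For reference, a minimal sketch of what such an override might look like in conf/nutch-site.xml, assuming the stock Nutch 1.x property names mentioned above. In the stock nutch-default.xml a negative value means no truncation at all; the ftp property should only matter when crawling ftp:// URLs. The file location and the choice of -1 are assumptions based on the question, not something confirmed in this thread.

```xml
<!-- conf/nutch-site.xml: local overrides for values in nutch-default.xml -->
<configuration>
  <!-- Assumed setting: a negative limit disables truncation of HTTP content -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Same idea for content fetched over FTP -->
  <property>
    <name>ftp.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

The already-fetched segment probably still holds the truncated copy, so the URL would likely need to be fetched again before parsing and indexing into Solr.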