Posted to user@nutch.apache.org by reddibabu <re...@gmail.com> on 2014/03/21 09:52:25 UTC

Unable to crawl and index pdf metadata into Solr from Nutch

Hi,

I am using Nutch 1.7 and Solr 4.5.

I am able to crawl a PDF with Nutch, and some of its metadata is displayed
on the terminal when I run "bin/nutch indexchecker
http://www.master.netseven.it/files/262-Nutch.pdf". However, I am not able
to index the same PDF's details into Solr.

I get the following message on the terminal: "INFO:parse.ParseSegment -
http://master.netseven.it/files/262-Nutch.pdf skipped. Content of size
371452 was truncated to 62630"

Is there a size limit for PDF content? If so, how can I set it to
unlimited (-1)?

Could anyone please help me with this?


Thanks in advance.




--
View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-crawl-and-index-pdf-metadata-into-Solr-from-Nutch-tp4125941.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Unable to crawl and index pdf metadata into Solr from Nutch

Posted by remi tassing <ta...@gmail.com>.

Hi,

Change http.content.limit and/or ftp.content.limit from their default
values to suit your content.
This problem has nothing to do with the file format; it is caused by the
content size.
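
For illustration, here is a minimal sketch of such an override. It assumes
the properties go inside the <configuration> element of conf/nutch-site.xml,
the usual place for local overrides of nutch-default.xml. The value -1 is
taken from the question above; as far as I know, a negative value disables
truncation, but please check the property descriptions in nutch-default.xml
for your version. A larger positive byte count should work as well.

```xml
<!-- Sketch only: place inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- Negative value: do not truncate content fetched over HTTP -->
  <value>-1</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <!-- Only relevant if you also crawl over FTP -->
  <value>-1</value>
</property>
```

Content that was already fetched was stored truncated, so the affected URLs
will most likely need to be fetched again before parsing and indexing into
Solr.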

Remi

