Posted to user@nutch.apache.org by A Laxmi <a....@gmail.com> on 2016/05/06 01:59:34 UTC

Nutch 1.x crawl Zip file URLs

Hi,

(a) Is it possible to crawl the URL of a Zip file using Nutch and index it
in Solr? (Please see the example below.)

(b) Also, if the Zip file at that URL contains PDF files, is it possible to
use Nutch to crawl the Zip file URL and also the PDF files inside the Zip
file?


E.g.
https://www.abc123.xxx/sites/docs/testing.zip

When I unzip the file at the above URL, I get the following:


def.pdf
lmn.pdf
reg.pdf
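
To make the desired outcome concrete, here is a minimal Python sketch of the archive layout described above. It is not Nutch code and says nothing about how Nutch itself handles Zip files. It only shows the result I am hoping for: the individual PDF entries found inside the Zip. The function names list_pdf_entries and build_sample_zip are hypothetical, and the archive is built in memory with placeholder contents as a stand-in for testing.zip.

```python
import io
import zipfile


def list_pdf_entries(zip_bytes):
    """Return the sorted names of PDF entries inside a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".pdf")
        )


def build_sample_zip():
    """Build a stand-in for testing.zip (assumption: placeholder PDF contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("def.pdf", "lmn.pdf", "reg.pdf"):
            archive.writestr(name, b"%PDF-1.4 placeholder")
    return buffer.getvalue()


if __name__ == "__main__":
    # Lists the PDF entries of the in-memory sample archive.
    print(list_pdf_entries(build_sample_zip()))
```

Ideally each of these entries would end up as its own searchable document in Solr, alongside or instead of the Zip file itself.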


Please advise.

Thanks!

AL