Posted to user@nutch.apache.org by Jason Manfield <ra...@yahoo.com> on 2005/05/18 20:40:20 UTC

crawling PDF file with page links?

Can Nutch (with its out-of-the-box PDFBox plugin) crawl PDF files where each page is a link (e.g. the URL appends &PGN=pageNumber to go to a specific page)? In the browser, each page of the PDF file is loaded on demand. However, when the content is fetched from the URL programmatically, it looks like not all of the pages are fetched. Even when the PDF is saved from the browser (with Save As), not all pages are saved: Acrobat Reader can open only one page and gives errors (cannot find link) for the other pages. Examining the PDF file with Notepad, I did find some tags like GoToR for each page, indicating the destination (in binary form, though) for the page.
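
The manual check described above (looking for GoToR entries in the raw file) can be sketched as a short script. This is only an illustration, not a Nutch or PDFBox feature: the function name summarize_pdf_bytes is hypothetical, and it assumes the actions and page objects are stored uncompressed, as they must be if they are visible in a text editor. A file with one page object and many GoToR actions would be consistent with the remaining pages living in separate remote documents.

```python
import re

# PDF name token /GoToR (remote go-to action); the lookahead avoids
# matching a longer name that merely starts with "GoToR".
GOTOR = re.compile(rb"/GoToR(?![A-Za-z0-9])")

# "/Type /Page" marks a page object; the lookahead excludes "/Type /Pages",
# which is the page-tree node rather than an actual page.
PAGE = re.compile(rb"/Type\s*/Page(?![A-Za-z0-9])")


def summarize_pdf_bytes(data: bytes) -> dict:
    """Count uncompressed /GoToR actions and page objects in raw PDF bytes.

    Hypothetical helper for illustration only. Entries stored inside
    compressed object streams are not visible to this scan.
    """
    return {
        "gotor_actions": len(GOTOR.findall(data)),
        "page_objects": len(PAGE.findall(data)),
    }
```

To try it on the saved file, read the file in binary mode and pass the bytes to summarize_pdf_bytes.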
 
Any idea on how to extract everything from the PDF?
 
Thanks
 
Jason

		