Posted to solr-user@lucene.apache.org by Glen Newton <gl...@gmail.com> on 2009/07/08 17:37:17 UTC

Re: Indexing rich documents from websites using ExtractingRequestHandler

Try putting all the PDF URLs into a file, downloading them with something like
'wget', and then indexing the downloaded files locally.
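
For example, here is a minimal sketch of that workflow in Python rather than
wget itself. It assumes the 'requests' library, a Solr instance at
localhost:8983 with the ExtractingRequestHandler at its default
/update/extract path, and a placeholder URL-list file called pdf_urls.txt;
adjust the names for your setup.

import requests

SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"  # adjust host/core

# One PDF URL per line, collected from the site beforehand.
with open("pdf_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    # The 'wget' step: fetch the PDF over HTTP.
    pdf_bytes = requests.get(url).content

    # The local indexing step: stream the bytes to Solr Cell with a unique id.
    resp = requests.post(
        SOLR_EXTRACT,
        params={"literal.id": f"pdf-{i}", "commit": "true"},
        files={"file": (url.rsplit("/", 1)[-1], pdf_bytes, "application/pdf")},
    )
    resp.raise_for_status()

Committing after every document is only for illustration; in practice you
would send commit=true once, after the last file.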

Glen Newton
http://zzzoot.blogspot.com/

2009/7/8 ahammad <ah...@gmail.com>:
>
> Hello,
>
> I can index rich documents, such as PDFs, that are on the filesystem. Can I
> use ExtractingRequestHandler to index files that are accessible on a website?
>
> For example, there is a file that can be reached like so:
> http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf
>
> How would I go about indexing that file? I tried the following combinations;
> the resulting errors are shown in parentheses:
>
> stream.file=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The
> filename, directory name, or volume label syntax is incorrect)
> stream.file=www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The system
> cannot find the path specified)
> stream.file=//www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The format of
> the specified network name is invalid)
> stream.file=sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot
> find the path specified)
> stream.file=//sub.myDomain.com/files/pdfdocs/testfile.pdf (The network path
> was not found)
>
> I sort of understand why I get those errors. What are the alternative ways
> of doing this? I am guessing that the stream.file parameter doesn't support
> web addresses. Is there another parameter that does?
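
For comparison: stream.file expects a path on the Solr server's local
filesystem rather than an HTTP URL, which is why every URL-shaped value above
fails. A minimal sketch of the form it does accept, again in Python with a
placeholder endpoint, document id, and /data/pdfdocs path (this also assumes
remote streaming is already enabled in solrconfig.xml, as the filesystem
errors above suggest it is):

import requests

# stream.file is resolved by the Solr server process, so the path below must
# exist on the machine running Solr, not on the client sending the request.
resp = requests.post(
    "http://localhost:8983/solr/update/extract",
    params={
        "stream.file": "/data/pdfdocs/testfile.pdf",  # hypothetical local copy
        "literal.id": "testfile.pdf",
        "commit": "true",
    },
)
resp.raise_for_status()

In other words, fetch the remote files first (with wget or the loop above) and
hand Solr local paths or the raw bytes.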


