Posted to solr-user@lucene.apache.org by ahammad <ah...@gmail.com> on 2009/07/08 16:40:27 UTC
Indexing rich documents from websites using ExtractingRequestHandler
Hello,
I can index rich documents, such as PDFs, that are on the filesystem.
Can ExtractingRequestHandler also be used to index files that are accessible
on a website?
For example, there is a file that can be reached like so:
http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf
How would I go about indexing that file? I tried the following
combinations. The error for each is in parentheses:
stream.file=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The
filename, directory name, or volume label syntax is incorrect)
stream.file=www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The system
cannot find the path specified)
stream.file=//www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The format of
the specified network name is invalid)
stream.file=sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot
find the path specified)
stream.file=//sub.myDomain.com/files/pdfdocs/testfile.pdf (The network path
was not found)
I sort of understand why I get those errors. What are the alternative
methods of doing this? I am guessing that the stream.file parameter doesn't
support web addresses. Is there another parameter that does?
Re: Indexing rich documents from websites using ExtractingRequestHandler
Posted by Jay Hill <ja...@gmail.com>.
I haven't tried this myself, but it sounds like what you're looking for is
enabling remote streaming:
http://wiki.apache.org/solr/ContentStream#head-7179a128a2fdd5dde6b1af553ed41735402aadbf
As the link above shows, you should be able to enable remote streaming like
this:
  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
and then something like this might work:
  stream.url=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf
So you use stream.url instead of stream.file.
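As a rough, untested sketch of what such a request could look like, here is a small Python helper that builds the URL. The Solr base address (http://localhost:8983/solr), the /update/extract handler path, the literal.id value, and the function name build_extract_url are all assumptions for illustration; adjust them to match your own solrconfig.xml. The point is mainly that the document URL has to be percent-encoded when it is passed as the stream.url value.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_url, doc_id):
    """Build a request URL that asks Solr to fetch doc_url via stream.url.

    solr_base and the /update/extract path are assumptions; use whatever
    path your ExtractingRequestHandler is registered under.
    """
    params = {
        "stream.url": doc_url,   # remote document for Solr to fetch
        "literal.id": doc_id,    # hypothetical unique key for the document
        "commit": "true",
    }
    # urlencode percent-encodes the ':' and '/' characters inside doc_url
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


# Example (assumed local Solr instance):
# build_extract_url("http://localhost:8983/solr",
#                   "http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf",
#                   "testfile")
```

The resulting URL can then be requested with curl or a browser. Remote streaming has to be enabled first, otherwise Solr should reject the stream.url parameter.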
Hope this helps.
-Jay
Re: Indexing rich documents from websites using ExtractingRequestHandler
Posted by Glen Newton <gl...@gmail.com>.
Try putting all the PDF URLs into a file, downloading them with something
like 'wget', and then indexing them locally.
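If wget is not available, the same idea can be sketched in Python. This is only an illustration under assumptions: the helper names (parse_url_list, local_path_for, download_all) and the one-URL-per-line list format are made up for this example and are not part of Solr or wget.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlsplit, unquote


def parse_url_list(text):
    """Return the URLs in text, one per line, skipping blanks and # comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def local_path_for(url, dest_dir):
    """Map a URL to a file path under dest_dir, using the last path segment."""
    name = os.path.basename(unquote(urlsplit(url).path)) or "index.html"
    return os.path.join(dest_dir, name)


def download_all(urls, dest_dir):
    """Download every URL into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in urls:
        path = local_path_for(url, dest_dir)
        # Stream the response body straight to disk
        with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
            shutil.copyfileobj(resp, out)
        paths.append(path)
    return paths
```

Each returned path could then be passed to Solr with stream.file, as with any other file on the filesystem. Note that two URLs ending in the same file name would overwrite each other in this sketch.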
Glen Newton
http://zzzoot.blogspot.com/