
Posted to solr-user@lucene.apache.org by ahammad <ah...@gmail.com> on 2009/07/08 16:40:27 UTC

Indexing rich documents from websites using ExtractingRequestHandler

Hello,

I can index rich documents, such as PDFs, that are on the filesystem.
Can we use ExtractingRequestHandler to index files that are accessible on a
website?

For example, there is a file that can be reached like so:
http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf

How would I go about indexing that file? I tried using the following
combinations. I will put the errors in brackets:

stream.file=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The
filename, directory name, or volume label syntax is incorrect)
stream.file=www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The system
cannot find the path specified)
stream.file=//www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The format of
the specified network name is invalid)
stream.file=sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot
find the path specified)
stream.file=//sub.myDomain.com/files/pdfdocs/testfile.pdf (The network path
was not found)

I sort of understand why I get those errors. What are the alternative
methods of doing this? I am guessing that the stream.file attribute doesn't
support web addresses. Is there another attribute that does?
-- 
View this message in context: http://www.nabble.com/Indexing--rich-documents-from-websites-using-ExtractingRequestHandler-tp24392809p24392809.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing rich documents from websites using ExtractingRequestHandler

Posted by Jay Hill <ja...@gmail.com>.

I haven't tried this myself, but it sounds like what you're looking for is
enabling remote streaming:
http://wiki.apache.org/solr/ContentStream#head-7179a128a2fdd5dde6b1af553ed41735402aadbf

As the link above shows, you should be able to enable remote streaming like
this:

  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

and then something like this might work:

  stream.url=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf

So you use stream.url instead of stream.file.
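
For illustration, a request that uses stream.url could be assembled as in the
Python sketch below. It assumes that Solr is reachable at
http://localhost:8983/solr, that the ExtractingRequestHandler is mapped to
/update/extract, and that the unique key field is called id. None of these are
stated in this thread, so adjust them to your setup. Remote streaming has to
be enabled for stream.url to be honored.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_url, doc_id):
    """Build an extract request URL that asks Solr to fetch doc_url itself.

    solr_base and the /update/extract handler path are assumptions about
    the local installation; they are not taken from the thread.
    """
    params = {
        "stream.url": doc_url,   # remote document for Solr to stream
        "literal.id": doc_id,    # assumed unique key field named "id"
        "commit": "true",        # commit right after indexing
    }
    # urlencode percent-encodes the document URL so it survives as one value.
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


url = build_extract_url(
    "http://localhost:8983/solr",  # hypothetical Solr base URL
    "http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf",
    "testfile",
)
print(url)
```

Requesting the printed URL (from a browser, curl, or urllib.request.urlopen)
should make Solr download the PDF and pass it to the extracting handler,
provided the Solr host can reach the website.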

Hope this helps.

-Jay



Re: Indexing rich documents from websites using ExtractingRequestHandler

Posted by Glen Newton <gl...@gmail.com>.

Try putting all the PDF URLs into a file, downloading them with something
like 'wget', and then indexing them locally.
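
As a rough sketch of the same idea without wget, the Python below reads a list
of URLs, works out a local file name for each, and downloads them. The helper
names and the one-URL-per-line list format are assumptions made for this
example, not something prescribed in this thread.

```python
import os
import urllib.request
from urllib.parse import urlparse


def read_url_list(text):
    """Return the URLs in text, skipping blank lines and '#' comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def local_path_for(url, target_dir):
    """Map a URL to a path in target_dir using its last path segment."""
    name = os.path.basename(urlparse(url).path) or "index.html"
    return os.path.join(target_dir, name)


def download_all(urls, target_dir):
    """Download every URL into target_dir and return the local paths."""
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for url in urls:
        path = local_path_for(url, target_dir)
        urllib.request.urlretrieve(url, path)  # performs the HTTP fetch
        paths.append(path)
    return paths
```

Calling download_all(read_url_list(open("urls.txt").read()), "pdfs") with a
hypothetical urls.txt would leave the files in a local pdfs directory, where
each one can be indexed through stream.file like any other file on the
filesystem.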

Glen Newton
http://zzzoot.blogspot.com/
