Posted to solr-user@lucene.apache.org by Ross <te...@gmail.com> on 2009/12/30 03:19:55 UTC

Solr Cell - PDFs plus literal metadata - GET or POST ?

Hi all

I'm experimenting with Solr. I've successfully indexed some PDFs and
all looks good, but now I want to index some PDFs with metadata pulled
from another source. I see this example in the docs:

curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" \
  -F "tutorial=@tutorial.pdf"

I can write code to generate a script with those commands, substituting
my own literal.whatever. My metadata could be up to a couple of KB in
size. Is there a way of sending the literal as a POST variable rather
than a GET parameter? Will Solr Cell accept it as a POST? Something
doesn't feel right about generating a huge long URL. I think Tomcat can
handle up to 8 KB by default, so I guess that's okay, although I'm not
sure how long a Linux command line can reasonably be.
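
On the command-line length question, a rough check is possible from the shell. The sketch below assumes a Linux system with getconf available; the 2 KB value is a stand-in for real metadata, not data from this thread. On typical Linux kernels a single argument is also capped at about 128 KB, so a URL of a few KB is far below the shell's limits and the server-side limit is the tighter one.

```shell
# Total bytes allowed for command-line arguments plus environment on this system.
arg_max=$(getconf ARG_MAX)

# Stand-in 2 KB literal value (hypothetical), used only to measure the URL length.
meta=$(head -c 2048 /dev/zero | tr '\0' 'x')
url="http://localhost:8983/solr/update/extract?literal.id=doc4&literal.blah_s=${meta}"

echo "ARG_MAX:    ${arg_max} bytes"
echo "URL length: ${#url} bytes"
```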

I know curl may not be the right tool for production, but this is
initially just to get some data indexed for testing and demos.

Thanks
Ross

Re: Solr Cell - PDFs plus literal metadata - GET or POST ?

Posted by Ross <te...@gmail.com>.
On Tue, Jan 5, 2010 at 2:25 PM, Giovanni Fernandez-Kincade
<gf...@capitaliq.com> wrote:
> Really? Doesn't it have to be delimited differently, if both the file contents and the document metadata will be part of the POST data? How does Solr Cell tell the difference between the literals and the start of the file? I've tried this before and haven't had any luck with it.

Thanks Shalin.

And Giovanni, yes it definitely works.

This will set literal.mydata to the contents of mydata.txt:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
  -F "myfile=@tutorial.html" -F "literal.mydata=<mydata.txt"
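
Each -F option becomes its own part of a multipart/form-data request body, separated by a boundary string, which is how the file and the literal fields stay distinct. For generating many such commands, a minimal sketch follows. It assumes a hypothetical layout in which every PDF in docs/ has a sibling .txt file holding its metadata; the directory, the index_pdf helper and the DRY_RUN switch are illustrative names, not part of curl or Solr. The id is taken from the file name and is not URL-encoded here.

```shell
#!/bin/sh
# Endpoint and query parameters as used earlier in this thread.
SOLR_URL="http://localhost:8983/solr/update/extract"
DRY_RUN="${DRY_RUN:-1}"   # default: print the commands instead of running them

# index_pdf is a hypothetical helper: docs/report.pdf pairs with docs/report.txt.
index_pdf() {
  pdf=$1
  meta="${pdf%.pdf}.txt"
  id=$(basename "$pdf" .pdf)
  # -F "name=@file" uploads the file; -F "name=<file" sends its contents as a plain field.
  set -- curl "${SOLR_URL}?literal.id=${id}&uprefix=attr_&fmap.content=attr_content&commit=true" \
    -F "myfile=@${pdf}" -F "literal.mydata=<${meta}"
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

for f in docs/*.pdf; do
  [ -e "$f" ] || continue   # skip when the glob matches nothing
  index_pdf "$f"
done
```

Running it with DRY_RUN=0 would send the requests instead of printing them.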

Unfortunately I could not get the UTF-8 encoding to work properly.
It's probably a curl or OS configuration issue. I tried mydata.txt
with and without a BOM, and when I run "more mydata.txt" the special
characters display correctly on my terminal (set to UTF-8), but they
get garbled when indexed.
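
One way to narrow down an encoding problem like this is to confirm what is actually in the file before it is sent. The sketch below assumes iconv, head and od are installed; check_utf8 and has_bom are hypothetical helper names. It does not fix the indexing problem: if the file passes both checks, the damage is more likely happening in transport or on the server side.

```shell
# check_utf8: report whether a file decodes cleanly as UTF-8.
check_utf8() {
  # iconv exits non-zero if the file contains bytes that are not valid UTF-8.
  if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
    echo "valid UTF-8"
  else
    echo "not valid UTF-8"
  fi
}

# has_bom: succeed if the file starts with a UTF-8 byte order mark.
has_bom() {
  # The UTF-8 BOM is the three bytes EF BB BF at the start of the file.
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Example use (mydata.txt is the file name from this thread):
#   check_utf8 mydata.txt
#   has_bom mydata.txt && echo "file starts with a BOM"
```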

I gave up in the end and went back to putting it URL-encoded in the URL.
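
For reference, a crude way to URL-encode a file's contents from the shell is sketched below. urlencode_file is a hypothetical helper, not a curl feature. It percent-encodes every byte, including plain letters, which is valid but triples the length, so roughly 2.5 KB of metadata is about the most that fits under an 8 KB URL limit. It also assumes the file is already UTF-8; whether the server decodes the query string as UTF-8 depends on the servlet container's configuration.

```shell
# urlencode_file: print the contents of a file with every byte percent-encoded.
urlencode_file() {
  # od prints each byte as two hex digits; tr joins them into one line;
  # echo adds a final newline so sed sees a complete line;
  # sed prefixes every pair of hex digits with "%".
  { od -An -v -tx1 "$1" | tr -d ' \n'; echo; } | sed 's/../%&/g'
}

# Example use (not run here, it needs a Solr server):
#   curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.mydata=$(urlencode_file mydata.txt)" \
#     -F "myfile=@tutorial.html"
```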

Ross

RE: Solr Cell - PDFs plus literal metadata - GET or POST ?

Posted by Giovanni Fernandez-Kincade <gf...@capitaliq.com>.
Really? Doesn't it have to be delimited differently, if both the file contents and the document metadata will be part of the POST data? How does Solr Cell tell the difference between the literals and the start of the file? I've tried this before and haven't had any luck with it. 

Re: Solr Cell - PDFs plus literal metadata - GET or POST ?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Dec 30, 2009 at 7:49 AM, Ross <te...@gmail.com> wrote:

> Hi all
>
> I'm experimenting with Solr. I've successfully indexed some PDFs and
> all looks good but now I want to index some PDFs with metadata pulled
> from another source. I see this example in the docs.
>
> curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" \
>   -F "tutorial=@tutorial.pdf"
>
> I can write code to generate a script with those commands substituting
> my own literal.whatever.  My metadata could be up to a couple of KB in
> size. Is there a way of making the literal a POST variable rather than
> a GET?


With Curl? Yes, see the man page.


>  Will Solr Cell accept it as a POST?


Yes, it will.

-- 
Regards,
Shalin Shekhar Mangar.