Posted to solr-user@lucene.apache.org by Karthik Shiraly <ka...@gmail.com> on 2011/03/10 06:29:29 UTC

Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

Hi,

I'm using Solr 1.4.1.
The scenario involves a user uploading multiple files. Their content is
extracted using SolrCell and then indexed by Solr along with other
information about the user.

ContentStreamUpdateRequest seemed like the right choice for this - use
addFile() to send file data, and use setParam() to add normal data fields.

However, when I call addFile() multiple times on a ContentStreamUpdateRequest,
I observe that on the server side even the file parts of the multipart POST
are interpreted as regular form fields by the FileUpload component.
I found that FileUpload does so because the "filename" value in the
"Content-Disposition" header of each part is not being set.
Digging a bit further, the actual root cause seems to be in the client-side
SolrJ API: the CommonsHttpSolrServer class does not set the "filename"
value in the "Content-Disposition" header when creating the multipart Part
instances (from the HttpClient framework).

I worked around this with a hack: in the CommonsHttpSolrServer.request()
method, where the PartBase instances are created, I overrode
"sendDispositionHeader()" so that it also writes a "filename" value. That
solved the problem.
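
For anyone following along, here is a minimal sketch of the behaviour in
plain Java. It is a simplified model, not the actual HttpClient or FileUpload
source: the names DispositionSketch, Part, FilePartWithName, fileNameOf and
isFormField are made up for illustration, and the rule "no filename means
form field" is my understanding of how FileUpload classifies parts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model only; hypothetical names, not real SolrJ/HttpClient/FileUpload classes.
class DispositionSketch {

    // Models a multipart part whose header carries only the field name.
    static class Part {
        final String name;

        Part(String name) {
            this.name = name;
        }

        String dispositionHeader() {
            return "Content-Disposition: form-data; name=\"" + name + "\"";
        }
    }

    // Models the workaround: override the header method and append "filename".
    static class FilePartWithName extends Part {
        final String fileName;

        FilePartWithName(String name, String fileName) {
            super(name);
            this.fileName = fileName;
        }

        @Override
        String dispositionHeader() {
            return super.dispositionHeader() + "; filename=\"" + fileName + "\"";
        }
    }

    // Matches a filename parameter, but not a field that is merely named "filename".
    private static final Pattern FILENAME =
            Pattern.compile(";\\s*filename=\"([^\"]*)\"");

    // Returns the filename parameter of the header, or null if there is none.
    static String fileNameOf(String header) {
        Matcher m = FILENAME.matcher(header);
        return m.find() ? m.group(1) : null;
    }

    // Assumed server-side rule: a part without a filename is a plain form field.
    static boolean isFormField(String header) {
        return fileNameOf(header) == null;
    }

    public static void main(String[] args) {
        String plain = new Part("report").dispositionHeader();
        String named = new FilePartWithName("report", "report.pdf").dispositionHeader();
        System.out.println(plain + " -> form field? " + isFormField(plain));
        System.out.println(named + " -> form field? " + isFormField(named));
    }
}
```

In this model the first part is treated as a form field and the second as a
file, which matches what I saw before and after the hack.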

However, my questions are:
1. Am I using ContentStreamUpdateRequest wrong, or is this actually a bug?
Should I be using something else?

2. My end goal is to map the contents of each file to *separate* fields, not
a common field. Since the regular ExtractingRequestHandler maps all content
to just one field, I believe I have to create a custom RequestHandler
(possibly reusing existing SolrCell classes).
Is this approach right?

Thanks
Karthik

Re: Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

Posted by Karthik Shiraly <ka...@gmail.com>.
In case the exact problem was not clear:
Because FileUpload interprets the file data as regular form fields, Solr
thinks there are no content streams in the request and throws a
"missing_content_stream" exception.
