Posted to solr-user@lucene.apache.org by kgoess <kn...@goess.org> on 2011/10/28 20:16:21 UTC

form-data post to ExtractingRequestHandler with utf-8 characters not handled

I'm trying to post a PDF along with a whole bunch of metadata fields to the
ExtractingRequestHandler as multipart/form-data.   It works fine except for
the utf-8 character handling.  Here is what my post looks like (abridged):

   POST /solr/update/extract HTTP/1.1
   TE: deflate,gzip;q=0.3
   Connection: TE, close
   Host: localhost:8983
   Content-Length: 21418
   Content-Type: multipart/form-data; boundary=wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
   
   --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
   Content-Disposition: form-data; name=literal.title

   smart >>‘<< quote
   --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
   
   Content-Disposition: form-data; name="myfile"; filename="text.pdf.1174588823"
   Content-Type: application/pdf
   Content-Transfer-Encoding: binary

   ...binary pdf data

I've verified on the network that the quote character, a LEFT SINGLE
QUOTATION MARK (U+2018), is going across the wire as the UTF-8 bytes "e2 80
98", which is correct.  However, when I search for the document in Solr, it
comes back as the byte sequence "c3 a2 c2 80 c2 98", which I'm guessing means
it is being double-UTF-8-encoded.
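
The double-encoding guess can be reproduced offline. Below is a minimal
sketch, assuming only the core Encode module; hex_bytes is a helper made up
for this illustration. It treats the three UTF-8 bytes as if they were
ISO-8859-1 characters and then encodes the result as UTF-8 again, which is
one plausible way to end up with the six bytes seen in the index.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# U+2018 as it travels on the wire: three UTF-8 bytes.
my $wire = encode('UTF-8', "\x{2018}");

# A receiver that assumes ISO-8859-1 sees three separate characters.
my $misread = decode('ISO-8859-1', $wire);

# Writing those three characters back out as UTF-8 doubles the encoding.
my $stored = encode('UTF-8', $misread);

print hex_bytes($wire),   "\n";   # e2 80 98
print hex_bytes($stored), "\n";   # c3 a2 c2 80 c2 98
```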

The multipart/form-data is MIME, which is supposed to be 7-bit, so I've
tried encoding any non-ASCII fields as quoted-printable:

   Content-Disposition: form-data; name=literal.title
   Content-Transfer-Encoding: quoted-printable

   smart >>=E2=80=98<< quote=

as well as base64:

   Content-Disposition: form-data; name=literal.title
   Content-Transfer-Encoding: base64

   c21hcnQgPj7igJg8PCBxdW90ZSBmb29iYXI=

but what Solr puts in its index is just that literal value; it decodes
neither the quoted-printable nor the base64.  I've also tried encoding the
UTF-8 values as HTML entities, but Solr doesn't unescape those either, so
any accented characters are stored as the HTML entities, not as the Unicode
characters.
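
As a sanity check that the base64 payload itself is well-formed, here is a
small sketch using the core MIME::Base64 and Encode modules. It decodes the
sample body shown above, which comes out as the title text (this particular
sample also carries a trailing " foobar"). That the index holds the raw
base64 string instead suggests the Content-Transfer-Encoding header is
simply not being acted on for these parts.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

# The base64 body from the example part above.
my $body = 'c21hcnQgPj7igJg8PCBxdW90ZSBmb29iYXI=';

my $bytes = decode_base64($body);      # raw UTF-8 bytes
my $text  = decode('UTF-8', $bytes);   # Perl character string

# Encode on output so the quotation mark prints cleanly.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```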

Can anybody give me any pointers as to where I might be going wrong, where
to look for solutions, or any different/better ways to handle this?

Thanks!



--
View this message in context: http://lucene.472066.n3.nabble.com/form-data-post-to-ExtractingRequestHandler-with-utf-8-characters-not-handled-tp3461731p3461731.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: form-data post to ExtractingRequestHandler with utf-8 characters not handled

Posted by kgoess <kn...@goess.org>.
I finally managed to answer my own question. UTF-8 data in the body is OK,
but you need to specify charset=utf-8 in the Content-Type header of each
part to tell the receiver (Solr) that it isn't the default ISO-8859-1:

   Content-Disposition: form-data; name=literal.bptitle
   Content-Type: text/plain; charset=utf-8

   accented séance ghosts
   --W76L1XO3T9bSMjapwVc9MgXQDNwQ4DBKgevNArdl

References:
The default charset is ISO-8859-1:
http://tools.ietf.org/html/rfc2616#section-3.7.1
How to set the charset for multipart/form-data:
http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.2

And if anybody's curious, here's how you specify that in Perl and send a PDF
to the /update/extract Solr Cell handler:

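    use utf8;                      # the source below contains a literal é
    use Encode qw(encode);
    use LWP::UserAgent;

    # Assumed setup, not shown in the original snippet; adjust as needed.
    my $ua          = LWP::UserAgent->new;
    my $extract_uri = 'http://localhost:8983/solr/update/extract';
    my $path        = 'text.pdf';  # placeholder path to the PDF

    my %form_fields = (
        title  => 'accented séance ghosts',
        author => 'smith',
    );

    my @content;

    while (my ($field, $value) = each %form_fields) {
        if ($value =~ /^[[:ascii:]]+$/) {
            # Plain ASCII needs no charset declaration.
            push @content, "literal.$field" => $value;
        }
        else {
            # Non-ASCII: send as its own part with an explicit charset.
            push @content, "literal.$field" => [
                undef,
                "literal.$field",
                'Content-Type'        => 'text/plain; charset=utf-8',
                'Content-Disposition' => "form-data; name=literal.$field",
                'Content'             => encode('utf-8-strict', $value),
            ];
        }
    }

    # The PDF itself goes in as a binary file part.
    push @content, myfile => [
        $path, undef,
        'Content-Type'              => 'application/pdf',
        'Content-Transfer-Encoding' => 'binary',
    ];

    local $HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;

    my $response = $ua->post(
        $extract_uri,
        Content_Type => 'form-data',
        Content      => \@content,
    );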
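
To see what the per-part charset header looks like on the wire without
going through LWP, here is a rough sketch that builds a single text part by
hand using only the core Encode module. form_part and the boundary string
are names invented for this illustration; a real request would also need
the file part and the closing boundary line.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build one multipart/form-data text part whose
# body is UTF-8 and whose headers say so.
sub form_part {
    my ($boundary, $name, $value) = @_;
    return join "\r\n",
        "--$boundary",
        qq{Content-Disposition: form-data; name="$name"},
        'Content-Type: text/plain; charset=utf-8',
        '',                          # blank line ends the part headers
        encode('UTF-8', $value),     # body as UTF-8 bytes
        '';                          # trailing CRLF before next boundary
}

# \x{e9} is the accented e, written as an escape to keep this file ASCII.
my $part = form_part('example-boundary', 'literal.title',
                     "accented s\x{e9}ance ghosts");
print $part;
```

The point of the sketch is only that the charset travels with each text
part; the request-level Content-Type still just names the boundary.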




--
View this message in context: http://lucene.472066.n3.nabble.com/form-data-post-to-ExtractingRequestHandler-with-utf-8-characters-not-handled-tp3461731p3474450.html