Posted to solr-user@lucene.apache.org by kgoess <kn...@goess.org> on 2011/10/28 20:16:21 UTC
form-data post to ExtractingRequestHandler with utf-8 characters
not handled
I'm trying to post a PDF along with a whole bunch of metadata fields to the
ExtractingRequestHandler as multipart/form-data. It works fine except for
the utf-8 character handling. Here is what my post looks like (abridged):
POST /solr/update/extract HTTP/1.1
TE: deflate,gzip;q=0.3
Connection: TE, close
Host: localhost:8983
Content-Length: 21418
Content-Type: multipart/form-data;
boundary=wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
--wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
Content-Disposition: form-data; name=literal.title
smart >>‘<< quote
--wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
Content-Disposition: form-data; name="myfile";
filename="text.pdf.1174588823"
Content-Type: application/pdf
Content-Transfer-Encoding: binary
...binary pdf data
I've verified on the network that the quote character, a LEFT SINGLE
QUOTATION MARK (U+2018), is going across the wire as the UTF-8 bytes "e2 80
98", which is correct. However, when I search for the document in Solr, it
comes back as the byte sequence "c3 a2 c2 80 c2 98", which I'm guessing means
it is being UTF-8 encoded twice.
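The byte pattern reported here is what you get if the UTF-8 bytes are read as ISO-8859-1 and then written out as UTF-8 again. A minimal Python sketch of that assumption (Python is used only for illustration; nothing here is Solr code, and the variable names are made up):

```python
# Sketch: reproduce the suspected double encoding of U+2018.
# Assumption: the receiver decodes the UTF-8 bytes as ISO-8859-1.
original = "\u2018"                         # LEFT SINGLE QUOTATION MARK
wire_bytes = original.encode("utf-8")       # bytes sent across the wire
misread = wire_bytes.decode("iso-8859-1")   # three Latin-1 characters, not one
stored = misread.encode("utf-8")            # bytes after re-encoding as UTF-8

print(wire_bytes.hex(" "))  # e2 80 98
print(stored.hex(" "))      # c3 a2 c2 80 c2 98
```

The second line matches the bytes coming back from the search, which points at a charset mismatch on the receiving side rather than at the bytes being sent.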
multipart/form-data is MIME, which is supposed to be 7-bit, so I've
tried encoding any non-ASCII fields as quoted-printable:
Content-Disposition: form-data; name=literal.title
Content-Transfer-Encoding: quoted-printable
smart >>=E2=80=98<< quote=
as well as base64:
Content-Disposition: form-data; name=literal.title
Content-Transfer-Encoding: base64
c21hcnQgPj7igJg8PCBxdW90ZSBmb29iYXI=
but what Solr puts in its index is just that literal value; it doesn't decode
either the quoted-printable or the base64. I've also tried encoding the UTF-8
values as HTML entities, but Solr doesn't unescape those either, so any
accented characters are stored as the HTML entities, not as the Unicode
characters.
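As a sanity check, the quoted-printable and base64 values quoted in this message can be decoded locally to see what a receiver would recover if it honoured Content-Transfer-Encoding. This is a rough Python sketch; the CRLF appended after the trailing "=" is an assumption about how that line ended on the wire.

```python
import base64
import quopri

# Values copied from the two attempts in this message. The "\r\n" after the
# trailing "=" is assumed: together they form a quoted-printable soft line break.
qp_value = b"smart >>=E2=80=98<< quote=\r\n"
b64_value = b"c21hcnQgPj7igJg8PCBxdW90ZSBmb29iYXI="

# Undo the transfer encoding, then interpret the resulting bytes as UTF-8.
qp_text = quopri.decodestring(qp_value).decode("utf-8")
b64_text = base64.b64decode(b64_value).decode("utf-8")

print(qp_text)
print(b64_text)
```

Both values decode to text containing U+2018, so the encodings themselves are well formed; the receiver is simply not decoding them.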
Can anybody give me any pointers as to where I might be going wrong, where
to look for solutions, or any different/better ways to handle this?
Thanks!
--
View this message in context: http://lucene.472066.n3.nabble.com/form-data-post-to-ExtractingRequestHandler-with-utf-8-characters-not-handled-tp3461731p3461731.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: form-data post to ExtractingRequestHandler with utf-8
characters not handled
Posted by kgoess <kn...@goess.org>.
I finally managed to answer my own question. UTF-8 data in the body is OK,
but you need to specify charset=utf-8 in the Content-Type header of each
part, to tell the receiver (Solr) that the part is not in the default
ISO-8859-1:
Content-Disposition: form-data; name=literal.bptitle
Content-Type: text/plain; charset=utf-8
accented séance ghosts
--W76L1XO3T9bSMjapwVc9MgXQDNwQ4DBKgevNArdl
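To make the layout of such a part concrete, here is a rough Python sketch that assembles a multipart/form-data body by hand, giving every text part its own charset=utf-8 header. The helper name build_form_data and the fixed boundary are hypothetical, and the file part is left out for brevity.

```python
import uuid


def build_form_data(fields, boundary=None):
    """Return (content_type, body) for a multipart/form-data request.

    Each text part gets its own 'Content-Type: text/plain; charset=utf-8'
    header, and its value is encoded as UTF-8.
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        lines.append(delimiter)
        lines.append(f'Content-Disposition: form-data; name="{name}"'.encode("ascii"))
        lines.append(b"Content-Type: text/plain; charset=utf-8")
        lines.append(b"")  # blank line separates the part headers from the value
        lines.append(value.encode("utf-8"))
    lines.append(delimiter + b"--")  # closing boundary
    lines.append(b"")
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)


content_type, body = build_form_data(
    {"literal.bptitle": "accented séance ghosts"},
    boundary="example-boundary",  # hypothetical fixed boundary, for readability
)
print(content_type)
print(body.decode("utf-8"))
```

The returned content_type string would go in the request's top-level Content-Type header, while the charset lives in the header of each individual part.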
References:
The default charset is ISO-8859-1:
http://tools.ietf.org/html/rfc2616#section-3.7.1
How to set the charset for multipart/form-data:
http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.2
And if anybody's curious, here's how you specify that in Perl and send a pdf
to the /update/extract solr-cell handler:
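use utf8;                # this source contains a UTF-8 string literal
use Encode qw(encode);
use LWP::UserAgent;

# $ua (an LWP::UserAgent), $path (the PDF file) and $extract_uri
# (the /update/extract URL) are assumed to be defined elsewhere.
my %form_fields = (
    title  => 'accented séance ghosts',
    author => 'smith',
);
my @content;
while (my ($field, $value) = each %form_fields) {
    if ($value =~ /^[[:ascii:]]+$/) {
        # Pure ASCII: a plain form field needs no charset.
        push @content, "literal.$field" => $value;
    }
    else {
        # Non-ASCII: send the field as its own part with an explicit charset.
        push @content, "literal.$field" => [
            undef,
            "literal.$field",
            'Content-Type'        => 'text/plain; charset=utf-8',
            'Content-Disposition' => "form-data; name=literal.$field",
            'Content'             => encode('utf-8-strict', $value),
        ];
    }
}
# The PDF itself goes in as a file part.
push @content, myfile => [
    $path, undef,
    'Content-Type'              => 'application/pdf',
    'Content-Transfer-Encoding' => 'binary',
];
# Stream the file from disk instead of reading it all into memory.
local $HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;
my $response = $ua->post(
    $extract_uri,
    Content_Type => 'form-data',
    Content      => \@content,
);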
--
View this message in context: http://lucene.472066.n3.nabble.com/form-data-post-to-ExtractingRequestHandler-with-utf-8-characters-not-handled-tp3461731p3474450.html