Posted to solr-user@lucene.apache.org by kgoess <kn...@goess.org> on 2011/11/02 18:07:29 UTC
Re: form-data post to ExtractingRequestHandler with utf-8 characters not handled
I finally managed to answer my own question. UTF-8 data in the body is OK,
but you need to specify charset=utf-8 in the Content-Type header of each
part to tell the receiver (Solr) that it's not the default ISO-8859-1:
Content-Disposition: form-data; name=literal.bptitle
Content-Type: text/plain; charset=utf-8

accented séance ghosts
--W76L1XO3T9bSMjapwVc9MgXQDNwQ4DBKgevNArdl
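To see why the charset parameter matters, here is a minimal sketch using
Perl's core Encode module. It encodes a sample string as UTF-8 and then
decodes those bytes twice: once as ISO-8859-1, which is what a receiver falls
back to when no charset is declared, and once as UTF-8. The variable names are
only illustrative and are not part of my actual script.

```perl
use strict;
use warnings;
use utf8;                       # this source file contains a literal é
use Encode qw(encode decode);

# The bytes that actually travel in the request body.
my $bytes = encode('UTF-8', 'séance');

# What a receiver ends up with if it assumes the ISO-8859-1 default:
# each of the two UTF-8 bytes for é becomes its own character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# What it ends up with when the part declares charset=utf-8.
my $as_utf8 = decode('UTF-8', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
printf "bytes sent:    %d\n", length $bytes;
printf "as ISO-8859-1: %s (%d chars)\n", $as_latin1, length $as_latin1;
printf "as UTF-8:      %s (%d chars)\n", $as_utf8, length $as_utf8;
```

The ISO-8859-1 reading comes out one character longer than the original
string, which is the kind of garbling that showed up in the indexed field.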
References:
The default charset for text media types in HTTP is ISO-8859-1:
http://tools.ietf.org/html/rfc2616#section-3.7.1
How to set the charset for multipart/form-data:
http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.2
And if anybody's curious, here's how you specify that in Perl and send a PDF
to the /update/extract Solr Cell handler:
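use utf8;               # only needed because of the literal é below
use Encode qw(encode);
# $ua (an LWP::UserAgent), $extract_uri and $path are assumed to be set up already.

my %form_fields = (
    title  => 'accented séance ghosts',
    author => 'smith',
);
my @content;
while (my ($field, $value) = each %form_fields) {
    if ($value =~ /^[[:ascii:]]+$/) {
        # Pure ASCII: a plain key/value pair is enough.
        push @content, "literal.$field" => $value;
    }
    else {
        # Non-ASCII: send the field as its own part with an explicit charset.
        push @content, "literal.$field" => [
            undef,
            "literal.$field",
            "Content-Type"        => 'text/plain; charset=utf-8',
            "Content-Disposition" => "form-data; name=literal.$field",
            "Content"             => encode('utf-8-strict', $value),
        ];
    }
}
# The PDF itself goes in as a file-upload part.
push @content, (
    myfile => [
        $path, undef,
        'Content-Type'              => 'application/pdf',
        'Content-Transfer-Encoding' => 'binary',
    ],
);
# Have the file read from disk at send time instead of slurping it up front.
local $HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;
my $response = $ua->post(
    $extract_uri,
    Content_Type => 'form-data',
    Content      => \@content,
);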
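Before pointing this at a live Solr instance, it can help to confirm that the
part headers come out as intended. The sketch below builds a request with
POST from HTTP::Request::Common, without sending it, and dumps the serialized
request so you can check that the part carries charset=utf-8. The URL and the
single field are placeholders I made up for the example, not values from my
setup.

```perl
use strict;
use warnings;
use utf8;                               # literal é in the source
use Encode qw(encode);
use HTTP::Request::Common qw(POST);

# Placeholder endpoint (an assumption); nothing is sent to it here.
my $extract_uri = 'http://localhost:8983/solr/update/extract';

# POST() only constructs an HTTP::Request object.
my $request = POST(
    $extract_uri,
    Content_Type => 'form-data',
    Content      => [
        'literal.title' => [
            undef,                      # no file to read
            undef,                      # no filename parameter
            'Content-Type' => 'text/plain; charset=utf-8',
            'Content'      => encode('UTF-8', 'accented séance ghosts'),
        ],
    ],
);

# Dump headers and body; look for the per-part Content-Type line.
my $dump = $request->as_string;
print $dump;
```

If the per-part Content-Type line with charset=utf-8 is missing from the dump,
Solr will fall back to ISO-8859-1 for that field.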