Posted to solr-user@lucene.apache.org by kgoess <kn...@goess.org> on 2011/11/02 18:07:29 UTC

Re: form-data post to ExtractingRequestHandler with utf-8 characters not handled

I finally managed to answer my own question. UTF-8 data in the body is fine,
but each part needs charset=utf-8 in its own Content-Type header to tell the
receiver (Solr) that the part is not in the default ISO-8859-1:

   Content-Disposition: form-data; name=literal.bptitle
   Content-Type: text/plain; charset=utf-8

   accented séance ghosts
   --W76L1XO3T9bSMjapwVc9MgXQDNwQ4DBKgevNArdl
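
To see why the per-part charset matters, here is a minimal sketch using only the core Encode module. It does not talk to Solr; read_part is a hypothetical helper that stands in for whatever the receiver does when it decodes a part's bytes, and the fallback to ISO-8859-1 is the behaviour described above.

```perl
use strict;
use warnings;
use utf8;                          # assumes this file is saved as UTF-8
use Encode qw(encode decode);

# Hypothetical helper (not part of Solr or LWP): decode a part's bytes
# with the charset the receiver believes applies to that part.
sub read_part {
    my ($bytes, $charset) = @_;
    return decode($charset, $bytes);
}

my $title = 'accented séance ghosts';
my $bytes = encode('UTF-8', $title);   # the bytes that go over the wire

# No charset on the part: the receiver falls back to ISO-8859-1.
my $wrong = read_part($bytes, 'ISO-8859-1');

# charset=utf-8 on the part: the receiver decodes the same bytes correctly.
my $right = read_part($bytes, 'UTF-8');

binmode STDOUT, ':encoding(UTF-8)';
print "without charset: $wrong\n";
print "with charset:    $right\n";
```

The bytes are identical in both cases; only the label changes. Read as ISO-8859-1, the two bytes of the é come out as two separate characters ("sÃ©ance"), which is the kind of corruption that shows up in the index when the header is missing.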

References:
The default charset is ISO-8859-1:
http://tools.ietf.org/html/rfc2616#section-3.7.1
How to set the charset for multipart/form-data:
http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.2

And if anybody's curious, here's how you specify that in Perl and send a PDF
to the /update/extract (Solr Cell) handler:

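    use utf8;                # assumes this file is saved as UTF-8 (literal é below)
    use Encode qw(encode);
    use LWP::UserAgent;      # post() builds the request via HTTP::Request::Common

    # $ua (an LWP::UserAgent), $path (the PDF to upload) and $extract_uri
    # (the /update/extract URL) are set up elsewhere.

    my %form_fields = (
        title  => 'accented séance ghosts',
        author => 'smith',
    );

    my @content;

    while (my ($field, $value) = each %form_fields) {
        if ($value =~ /^[[:ascii:]]+$/) {
            # Plain ASCII needs no special handling.
            push @content, "literal.$field" => $value;
        }
        else {
            # Non-ASCII: describe the part by hand so it carries its own charset.
            push @content, "literal.$field" => [
                undef,
                "literal.$field",
                'Content-Type'        => 'text/plain; charset=utf-8',
                'Content-Disposition' => "form-data; name=literal.$field",
                'Content'             => encode('utf-8-strict', $value),
            ];
        }
    }

    push @content, (
        myfile => [
            $path, undef,
            'Content-Type'              => 'application/pdf',
            'Content-Transfer-Encoding' => 'binary',
        ],
    );

    # Read the file on demand instead of loading it all into memory.
    local $HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;

    my $response = $ua->post(
        $extract_uri,
        Content_Type => 'form-data',
        Content      => \@content,
    );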



