Posted to dev@tika.apache.org by "Mariusz Cieślukowski (Jira)" <ji...@apache.org> on 2020/05/11 12:06:00 UTC

[jira] [Updated] (TIKA-3100) RFC822Parser ignore charset when extractAllAlternatives set to true

     [ https://issues.apache.org/jira/browse/TIKA-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mariusz Cieślukowski updated TIKA-3100:
---------------------------------------
    Labels: rfc822parser  (was: )

> RFC822Parser ignore charset when extractAllAlternatives set to true
> -------------------------------------------------------------------
>
>                 Key: TIKA-3100
>                 URL: https://issues.apache.org/jira/browse/TIKA-3100
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>         Environment:  
> Windows 10 x64
> OpenJDK 14
>            Reporter: Mariusz Cieślukowski
>            Priority: Major
>              Labels: rfc822parser
>         Attachments: testRFC822_quoted_charset_iso_8859_2
>
>
> In default mode, RFC822Parser seems to ignore the charset defined in the headers when detecting the content. When I set "extractAllAlternatives" to false, the content seems fine.
> Test case:
> {code:java}
>     @Test
>     public void testQuotedPrintableCharset() {
>         Metadata metadata = new Metadata();
>         InputStream stream = getStream("test-documents/testRFC822_quoted_charset_iso_8859_2");
>         ContentHandler handler = new BodyContentHandler();
>         ParseContext context = new ParseContext();
>         
>         try {
>             RFC822Parser emailparser = new RFC822Parser();
>             emailparser.setExtractAllAlternatives(true);            
>             emailparser.parse(stream, handler, metadata, context);
>             String bodyText = handler.toString();
>             assertTrue(bodyText.contains("Dzie\u0144 dobry."));
>             
>         } catch (Exception e) {
>             fail("Exception thrown: " + e.getMessage());
>         }
>     }
> {code}
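
The symptom described above can be reproduced outside Tika with a small sketch. It assumes the attached message body is quoted-printable text declared as ISO-8859-2 (inferred only from the attachment name, not confirmed by the report); the sample string "Dzie=F1 dobry." and the class and method names are hypothetical. It shows that the same bytes yield "Dzień dobry." only when decoded with the declared charset, which is what the test's assertion relies on.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of Tika.
class CharsetDemo {

    // Minimal quoted-printable decoder: handles "=XX" hex escapes only
    // (soft line breaks and other edge cases are deliberately ignored).
    static byte[] decodeQuotedPrintable(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '=' && i + 2 < encoded.length()) {
                // Two hex digits after '=' form one raw byte.
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Decode a quoted-printable string using the given charset.
    static String decode(String encoded, Charset charset) {
        return new String(decodeQuotedPrintable(encoded), charset);
    }

    public static void main(String[] args) {
        String body = "Dzie=F1 dobry."; // assumed sample; 0xF1 is U+0144 in ISO-8859-2

        // Honoring the declared charset gives the expected text.
        String withDeclared = decode(body, Charset.forName("ISO-8859-2"));
        // Ignoring it and falling back to ISO-8859-1 maps 0xF1 to U+00F1 instead.
        String withFallback = decode(body, StandardCharsets.ISO_8859_1);

        System.out.println(withDeclared.contains("Dzie\u0144 dobry."));
        System.out.println(withFallback.contains("Dzie\u0144 dobry."));
    }
}
```

If the parser behaves like the fallback branch when extractAllAlternatives is true, the assertion in the test case above would fail in the same way.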



--
This message was sent by Atlassian Jira
(v8.3.4#803005)