Posted to dev@tika.apache.org by "Mariusz Cieślukowski (Jira)" <ji...@apache.org> on 2020/05/11 12:06:00 UTC
[jira] [Updated] (TIKA-3100) RFC822Parser ignore charset when
extractAllAlternatives set to true
[ https://issues.apache.org/jira/browse/TIKA-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mariusz Cieślukowski updated TIKA-3100:
---------------------------------------
Labels: rfc822parser (was: )
> RFC822Parser ignore charset when extractAllAlternatives set to true
> -------------------------------------------------------------------
>
> Key: TIKA-3100
> URL: https://issues.apache.org/jira/browse/TIKA-3100
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Environment:
> Windows 10 x64
> OpenJDK 14
> Reporter: Mariusz Cieślukowski
> Priority: Major
> Labels: rfc822parser
> Attachments: testRFC822_quoted_charset_iso_8859_2
>
>
> In default mode, RFC822Parser seems to ignore the charset defined in the headers when it detects the content. When I set "extractAllAlternatives" to false, the content seems fine.
> Test case:
> {code:java}
> @Test
> public void testQuotedPrintableCharset() {
>     Metadata metadata = new Metadata();
>     InputStream stream = getStream("test-documents/testRFC822_quoted_charset_iso_8859_2");
>     ContentHandler handler = new BodyContentHandler();
>     ParseContext context = new ParseContext();
>
>     try {
>         RFC822Parser emailparser = new RFC822Parser();
>         emailparser.setExtractAllAlternatives(true);
>         emailparser.parse(stream, handler, metadata, context);
>         String bodyText = handler.toString();
>         assertTrue(bodyText.contains("Dzie\u0144 dobry."));
>
>     } catch (Exception e) {
>         fail("Exception thrown: " + e.getMessage());
>     }
> }
> {code}
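To illustrate why the declared charset matters for the assertion in the test case above, here is a minimal JDK-only sketch. It does not use Tika and does not reproduce the parser's internals. The sample body "Dzie=F1 dobry." is an assumption inferred from the attachment name and the asserted string, since the attachment's contents are not shown in this message. The class name, the helper methods, and the choice of ISO-8859-1 as the "wrong" charset are likewise hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
public class QuotedPrintableCharsetDemo {

    // Minimal quoted-printable decoder for illustration only.
    // It handles "=XX" hex escapes and plain ASCII characters;
    // soft line breaks ("=" at end of line) are NOT handled.
    static byte[] decodeQuotedPrintable(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '=' && i + 2 < encoded.length()) {
                // Two hex digits follow the '=' sign.
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Decode the quoted-printable text, then interpret the bytes in the given charset.
    static String decode(String encoded, Charset charset) {
        return new String(decodeQuotedPrintable(encoded), charset);
    }

    public static void main(String[] args) {
        // Assumed sample body: byte 0xF1 is U+0144 in ISO-8859-2.
        String body = "Dzie=F1 dobry.";

        // Honouring the declared charset yields the expected text.
        String honoured = decode(body, Charset.forName("ISO-8859-2"));
        // Ignoring it (here: assuming ISO-8859-1) maps 0xF1 to U+00F1 instead.
        String ignored = decode(body, StandardCharsets.ISO_8859_1);

        System.out.println(honoured.contains("Dzie\u0144 dobry.")); // true
        System.out.println(ignored.contains("Dzie\u0144 dobry."));  // false
    }
}
```

If the parser decodes the body with any charset other than the one declared in the part's headers, an assertion like the one in the test case fails in the same way as the second check here.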
--
This message was sent by Atlassian Jira
(v8.3.4#803005)