Posted to user@tika.apache.org by 肖金伟 <ji...@gmail.com> on 2013/04/09 07:52:23 UTC
Re: Tika 1.3 parse eml file and extracted text are garbled characters
Hello,
>
> I am using Tika 1.3 to extract text from eml files, and for some eml
> files the extracted text comes out as garbled characters.
> When I print out the detected charset, it does not match the charset
> declared in the eml file's Content-Type header.
>
> I cannot figure out the reason for this. Please let me know if any of you
> have seen this problem and how to fix it.
>
> Here is the Java code snippet:
>
> Parser parser = new RFC822Parser();
>
> ContentHandler body = new BodyContentHandler();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.RESOURCE_NAME_KEY, "7.eml");
>
> ParseContext context = new ParseContext();
> context.set(Parser.class, parser);
>
> InputStream stream = new FileInputStream("7.eml");
>
> try
> {
>     parser.parse(stream, body, metadata, context);
>     System.out.println(body.toString());
> }
> catch (Exception e)
> {
>     e.printStackTrace();
> }
>
> To help reproduce the problem, a test eml file is attached.
>
> Thanks.
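
To make the mismatch described above easier to check, here is a minimal
sketch that pulls the declared charset out of a Content-Type header value,
so it can be compared with whatever charset gets detected. It uses only the
JDK, not Tika. The class name DeclaredCharset and the method
fromContentType are hypothetical names made up for this illustration, and
the sample header values are invented, not taken from 7.eml.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
final class DeclaredCharset {

    // Matches a charset parameter such as charset=gb2312 or charset="UTF-8".
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i)\\bcharset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    // Returns the lower-cased charset named in a Content-Type header value,
    // or an empty Optional when the header is null or has no charset parameter.
    static Optional<String> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find()
                ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample header values (assumed, for demonstration).
        System.out.println(
                fromContentType("text/plain; charset=\"GB2312\"").orElse("(none)"));
        System.out.println(
                fromContentType("text/html").orElse("(none)"));
    }
}
```

If the declared name differs from the detected one, that at least confirms
the header is not what decides the decoding for that part of the message.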
--
*Name* : Tinyxiao *Email* : jinwei.xiao@gmail.com