Posted to user@tika.apache.org by 肖金伟 <ji...@gmail.com> on 2013/04/09 07:52:23 UTC

Re: Tika 1.3 parse eml file and extracted text are garbled characters

Hello,
>
> I am using Tika 1.3 to extract text from eml files, and for some files
> the extracted text comes out as garbled characters.
> When I print out the charset that was detected, it does not match the
> charset declared in the eml file's Content-Type header.
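
To illustrate the kind of mismatch described here, the following is a minimal sketch using only the JDK (no Tika). It assumes, purely for illustration, that the message bytes are UTF-8 and that the detector guessed ISO-8859-1; the class name CharsetMismatchDemo and the sample text are hypothetical and are not taken from 7.eml.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    /** Decodes raw bytes using the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Sample non-ASCII text, written with Unicode escapes so the
        // source file encoding does not matter.
        String original = "\u8096\u91d1\u4f1f";

        // Assumption: the message body was encoded as UTF-8.
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the declared charset recovers the text.
        String right = decode(raw, StandardCharsets.UTF_8);

        // Decoding with a different (wrongly detected) charset does not.
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println("declared charset round-trips: " + right.equals(original));
        System.out.println("wrong charset round-trips:    " + wrong.equals(original));
    }
}
```

If the output in the real case looks like the second decode, that would suggest the bytes themselves are intact and only the charset choice is wrong.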
>
> I cannot figure out the reason for this. Please let me know if any of you
> have seen this error and how to fix it.
>
> Here is the code snippet in Java:
>
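> import java.io.FileInputStream;
> import java.io.InputStream;
>
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.mail.RFC822Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
>
> public class ParseEml
> {
>     public static void main(String[] args)
>     {
>         // Use the mail parser directly rather than auto-detection.
>         Parser parser = new RFC822Parser();
>
>         // Collects the extracted body text.
>         ContentHandler body = new BodyContentHandler();
>         Metadata metadata = new Metadata();
>         metadata.set(Metadata.RESOURCE_NAME_KEY, "7.eml");
>
>         // Parser made available for any embedded documents.
>         ParseContext context = new ParseContext();
>         context.set(Parser.class, parser);
>
>         // try-with-resources closes the stream even if parsing fails.
>         try (InputStream stream = new FileInputStream("7.eml"))
>         {
>             parser.parse(stream, body, metadata, context);
>             System.out.println(body.toString());
>         }
>         catch (Exception e)
>         {
>             e.printStackTrace();
>         }
>     }
> }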
> 		
> To help reproduce this error, the test eml file is attached.
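
When reproducing with the test file, it may help to first confirm which charset the message itself declares, so it can be compared with whatever was detected. The sketch below is one way to do that with only the JDK. The class DeclaredCharset and the method fromContentType are hypothetical helpers, not part of Tika, and the regular expression is a simplification: it handles a single header value and is not a full MIME header parser.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeclaredCharset {

    // Matches charset=value, with optional double quotes around the value.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset named in a Content-Type header value,
     * or null if there is no charset parameter or the JVM does not support it.
     */
    static Charset fromContentType(String headerValue) {
        Matcher m = CHARSET_PARAM.matcher(headerValue);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical header value; substitute the one found in the eml file.
        String header = "text/plain; charset=\"UTF-8\"";
        System.out.println("declared charset: " + fromContentType(header));
    }
}
```

Note that a multipart message can carry a different Content-Type on each part, so the header to check is the one on the part whose text comes out garbled.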
>
> Thanks.
>



-- 
*Name* : Tinyxiao       *Email* : jinwei.xiao@gmail.com