Posted to user@poi.apache.org by babug <ba...@gmail.com> on 2012/04/16 15:23:52 UTC

How do i get actual html content of attached file

Hi,

I have attached an Outlook template file (Ticket_Diary.oft). I need to parse
files of this type and get the actual HTML content. I tested with the
following code, but the parser returns <p> tags instead of <table> or <div>
tags. How do I bypass the SAFE_ELEMENTS map?

String msgfile = "/home/test/Desktop/EmailParse/Ticket Diary.oft";
InputStream stream = new FileInputStream(msgfile);
StringWriter sw = new StringWriter();
Parser parser = new OfficeParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

SAXTransformerFactory factory =
        (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));

try {
    parser.parse(stream, handler, metadata, context);
} finally {
    stream.close();
}
String content = sw.toString();

Output Example :
============
*
Ticket Diary

<dl/>
<div class="message-body"><p>&nbsp; _____ &nbsp;
</p>

<p>&nbsp;</p>

<p>High Priority Notification *</p>


I have also attached a screenshot (TestTicket.jpg) showing how the .oft file
looks when it is opened in Outlook. I need to get the full table and style
tags that it contains. Can someone help me with this?

http://apache-poi.1045710.n5.nabble.com/file/n5643786/Ticket_Diary.oft
Ticket_Diary.oft 

http://apache-poi.1045710.n5.nabble.com/file/n5643786/TestTicket.jpg
TestTicket.jpg 

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/How-do-i-get-actual-html-content-of-attached-file-tp5643786p5643786.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org

Re: How do i get actual html content of attached file

Posted by babug <ba...@gmail.com>.

Thanks Nick,

any examples or path?

Re: How do i get actual html content of attached file

Posted by Nick Burch <ni...@alfresco.com>.

On 17/04/12 11:37, babug wrote:
> I have tried using POI, but I couldn't get the expected output. I get
> chunks of type RTF and BODY. The RtfContent has the following chunk:
> {\*\htmltag161 }. Could you please provide some sample code?

That's the "html" version of your file. Outlook often stores the rich 
text version as RTF and not HTML, and that's what your file has. There 
is no HTML version in your case, so if you want to see the formatting 
you need to work with the RTF.

Nick
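
For context on the {\*\htmltag161 } chunk: in RTF-encapsulated HTML (marked
by \fromhtml1 in the RTF header), the original HTML tags are carried inside
{\*\htmltagN ...} groups. Below is a minimal Java sketch of pulling those
groups out of an RTF string. It is not POI or Tika API: the class name
HtmlTagGroups and the method extract are made up for illustration. It assumes
the RTF body is already available as an uncompressed String, it recovers only
the tag fragments (not the text between them), and it decodes \'hh escapes as
Latin-1, which may not match the code page of a real message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of POI or Tika).
class HtmlTagGroups {
    private static final String MARKER = "{\\*\\htmltag";

    /** Returns the decoded content of each {\*\htmltagN ...} group, in document order. */
    static List<String> extract(String rtf) {
        List<String> fragments = new ArrayList<>();
        int pos = rtf.indexOf(MARKER);
        while (pos >= 0) {
            int i = pos + MARKER.length();
            // Skip the numeric parameter N and the single space that may end the control word.
            while (i < rtf.length() && Character.isDigit(rtf.charAt(i))) i++;
            if (i < rtf.length() && rtf.charAt(i) == ' ') i++;

            StringBuilder text = new StringBuilder();
            int depth = 1; // we are inside the group opened by MARKER
            while (i < rtf.length() && depth > 0) {
                char c = rtf.charAt(i);
                if (c == '{') { depth++; i++; }
                else if (c == '}') { depth--; i++; }
                else if (c == '\\' && i + 1 < rtf.length()) { i = readControl(rtf, i, text); }
                else {
                    if (c != '\r' && c != '\n') text.append(c); // raw line breaks carry no meaning in RTF
                    i++;
                }
            }
            fragments.add(text.toString());
            pos = rtf.indexOf(MARKER, i);
        }
        return fragments;
    }

    /** Consumes one control word or symbol starting at the backslash; returns the index just after it. */
    private static int readControl(String rtf, int i, StringBuilder text) {
        char next = rtf.charAt(i + 1);
        if (next == '{' || next == '}' || next == '\\') { // escaped literal character
            text.append(next);
            return i + 2;
        }
        if (next == '\'' && i + 3 < rtf.length()) { // \'hh hex escape (assumption: Latin-1)
            int hi = Character.digit(rtf.charAt(i + 2), 16);
            int lo = Character.digit(rtf.charAt(i + 3), 16);
            if (hi >= 0 && lo >= 0) text.append((char) (hi * 16 + lo));
            return i + 4;
        }
        if (!isAsciiLetter(next)) return i + 2; // any other control symbol, e.g. \*

        int j = i + 1;
        while (j < rtf.length() && isAsciiLetter(rtf.charAt(j))) j++;
        String word = rtf.substring(i + 1, j);
        if (j < rtf.length() && rtf.charAt(j) == '-') j++;               // optional sign of the parameter
        while (j < rtf.length() && Character.isDigit(rtf.charAt(j))) j++; // optional numeric parameter
        if (j < rtf.length() && rtf.charAt(j) == ' ') j++;               // optional delimiter space

        if (word.equals("par")) text.append('\n');       // simplification: a bare line feed
        else if (word.equals("tab")) text.append('\t');
        // every other control word is ignored in this sketch
        return j;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi\\fromhtml1 {\\*\\htmltag64 <table>}Cell{\\*\\htmltag161 }}";
        for (String fragment : extract(rtf)) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```

With this sketch, a group such as {\*\htmltag161 } yields an empty fragment,
because it holds nothing after the control word. Reconstructing the full HTML
would also require the text between the groups and skipping the RTF-only parts.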

Re: How do i get actual html content of attached file

Posted by babug <ba...@gmail.com>.

Hi Nick,

I have created a similar post on Tika forum.

http://apache-tika-users.1629097.n2.nabble.com/How-do-i-get-actual-html-content-of-attached-file-tp7472491p7472491.html

I have tried using POI, but I couldn't get the expected output. I get
chunks of type RTF and BODY. The RtfContent has the following chunk:
{\*\htmltag161 }. Could you please provide some sample code?

Re: How do i get actual html content of attached file

Posted by Nick Burch <ni...@alfresco.com>.

On Mon, 16 Apr 2012, babug wrote:
> I have attached an Outlook template file (Ticket_Diary.oft). I need to
> parse files of this type and get the actual HTML content. I tested with
> the following code, but the parser returns <p> tags instead of <table>
> or <div> tags. How do I bypass the SAFE_ELEMENTS map?

It might not be stored as HTML. Outlook often stores the "html" content of 
emails as RTF.

Also...

> String msgfile = "/home/test/Desktop/EmailParse/Ticket Diary.oft";
> InputStream stream = new FileInputStream(msgfile);
> StringWriter sw = new StringWriter();
> Parser parser = new OfficeParser();
> Metadata metadata = new Metadata();
> ParseContext context = new ParseContext();
> context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

This looks like Tika code. If you want to use Tika for this, you should 
probably ask on the Tika list. Alternatively, you can use HSMF from 
Apache POI to access the file directly and get at exactly the parts of it 
you need. I'd suggest looking at the HSMF text extractor in POI, and 
OutlookExtractor in Apache Tika, as good examples of how to go about 
using HSMF.

Nick
