Posted to user@poi.apache.org by Bing Ran <bi...@gmail.com> on 2014/06/08 19:32:02 UTC

WordToHtmlConverter in xwpf

Hi,

Now I'm looking at some docx files and wondering if there's something
similar to the hwpf WordToHtmlConverter/WordToTextConverter which has
served me very well for extracting text and images for doc files.

Thanks!

Bing

Re: WordToHtmlConverter in xwpf

Posted by Angelo zerr <an...@gmail.com>.
Hi Bing,

The XDocReport converter doesn't handle shapes (I need to update our wiki
to list the converter's limitations).

Contributions are welcome, though!

If you would like to discuss the XDocReport converter, I suggest posting on
the XDocReport forum so that we don't clutter the POI list.

Regards, Angelo



Re: WordToHtmlConverter in xwpf

Posted by Bing Ran <bi...@gmail.com>.
Thanks Angelo.

I gave XDocReport a go and had limited success with a docx file that
contains Microsoft equations.

I believe equations edited with the equation editor are stored as WMF
files. The reference section looks like this:

<xml-fragment w:dxaOrig="1542" w:dyaOrig="300">
  <v:shape id="_x0000_i1027" o:spid="_x0000_i1028" type="#_x0000_t75"
           style="width:77pt;height:15pt;mso-position-horizontal-relative:page;mso-position-vertical-relative:page"
           o:ole="">
    <v:imagedata r:id="rId12" o:title=""/>
  </v:shape>
  <o:OLEObject Type="Embed" ProgID="Equation.DSMT4" ShapeID="_x0000_i1027"
               DrawAspect="Content" ObjectID="_1336171117" r:id="rId13">
    <o:FieldCodes>\* MERGEFORMAT</o:FieldCodes>
  </o:OLEObject>
</xml-fragment>


I subclassed XWPFDocumentVisitor and found that the CTObject instance
derived from the fragment above is not handled by visitRun().

I'm wondering how to retrieve the picture data from the CTObject.

Reading this line: <v:imagedata r:id="rId12" o:title=""/>, I would imagine
that the picture data is stored somewhere with an ID of rId12.


Any help is highly appreciated!


Bing
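
On the rId12 question above: in a docx package, an r:id value is a
relationship ID, and the relationships of the main document part are listed
in word/_rels/document.xml.rels. The sketch below resolves such an ID using
only the JDK, so it is not POI's or XDocReport's implementation. The class
name RelIdLookup, the helper names, and the sample target
media/image1.wmf are all hypothetical. Within POI, the equivalent lookup is
normally done through the document part's relationship API (for example
getRelationById), but that should be checked against the POI version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: resolves a relationship ID of word/document.xml to the bytes it points at.
final class RelIdLookup {
    static final String RELS_PATH = "word/_rels/document.xml.rels";

    /** Returns the bytes of the part that relId points to, or null if it cannot be resolved. */
    static byte[] readPartByRelId(byte[] docx, String relId) throws Exception {
        byte[] rels = readEntry(docx, RELS_PATH);
        if (rels == null) return null;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document xml = factory.newDocumentBuilder().parse(new ByteArrayInputStream(rels));
        NodeList nodes = xml.getElementsByTagNameNS("*", "Relationship");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element rel = (Element) nodes.item(i);
            if (!relId.equals(rel.getAttribute("Id"))) continue;
            // External targets (hyperlinks, linked files) are not stored inside the package.
            if ("External".equals(rel.getAttribute("TargetMode"))) return null;
            String target = rel.getAttribute("Target");
            // Simplification: targets are taken as relative to word/ unless they start
            // with "/". Targets containing "../" are not normalized here.
            String path = target.startsWith("/") ? target.substring(1) : "word/" + target;
            return readEntry(docx, path);
        }
        return null;
    }

    /** Reads one named entry from an in-memory zip, or returns null if it is absent. */
    static byte[] readEntry(byte[] zip, String name) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            for (ZipEntry entry; (entry = in.getNextEntry()) != null; ) {
                if (!entry.getName().equals(name)) continue;
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                for (int n; (n = in.read(chunk)) != -1; ) out.write(chunk, 0, n);
                return out.toByteArray();
            }
        }
        return null;
    }

    /** Builds a tiny stand-in package with one image relationship, for demonstration only. */
    static byte[] sampleDocx() throws IOException {
        String rels = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
            + "<Relationship Id=\"rId12\" Target=\"media/image1.wmf\" Type="
            + "\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/image\"/>"
            + "</Relationships>";
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry(RELS_PATH));
            zip.write(rels.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/media/image1.wmf"));
            zip.write("fake-wmf-bytes".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] image = readPartByRelId(sampleDocx(), "rId12");
        System.out.println(image == null ? "not found" : image.length + " bytes");
    }
}
```

With a real file, the bytes of the docx would be passed in place of
sampleDocx(). In the fragment above, rId13 on the o:OLEObject element would
resolve the same way, presumably to the embedded equation object rather
than to its preview image.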






Re: WordToHtmlConverter in xwpf

Posted by Angelo zerr <an...@gmail.com>.
Hi Bing,

XDocReport provides a docx->xhtml converter based on POI. See
https://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML

@nick I didn't know about Tika. I will try it and perhaps integrate it into
XDocReport if it works well. Thanks for the info.

Regards, Angelo



Re: WordToHtmlConverter in xwpf

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 9 Jun 2014, Bing Ran wrote:
> Now I'm looking at some docx files and wondering if there's something 
> similar to the hwpf WordToHtmlConverter/WordToTextConverter which has 
> served me very well for extracting text and images for doc files.

For plain text, try XWPFWordExtractor. For HTML, try Apache Tika (which
wraps Apache POI).

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org