Posted to dev@poi.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/11/17 18:54:45 UTC

2006 ML format?

All,
  On TIKA-2179 [1], Sean Story submitted a document that appears to be a 2006 ML format .xml file.  It appears to inline the components of a regular docx into a single xml file, no zip.  Is it worth the effort to build a read-only subclass of OPCPackage (say, InlinePackage) that would parallel our ZipPackage?  Or would it be better to handle this purely on the Tika side and rewrite the file as a temporary ZipFile that can be read by our current OPCPackage?
  Thank you.

           Best,

                   Tim
[1] https://issues.apache.org/jira/browse/TIKA-2179
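
For readers unfamiliar with the inline layout, here is a minimal sketch of how the parts of such a file might be enumerated with plain SAX, without any POI classes. It assumes the wrapper elements live in the `http://schemas.microsoft.com/office/2006/xmlPackage` namespace and that each part is a `pkg:part` element carrying `pkg:name` and `pkg:contentType` attributes. The class name `FlatOpcPartLister` is hypothetical and is not part of POI or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: lists the parts inlined into a single-file 2006 ML document.
class FlatOpcPartLister {
    // Assumed namespace of the inline package wrapper elements.
    static final String PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage";

    /** Returns part name -> content type for every pkg:part, in document order. */
    static Map<String, String> listParts(InputStream in) throws Exception {
        final Map<String, String> parts = new LinkedHashMap<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace URI + local name
        factory.newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (PKG_NS.equals(uri) && "part".equals(localName)) {
                    parts.put(atts.getValue(PKG_NS, "name"), atts.getValue(PKG_NS, "contentType"));
                }
            }
        });
        return parts;
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up sample; a real file would hold full part bodies inside pkg:xmlData.
        String xml = "<pkg:package xmlns:pkg='" + PKG_NS + "'>"
                + "<pkg:part pkg:name='/word/document.xml' pkg:contentType='application/xml'>"
                + "<pkg:xmlData/></pkg:part></pkg:package>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, String> e : listParts(in).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The part names mirror the entry names a zipped docx would have, which is why either option above (an InlinePackage or a temporary zip) looks plausible on paper. For untrusted input the parser should also be hardened against external entities.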

RE: 2006 ML format?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> Without looking, can we use that code to read and modify it to allow writing a 2006ML document as a single XML document?

Unfortunately, no.  The goal was a read-only parser that avoids building the full DOM.

Adding the read/write capability would require a fair amount of work, I think; I couldn't quickly find a clean way of extending our current OPCPackage to handle this format.  The challenge was that OPCPackage is tightly tied to ZipPackage, and without some rewiring (perhaps add an AbstractOPCPackage?), it won't be a straightforward addition.

-----Original Message-----
From: Murphy, Mark [mailto:murphymdev@metalexmfg.com] 
Sent: Wednesday, November 23, 2016 4:22 PM
To: 'POI Developers List' <de...@poi.apache.org>
Subject: RE: 2006 ML format?

Without looking, can we use that code to read and modify it to allow writing a 2006ML document as a single XML document? I have no opinion on the read only parser.

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Wednesday, November 23, 2016 2:38 PM
To: POI Developers List <de...@poi.apache.org>
Subject: RE: 2006 ML format?

All,
  I went it alone for the 2006ml format on Tika, see details [1].  If you have any feedback on that bit of code, I'd appreciate it!
 
Major questions:
1) Do we want to move some/most of that into POI for 2006ml?
2) Do we want to offer a streaming read-only XWPF parser based on that code for the regular docx?

Cheers,

         Tim

[1] https://issues.apache.org/jira/browse/TIKA-2179?focusedCommentId=15691150&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15691150

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Monday, November 21, 2016 7:14 AM
To: POI Developers List <de...@poi.apache.org>
Subject: RE: 2006 ML format?

Y, I experimented with adding an InlineOPCPackage; I couldn't quite get it to work, and even if I did, it makes a mess of our OPCPackage and ZipPackage.

I'm thinking I might use this as a reason to build a beanless SXWPF read-only SAX parser.  I suspect that we could very easily re-use whatever I develop for this format on the "modern" ooxml...suspicions have been wrong before...only code and unit tests will tell. :)


-----Original Message-----
From: Mark Murphy [mailto:jmarkmurphy@gmail.com]
Sent: Saturday, November 19, 2016 5:19 PM
To: POI Developers List <de...@poi.apache.org>
Subject: Re: 2006 ML format?

Wow, this is nothing like what I thought it would be. I discovered that you can write a document in this format by selecting save as xml document.

On Fri, Nov 18, 2016 at 7:03 AM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> Thank you, Javen.  I worry that I'll be adding duct tape to 
> OPCPackage, but let me put together a patch and we can decide if 
> adding an InlinePackage is too Frankenstein-y for POI.
>
> -----Original Message-----
> From: Javen O'Neal [mailto:javenoneal@gmail.com]
> Sent: Thursday, November 17, 2016 5:58 PM
> To: POI Developers List <de...@poi.apache.org>
> Subject: Re: 2006 ML format?
>
> This would probably be of interest to users of POI who are not 
> necessarily using Tika.
>
> If someone spends the effort to add support for a Microsoft Office 
> format, POI seems like a better host.
>
> On Nov 17, 2016 10:55 AM, "Allison, Timothy B." <ta...@mitre.org>
> wrote:
>
> All,
>   On TIKA-2179 [1], Sean Story submitted a document that appears to be a
> 2006 ML format .xml file.  It appears to inline the components of a
> regular docx into a single xml file, no zip.  Is it worth the effort
> to build a read-only subclass of OPCPackage (say, InlinePackage) that
> would parallel our ZipPackage?  Or would it be better to handle this
> purely on the Tika side and rewrite the file as a temporary ZipFile
> that can be read by our current OPCPackage?
>   Thank you.
>
>            Best,
>
>                    Tim
> [1] https://issues.apache.org/jira/browse/TIKA-2179
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: 2006 ML format?

Posted by "Murphy, Mark" <mu...@metalexmfg.com>.
Without looking, can we use that code to read and modify it to allow writing a 2006ML document as a single XML document? I have no opinion on the read only parser.


RE: 2006 ML format?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
All,
  I went it alone for the 2006ml format on Tika, see details [1].  If you have any feedback on that bit of code, I'd appreciate it!
 
Major questions:
1) Do we want to move some/most of that into POI for 2006ml?
2) Do we want to offer a streaming read-only XWPF parser based on that code for the regular docx?

Cheers,

         Tim

[1] https://issues.apache.org/jira/browse/TIKA-2179?focusedCommentId=15691150&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15691150


RE: 2006 ML format?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Y, I experimented with adding an InlineOPCPackage; I couldn't quite get it to work, and even if I did, it makes a mess of our OPCPackage and ZipPackage.

I'm thinking I might use this as a reason to build a beanless SXWPF read-only SAX parser.  I suspect that we could very easily re-use whatever I develop for this format on the "modern" ooxml...suspicions have been wrong before...only code and unit tests will tell. :)
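
As a rough illustration of the streaming, read-only idea (not the code attached to TIKA-2179), the sketch below pulls paragraph text out of an inline 2006 ML file with a single SAX pass and no beans or DOM. It assumes the main document is the `pkg:part` named `/word/document.xml`, which is a simplification: a fuller parser would presumably resolve the main part through the package relationships. The class name `InlineDocxTextSketch` is hypothetical.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: streaming text extraction from a single-file 2006 ML document.
class InlineDocxTextSketch {
    // Assumed namespace of the inline package wrapper elements.
    static final String PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage";
    // WordprocessingML main namespace (w:p, w:r, w:t).
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Assumption: the main document part has this conventional name.
    static final String MAIN_PART = "/word/document.xml";

    /** Returns the text of the main document part, one line per w:p paragraph. */
    static String extractText(InputStream in) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(in, new DefaultHandler() {
            private boolean inMainPart = false; // inside the main document part
            private boolean inText = false;     // inside a w:t element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (PKG_NS.equals(uri) && "part".equals(local)) {
                    inMainPart = MAIN_PART.equals(atts.getValue(PKG_NS, "name"));
                } else if (inMainPart && W_NS.equals(uri) && "t".equals(local)) {
                    inText = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    inText = false;
                } else if (inMainPart && W_NS.equals(uri) && "p".equals(local)) {
                    out.append('\n'); // end of paragraph
                } else if (PKG_NS.equals(uri) && "part".equals(local)) {
                    inMainPart = false;
                }
            }
        });
        return out.toString();
    }
}
```

Because the handler only matches on element names, the same `w:p`/`w:t` logic could in principle be pointed at the `word/document.xml` entry of a regular zipped docx, which is the reuse hoped for above. Tables, headers, footnotes, and run properties are ignored here.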



Re: 2006 ML format?

Posted by Mark Murphy <jm...@gmail.com>.
Wow, this is nothing like what I thought it would be. I discovered that you
can write a document in this format by selecting save as xml document.


RE: 2006 ML format?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Javen.  I worry that I'll be adding duct tape to OPCPackage, but let me put together a patch and we can decide if adding an InlinePackage is too Frankenstein-y for POI.
	

Re: 2006 ML format?

Posted by Javen O'Neal <ja...@gmail.com>.
This would probably be of interest to users of POI who are not necessarily
using Tika.

If someone spends the effort to add support for a Microsoft Office format,
POI seems like a better host.
