Posted to dev@poi.apache.org by bu...@apache.org on 2016/12/26 23:05:44 UTC

[Bug 60519] Extractor for *SSF embeddings

https://bz.apache.org/bugzilla/show_bug.cgi?id=60519

--- Comment #1 from Andreas Beeker <ki...@apache.org> ---
The test data for EMF with embedded PDF can be found under
https://people.apache.org/~kiwiwings/Basic_Expense_Template_2011.xls

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: [Bug 60519] Extractor for *SSF embeddings

Posted by Javen O'Neal <on...@apache.org>.
The Windows Enhanced Metafile format isn't specific to Microsoft Office
documents, so it probably doesn't belong in POI.
Even if we advertise this as rudimentary support only, it would likely
generate bug reports for POI, detracting from the time we spend on reading
and writing Office documents.

By the same token, POI doesn't maintain code to work with BMP or SVG
content, since there are other libraries that can work with these.

My vote is for a different project to support EMF files. This has the added
benefit of making EMF support available to other Java applications without
adding POI as a dependency.

On Jan 4, 2017 11:13 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

> Andi,
>   I like what you've done with the patch for this issue.
>
> All,
>   Is it worthwhile adding a rudimentary EMF parser to POI?  It might help
> us explore what other "full docs" are stuffed inside EMF like the PDFs that
> you found.  I hacked out a version for Tika (locally), but I think this
> would be better in POI.  WDYT?
>
>      Cheers,
>
>              Tim
>
> -----Original Message-----
> From: bugzilla@apache.org [mailto:bugzilla@apache.org]
> Sent: Monday, December 26, 2016 6:06 PM
> To: dev@poi.apache.org
> Subject: [Bug 60519] Extractor for *SSF embeddings
>
> https://bz.apache.org/bugzilla/show_bug.cgi?id=60519
>
> --- Comment #1 from Andreas Beeker <ki...@apache.org> --- The test
> data for EMF with embedded PDF can be found under
> https://people.apache.org/~kiwiwings/Basic_Expense_Template_2011.xls

Re: [Bug 60519] Extractor for *SSF embeddings

Posted by Javen O'Neal <on...@apache.org>.
Sounds like the need is more urgent than the timeline for spawning an
incubator project.

In that case, pick whichever project (Tika or POI) will make it
easiest to minimize the package, class, constructor, argument, and
return value dependencies on the rest of the foster parent project so
that it's easier to move to the incubator later.

If POI is the better fit, drop it in scratchpad and add an @Internal
annotation so that you're free to make breaking changes at any time.

On Thu, Jan 5, 2017 at 5:00 AM, Allison, Timothy B. <ta...@mitre.org> wrote:
> Thank you Andi and Javen!
>
> Javen,
>
>   I respect your point about "not limited to MSOffice documents".  My selfish/Tika-ish goal in processing them, frankly, is only to extract embedded documents and their metadata.  Andi's patch demonstrated the need to handle the "feature" distinction between how Mac xls and Windows xls handle embedded PDF files -- in Windows, the PDF is available as a standalone embedded file, with an EMF to represent the icon; in Mac, the EMF contains the original PDF (and graphics to represent the icon?).  In short, over on Tika, we're currently not extracting the PDF from the Mac xls, but we are from the Windows xls.
>
>   So, yes, a robust read/write EMF parser/writer would make sense as a standalone project in the incubator.  However, I don't have the energy/time to do much more than read-only for this one very small problem.  POI's scratchpad or Tika are the two immediate targets that I could easily contribute to.  If there's a need and someone has the time, we could move whatever code there is for this one small task into a future incubator project.
>
>   I also respect your point about inviting bug reports that would distract us from focusing on MSOffice documents.  Sounds like there's loose consensus to put this in Tika for now, and if anyone wants to take it on, move it to incubator?
>
>   Cheers,
>
>                   Tim
>
>
> P.S. As a side note, I suspect there are some interesting metadata items that we can pull out of EMFs... For example, I saw some text content of the PDF in the EMF portion of the mac EMF.  I also saw some original paths for the embedded file in the EMF.
>
> -----Original Message-----
> From: Javen O'Neal [mailto:onealj@apache.org]
> Sent: Wednesday, January 4, 2017 8:05 PM
> To: POI Developers List <de...@poi.apache.org>
> Subject: Re: [Bug 60519] Extractor for *SSF embeddings
>
> What about an Apache incubator project for reading and writing EMF(+) files?
>
> On Jan 4, 2017 2:53 PM, "Andreas Beeker" <ki...@apache.org> wrote:
>
>> Hi Tim,
>>
>> every now and then I play with the idea to provide an EMF parser like
>> the WMF parser, to render images inside slideshows. This could be of
>> course used to extract other content too.
>> The simplest way would be to adapt the FreeHep library, but it's GPL
>> licensed ... :(
>>
>> So for extracting embedded content, I guess it's not so difficult to
>> generically parse the emf(+) records and only handle the interesting ones.
>> This limited functionality should be in scratchpad or the example classes.
>> If it is not a huge code chunk, it could be in the Extractor class -
>> otherwise I would like to see it in Tika ...
>>
>> Andi


RE: [Bug 60519] Extractor for *SSF embeddings

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you Andi and Javen!

Javen,

  I respect your point about "not limited to MSOffice documents".  My selfish/Tika-ish goal in processing them, frankly, is only to extract embedded documents and their metadata.  Andi's patch demonstrated the need to handle the "feature" distinction between how Mac xls and Windows xls handle embedded PDF files -- in Windows, the PDF is available as a standalone embedded file, with an EMF to represent the icon; in Mac, the EMF contains the original PDF (and graphics to represent the icon?).  In short, over on Tika, we're currently not extracting the PDF from the Mac xls, but we are from the Windows xls.

  So, yes, a robust read/write EMF parser/writer would make sense as a standalone project in the incubator.  However, I don't have the energy/time to do much more than read-only for this one very small problem.  POI's scratchpad or Tika are the two immediate targets that I could easily contribute to.  If there's a need and someone has the time, we could move whatever code there is for this one small task into a future incubator project.

  I also respect your point about inviting bug reports that would distract us from focusing on MSOffice documents.  Sounds like there's loose consensus to put this in Tika for now, and if anyone wants to take it on, move it to incubator?  

  Cheers,

                  Tim


P.S. As a side note, I suspect there are some interesting metadata items that we can pull out of EMFs... For example, I saw some text content of the PDF in the EMF portion of the mac EMF.  I also saw some original paths for the embedded file in the EMF.

-----Original Message-----
From: Javen O'Neal [mailto:onealj@apache.org] 
Sent: Wednesday, January 4, 2017 8:05 PM
To: POI Developers List <de...@poi.apache.org>
Subject: Re: [Bug 60519] Extractor for *SSF embeddings

What about an Apache incubator project for reading and writing EMF(+) files?

On Jan 4, 2017 2:53 PM, "Andreas Beeker" <ki...@apache.org> wrote:

> Hi Tim,
>
> every now and then I play with the idea to provide an EMF parser like 
> the WMF parser, to render images inside slideshows. This could be of 
> course used to extract other content too.
> The simplest way would be to adapt the FreeHep library, but it's GPL
> licensed ... :(
>
> So for extracting embedded content, I guess it's not so difficult to 
> generically parse the emf(+) records and only handle the interesting ones.
> This limited functionality should be in scratchpad or the example classes.
> If it is not a huge code chunk, it could be in the Extractor class - 
> otherwise I would like to see it in Tika ...
>
> Andi

Re: [Bug 60519] Extractor for *SSF embeddings

Posted by Javen O'Neal <on...@apache.org>.
What about an Apache incubator project for reading and writing EMF(+) files?

On Jan 4, 2017 2:53 PM, "Andreas Beeker" <ki...@apache.org> wrote:

> Hi Tim,
>
> every now and then I play with the idea to provide an EMF parser like the
> WMF parser, to render images inside slideshows. This could be of course
> used to extract other content too.
> The simplest way would be to adapt the FreeHep library, but it's GPL
> licensed ... :(
>
> So for extracting embedded content, I guess it's not so difficult to
> generically parse the emf(+) records and only handle the interesting ones.
> This limited functionality should be in scratchpad or the example classes.
> If it is not a huge code chunk, it could be in the Extractor class -
> otherwise I would like to see it in Tika ...
>
> Andi

Re: [Bug 60519] Extractor for *SSF embeddings

Posted by Andreas Beeker <ki...@apache.org>.
Hi Tim,

every now and then I play with the idea to provide an EMF parser like the WMF parser, to render images inside slideshows. This could be of course used to extract other content too.
The simplest way would be to adapt the FreeHep library, but it's GPL licensed ... :(

So for extracting embedded content, I guess it's not so difficult to generically parse the emf(+) records and only handle the interesting ones. This limited functionality should be in scratchpad or the example classes.
If it is not a huge code chunk, it could be in the Extractor class - otherwise I would like to see it in Tika ...

Andi
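
A minimal sketch of the "generically parse the records and only handle the interesting ones" idea above, not code from the patch. It assumes the record layout from the published EMF specification: every record starts with a 4-byte little-endian type and a 4-byte size that includes this header, comment records use type 70 and carry a 4-byte data size followed by private data, and type 14 marks end of file. The class name EmfCommentScanner and the "%PDF-" check are hypothetical. The thread does not say which record holds the Mac PDF, so treating comment payloads as the candidate location is only a guess. The input is expected to be raw, uncompressed EMF bytes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of POI or Tika. */
final class EmfCommentScanner {
    static final int EMR_EOF = 14;      // last record of an EMF stream
    static final int EMR_COMMENT = 70;  // record carrying arbitrary private data

    /** Walks the record list and returns the private data of every comment record. */
    static List<byte[]> commentPayloads(byte[] emf) {
        List<byte[]> payloads = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        while (pos + 8 <= emf.length) {
            int type = buf.getInt(pos);
            long size = buf.getInt(pos + 4) & 0xFFFFFFFFL;
            // The size covers the 8-byte header; stop on truncated or corrupt records.
            if (size < 8 || size > emf.length - pos) {
                break;
            }
            if (type == EMR_COMMENT && size >= 12) {
                long dataSize = buf.getInt(pos + 8) & 0xFFFFFFFFL;
                if (dataSize <= size - 12) {
                    byte[] data = new byte[(int) dataSize];
                    System.arraycopy(emf, pos + 12, data, 0, data.length);
                    payloads.add(data);
                }
            }
            if (type == EMR_EOF) {
                break;
            }
            pos += (int) size; // all other record types are skipped unparsed
        }
        return payloads;
    }

    /** Assumed heuristic: a payload is interesting if it contains a PDF header marker. */
    static boolean looksLikePdf(byte[] payload) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        outer:
        for (int i = 0; i + marker.length <= payload.length; i++) {
            for (int j = 0; j < marker.length; j++) {
                if (payload[i + j] != marker[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: EmfCommentScanner <file.emf>");
            return;
        }
        byte[] emf = Files.readAllBytes(Paths.get(args[0]));
        for (byte[] payload : commentPayloads(emf)) {
            System.out.println(payload.length + " bytes, PDF marker: " + looksLikePdf(payload));
        }
    }
}
```

The sketch takes a byte array and returns plain JDK types, so it has no dependency on either POI or Tika and could live in whichever project ends up hosting it.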


RE: [Bug 60519] Extractor for *SSF embeddings

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Andi,
  I like what you've done with the patch for this issue.  

All,
  Is it worthwhile adding a rudimentary EMF parser to POI?  It might help us explore what other "full docs" are stuffed inside EMF like the PDFs that you found.  I hacked out a version for Tika (locally), but I think this would be better in POI.  WDYT?

     Cheers,

             Tim

-----Original Message-----
From: bugzilla@apache.org [mailto:bugzilla@apache.org] 
Sent: Monday, December 26, 2016 6:06 PM
To: dev@poi.apache.org
Subject: [Bug 60519] Extractor for *SSF embeddings

https://bz.apache.org/bugzilla/show_bug.cgi?id=60519

--- Comment #1 from Andreas Beeker <ki...@apache.org> --- The test data for EMF with embedded PDF can be found under https://people.apache.org/~kiwiwings/Basic_Expense_Template_2011.xls
