Posted to user@tika.apache.org by Jan Høydahl / Cominvent <ja...@cominvent.com> on 2010/09/30 20:37:10 UTC

Compressed RTF / TNEF / LZFU

Hi,

Is there support for Microsoft's custom TNEF/LZFU compression format?
It is used to compress RTF files, and is also infamously used for the WINMAIL.DAT files sometimes found as email attachments.

Here's an open source Java implementation of a decompressor:
http://www.freeutils.net/source/jtnef/rtfcompressed.jsp

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


Re: Compressed RTF / TNEF / LZFU

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
> On Fri, Oct 1, 2010 at 11:49 AM, Jan Høydahl / Cominvent
> <ja...@cominvent.com> wrote:
>> * What is the correct mimetype? tika-mimetypes.xml lists application/vnd.ms-tnef
>>  I see references to application/ms-tnef other places, should we support both?
> 
> It looks like application/vnd.ms-tnef is the official type [1], but it
> would probably be a good idea to add ../ms-tnef as an alias. Can you
> file an improvement request for that?

TIKA-523
and http://github.com/jukka/jtnef/issues/issue/1

>> * Could we legally include with Tika a maven target or script which downloads
>> 3rd party jars? That would benefit developers (broader distribution) as well as
>> the Tika community (better file format support).
> It would of course be legal to do so (i.e. we wouldn't be going to
> jail for that ;-), but Apache policies (see [2], most notably [3])
> put some limits on what an official Apache release can include. The
> reason for those policies is to make it easy to include Apache code
> also in commercial products, which I think is a Good Thing (TM).


I'm not thinking of linking against the GPL plugins, but of helping users
find them while requiring explicit action to download and use them.

One such way could be to include a file PLUGINS-README.TXT in which
we could list all 3rd party plugins and how to obtain them. I think this
is more visible than simply a Wiki entry. We could then quickly expand
the number of file formats supported, and slowly re-implement each
of them in Apache clothes.

In short, we as developers care a lot about licenses, but end users very
often care more about features and are more than happy to use GPL plugins.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


Re: Compressed RTF / TNEF / LZFU

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Oct 1, 2010 at 11:49 AM, Jan Høydahl / Cominvent
<ja...@cominvent.com> wrote:
> * What is the correct mimetype? tika-mimetypes.xml lists application/vnd.ms-tnef
>  I see references to application/ms-tnef other places, should we support both?

It looks like application/vnd.ms-tnef is the official type [1], but it
would probably be a good idea to add ../ms-tnef as an alias. Can you
file an improvement request for that?
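
To illustrate what such an alias amounts to, here is a minimal sketch of
alias resolution. It is not Tika's own registry code; the class
MediaTypeAliases and its methods are hypothetical names used only for
this example. Only the two type names come from the discussion above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
final class MediaTypeAliases {

    // Maps each registered alias to its canonical media type.
    private final Map<String, String> aliasToCanonical = new HashMap<>();

    /** Registers alias as another name for canonical. */
    void addAlias(String canonical, String alias) {
        aliasToCanonical.put(normalize(alias), normalize(canonical));
    }

    /** Returns the canonical type for a known alias, else the normalized input. */
    String canonicalize(String type) {
        String key = normalize(type);
        return aliasToCanonical.getOrDefault(key, key);
    }

    // Media type names are case-insensitive, so compare in lower case.
    private static String normalize(String type) {
        return type.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        MediaTypeAliases types = new MediaTypeAliases();
        types.addAlias("application/vnd.ms-tnef", "application/ms-tnef");
        // prints "application/vnd.ms-tnef"
        System.out.println(types.canonicalize("application/ms-tnef"));
    }
}
```

With a lookup like this, a parser registered for the official type would
also be selected for documents labelled with the unofficial one.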

> * The 3rd party developers do not necessarily accept contributions
> Rather than forking, could an alternative be to build a "glue" jar of
> the Tika files only?

I already contacted support@freeutils.net and offered my changes for
inclusion in the upstream codebase. Having a separate glue jar for
just a single class seems a bit wasteful.

> * Could we legally include with Tika a maven target or script which downloads
> 3rd party jars? That would benefit developers (broader distribution) as well as
> the Tika community (better file format support).

It would of course be legal to do so (i.e. we wouldn't be going to
jail for that ;-), but Apache policies (see [2], most notably [3])
put some limits on what an official Apache release can include. The
reason for those policies is to make it easy to include Apache code
also in commercial products, which I think is a Good Thing (TM).

There's nothing stopping anyone from creating such an external Tika
distribution that also contains dependencies under the GPL and other
troublesome licenses. We'd even be happy to link to such efforts from
the Tika web site, but it should still be a clearly separate effort to
avoid confusing the licensing status of the official Tika releases.

Anyway, the best long-term way forward would IMHO be to follow Nick's
suggestion to implement this feature directly in POI.

[1] http://www.iana.org/assignments/media-types/application/vnd.ms-tnef
[2] http://www.apache.org/legal/resolved.html
[3] http://www.apache.org/legal/resolved.html#criteria

BR,

Jukka Zitting

Re: Compressed RTF / TNEF / LZFU

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Wow, that was quick :)

Agree that your solution is a safer one license-wise, and deployment simply by dropping in the jar is as simple as it gets.

Questions:
* What is the correct mimetype? tika-mimetypes.xml lists application/vnd.ms-tnef
  I see references to application/ms-tnef other places, should we support both?
* The 3rd party developers do not necessarily accept contributions
  Rather than forking, could an alternative be to build a "glue" jar of the Tika files only?
* Could we legally include with Tika a maven target or script which downloads 3rd party jars? 
  That would benefit developers (broader distribution) as well as the Tika community (better file format support).

I started by creating a wiki page to list known 3rd party parser plugins.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 30. sep. 2010, at 22.33, Jukka Zitting wrote:

> Hi,
> 
> On Thu, Sep 30, 2010 at 9:43 PM, Nick Burch <ni...@alfresco.com> wrote:
>> On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote:
>>> We could implement the decoder without distributing tnef.jar, using
>>> Class.forName() and simply disabling the decoder if the jar is not on
>>> classpath? Then it is up to the user to download the jar and thereby accept
>>> the GPL license.
>> 
>> I've got a feeling that that may still be too closely linked to be allowed,
>> but hopefully someone else can point us to the answer.
> 
> You're right. The copyleft effects would start to creep in as soon as
> we write Class.forName("net.freeutils.tnef....").
> 
>> There's nothing stopping you writing a GPL licensed parser with a hard
>> dependency, and including it yourself though.
> 
> Creating an example of how to do this has long been on my TODO list,
> and since jtnef is such a simple library I decided to use it for the
> example. See [1] for my fork of the latest jtnef sources, and [2] for
> the commit where I added Tika support to it. Thanks to TIKA-317 [3],
> you can add the resulting jtnef jar to your classpath, and Tika will
> automatically pick it up for parsing any application/x-tnef documents.
> 
> [1] http://github.com/jukka/jtnef
> [2] http://github.com/jukka/jtnef/commit/a9a51982165101c0bdda4cb5266d7f8958c271ef
> [3] https://issues.apache.org/jira/browse/TIKA-317
> 
> BR,
> 
> Jukka Zitting


Re: Compressed RTF / TNEF / LZFU

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Sep 30, 2010 at 9:43 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote:
>> We could implement the decoder without distributing tnef.jar, using
>> Class.forName() and simply disabling the decoder if the jar is not on
>> classpath? Then it is up to the user to download the jar and thereby accept
>> the GPL license.
>
> I've got a feeling that that may still be too closely linked to be allowed,
> but hopefully someone else can point us to the answer.

You're right. The copyleft effects would start to creep in as soon as
we write Class.forName("net.freeutils.tnef....").

> There's nothing stopping you writing a GPL licensed parser with a hard
> dependency, and including it yourself though.

Creating an example of how to do this has long been on my TODO list,
and since jtnef is such a simple library I decided to use it for the
example. See [1] for my fork of the latest jtnef sources, and [2] for
the commit where I added Tika support to it. Thanks to TIKA-317 [3],
you can add the resulting jtnef jar to your classpath, and Tika will
automatically pick it up for parsing any application/x-tnef documents.

[1] http://github.com/jukka/jtnef
[2] http://github.com/jukka/jtnef/commit/a9a51982165101c0bdda4cb5266d7f8958c271ef
[3] https://issues.apache.org/jira/browse/TIKA-317
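
For readers wondering how a jar can be picked up just by being on the
classpath, here is a rough sketch. It assumes the pickup relies on the
standard Java service-provider lookup (a META-INF/services entry inside
the jar); see [3] for what Tika actually does. The PluginParser interface
and PluginDiscovery class are hypothetical stand-ins, not Tika classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical plugin interface standing in for a real parser API.
interface PluginParser {
    boolean supports(String mediaType);
}

// Hypothetical discovery helper for illustration only.
final class PluginDiscovery {

    /**
     * Collects every PluginParser implementation that some jar on the
     * classpath declares in a META-INF/services/PluginParser file.
     */
    static List<PluginParser> discover() {
        List<PluginParser> found = new ArrayList<>();
        for (PluginParser parser : ServiceLoader.load(PluginParser.class)) {
            found.add(parser);
        }
        return found;
    }

    public static void main(String[] args) {
        // With no provider jars present, the list is simply empty.
        System.out.println(discover().size() + " plugin parser(s) found");
    }
}
```

The core library only names its own interface here, never a class from
the plugin, which is why the plugin jar can stay a separate download.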

BR,

Jukka Zitting

Re: Compressed RTF / TNEF / LZFU

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote:
> We could implement the decoder without distributing tnef.jar, using 
> Class.forName() and simply disabling the decoder if the jar is not on 
> classpath? Then it is up to the user to download the jar and thereby 
> accept the GPL license.

I've got a feeling that that may still be too closely linked to be 
allowed, but hopefully someone else can point us to the answer.

There's nothing stopping you writing a GPL licensed parser with a hard 
dependency, and including it yourself though. However, if you want an 
Apache Licensed version so that it can be included directly in the 
official distribution of Tika, extending the HDGF not-quite-LZW decoder is 
probably the best way.

> OR we could ask the copyright-holder to explicitly license the package 
> to the ASF for use in Tika only. He says in README: "For non-GPL 
> commercial licensing please contact the address below."

I doubt they'd go for it, but you could always ask. The issue is that by 
making an ASL licensed version, anyone could then use that version under 
the Apache license, and include it in their own non-GPL projects, which 
presumably the original author doesn't want to happen. (There's no such 
thing as Apache Licensed for one project only, either the software is 
Apache licensed or it isn't)

Nick

Re: Compressed RTF / TNEF / LZFU

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
We could implement the decoder without distributing tnef.jar, using Class.forName() and simply disabling the decoder if the jar is not on classpath? Then it is up to the user to download the jar and thereby accept the GPL license.

OR we could ask the copyright-holder to explicitly license the package to the ASF for use in Tika only. He says in README: "For non-GPL commercial licensing please contact the address below."
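
The Class.forName() idea could look roughly like the sketch below. The
class name in DECODER_CLASS is a made-up placeholder, not the real jtnef
class, and OptionalDecoderProbe is a hypothetical helper. The sketch only
shows the detection mechanism; the replies in this thread argue that it
does not settle the licensing question.

```java
import java.util.Optional;

// Hypothetical helper for illustration only.
final class OptionalDecoderProbe {

    // Placeholder class name (an assumption); substitute the real decoder class.
    static final String DECODER_CLASS = "com.example.thirdparty.TnefDecoder";

    /** Returns the named class if it is on the classpath, otherwise empty. */
    static Optional<Class<?>> find(String className) {
        try {
            // initialize=false: only test for presence, run no static initializers.
            return Optional.of(Class.forName(className, false,
                    OptionalDecoderProbe.class.getClassLoader()));
        } catch (ClassNotFoundException | LinkageError e) {
            // Jar missing or unusable: treat the optional feature as disabled.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        boolean enabled = find(DECODER_CLASS).isPresent();
        System.out.println("TNEF decoder " + (enabled ? "enabled" : "disabled"));
    }
}
```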

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 30. sep. 2010, at 21.14, Nick Burch wrote:

> On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote:
>> Here's an open source Java implementation of a decompressor:
>> http://www.freeutils.net/source/jtnef/rtfcompressed.jsp
> 
> Alas that's under the GPL, so can't be used in an official distribution of Tika. (You can use it yourself if you want though, but that would mean that your resultant program would be GPL'd too due to the viral nature of the license)
> 
> However, it looks from a quick glance at the docs on that site that it's very similar to Visio compression.
> 
> If you're interested in working on this, I'd suggest we switch over to the POI dev list. I suspect that with the HDGF decompression code, and the documentation on the compression on freeutils, it shouldn't be too much work to implement a decoder.
> 
> Nick


Re: Compressed RTF / TNEF / LZFU

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote:
> Here's an open source Java implementation of a decompressor:
> http://www.freeutils.net/source/jtnef/rtfcompressed.jsp

Alas that's under the GPL, so can't be used in an official distribution of 
Tika. (You can use it yourself if you want though, but that would mean 
that your resultant program would be GPL'd too due to the viral nature of 
the license)

However, it looks from a quick glance at the docs on that site that it's 
very similar to Visio compression.

If you're interested in working on this, I'd suggest we switch over to the 
POI dev list. I suspect that with the HDGF decompression code, and the 
documentation on the compression on freeutils, it shouldn't be too much 
work to implement a decoder.

Nick