Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2008/12/09 11:19:02 UTC

Aperture is available under the BSD

Hi,

The Aperture project (http://aperture.sourceforge.net/) has relicensed
all their code to the BSD license, see
http://sourceforge.net/forum/forum.php?forum_id=891966.

They probably have some code that we could reuse, and perhaps we also
have some valuable bits to contribute to them. The BSD license is
better in line with the Apache License than the OSL 3.0 they used
before, so we're in a much better position for reusing their code now.

BR,

Jukka Zitting

Re: Aperture is available under the BSD

Posted by Grant Ingersoll <gs...@apache.org>.
I think I would let things shake out a little bit with the change to a  
new license.  IANAL, but I think I would at least wait for a  
release.   It does seem to make sense, though.

Personally, though, I really like Tika's SAX model for extraction and  
the, um, lack of RDF.

2 more cents...

Grant

On Dec 9, 2008, at 6:24 AM, Stephane Bastian wrote:

> This is definitely good news. Besides very good parsers, Aperture
> also has strong support for mime type detection. I know we also have
> support for detecting mime types, but at some point in time we may
> consider using theirs and focus solely on writing parsers?
> One problem, though, is that Aperture's parsers return RDF data, which
> is fine but seems more heavyweight than one would like.
>
> just my 2 cents,
>
> Stephane
>
> Jukka Zitting wrote:
>> Hi,
>>
>> The Aperture project (http://aperture.sourceforge.net/) has  
>> relicensed
>> all their code to the BSD license, see
>> http://sourceforge.net/forum/forum.php?forum_id=891966.
>>
>> They probably have some code that we could reuse, and perhaps we also
>> have some valuable bits to contribute to them. The BSD license is
>> better in line with the Apache License than the OSL 3.0 they used
>> before, so we're in a much better position for reusing their code  
>> now.
>>
>> BR,
>>
>> Jukka Zitting
>>
>


Re: Aperture is available under the BSD

Posted by Jérôme Charron <je...@gmail.com>.
>
> Yep, the mime type detection system in Tika is based on the one developed
> for Nutch primarily by Jerome Charron. Jerome worked on an update to this
> mime system, with the freedesktop.org-style interface, and then I worked to
> clean this up and get the functionality into Tika.

The basic idea was to make it easy to update the Tika mime system simply by
downloading and installing the freedesktop XML mime descriptors.
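
To illustrate the idea of driving detection from freedesktop-style descriptors, here is a minimal sketch that reads `mime-type`/`glob` elements from a shared-mime-info style XML document using only the JDK. The class and method names (`MimeDescriptorSketch`, `loadGlobs`, `typeForName`) are hypothetical, and this is not how Tika itself loads its mime configuration; it only shows why dropping in a new descriptor file can be enough to extend the known types.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: not Tika's loader, just the general technique.
class MimeDescriptorSketch {

    // Reads <mime-type type="..."><glob pattern="..."/></mime-type> entries
    // and returns a map from glob pattern (e.g. "*.pdf") to mime type.
    static Map<String, String> loadGlobs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> globs = new LinkedHashMap<>();
            NodeList types = doc.getElementsByTagName("mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                NodeList patterns = type.getElementsByTagName("glob");
                for (int j = 0; j < patterns.getLength(); j++) {
                    Element glob = (Element) patterns.item(j);
                    globs.put(glob.getAttribute("pattern"), type.getAttribute("type"));
                }
            }
            return globs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read mime descriptors", e);
        }
    }

    // Only handles simple "*.ext" globs; returns null when nothing matches.
    static String typeForName(Map<String, String> globs, String fileName) {
        for (Map.Entry<String, String> entry : globs.entrySet()) {
            String pattern = entry.getKey();
            if (pattern.startsWith("*") && fileName.endsWith(pattern.substring(1))) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

A real descriptor file also carries `magic` rules with priorities and offsets, which this sketch ignores.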

Best Regards

Jérôme


-- 
Jérôme Charron
Directeur Technique @ WebPulse
Tel: +33675742890 <= ** NEW **
eMail : jerome.charron@shopreflex.com
http://blog.shopreflex.com/
http://www.shopreflex.com/
http://www.staragora.com/

Re: Aperture is available under the BSD

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Stephane,

Thanks for your email.

> I didn't know Tika mime type detection was based on freedesktop.org.
> I've also developed a mimeType detection system built on top of
> freedesktop, leveraging the shared-mime-info database to be accurate. Is
> this what you guys have done as well?

Yep, the mime type detection system in Tika is based on the one developed
for Nutch primarily by Jerome Charron. Jerome worked on an update to this
mime system, with the freedesktop.org-style interface, and then I worked to
clean this up and get the functionality into Tika.

> In any case, the point I was trying to make in my previous post was to
> leverage functionality that is available somewhere else as much as
> possible and focus on Tika core features.

Gotcha. My point is: mime detection _is_ one of Tika's core features :)

See: http://wiki.apache.org/incubator/TikaProposal

It's what we proposed as one of the core parts of the system to get Tika
approved as an Apache project and to get it to be more useful to the
community.

> True, mime type detection is important for Tika. However, as you pointed
> out, mime type detection is a project by itself. If the idea of creating
> a commons.xx project for mime detection was floating around earlier, why
> not start an Apache commons.xxx project based on Tika detection
> schema then? Now would be a good time, don't you think?
> It would be a great addition to commons and would free Tika developers
> from maintaining the code base for it.

It was decided that, rather than go the commons-xxx route, we would
maintain the code as part of the core functionality of Tika. There are
developers in Tika (myself included) who see mime detection as going in
natural lockstep with content analysis, and who are therefore very happy
to maintain the mime detection code in Tika.

Thanks!

Cheers,
Chris


> Mattmann, Chris A wrote:
>> Hi Stephane,
>>
>>
>>
>>> This is definitely good news. Besides very good parsers, Aperture also
>>> has strong support for mime type detection. I know we also have support
>>> for detecting mime types, but at some point in time we may consider
>>> using theirs and focus solely on writing parsers?
>>>
>>
>> I would be strongly against this mainly due to the fact that there is almost
>> a 1-to-1 correspondence between having a good mime detection system, and
>> parsing content. Tika has a fairly robust mime system based on
>> freedesktop.org's system and I think there is value in Apache having a good
>> mime detection system (in fact it was discussed, even before Tika's
>> inception, to take the Nutch mime type code and turn it into a commons-*
>> project).
>>
>> Thanks,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: Chris.Mattmann@jpl.nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Disclaimer:  The opinions presented within are my own and do not reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>>
>>
>>
>
>



Re: Aperture is available under the BSD

Posted by Stephane Bastian <st...@gmail.com>.
Hi Chris,

I didn't know Tika mime type detection was based on freedesktop.org. 
I've also developed a mimeType detection system built on top of 
freedesktop, leveraging the shared-mime-info database to be accurate. Is 
this what you guys have done as well?
In any case, the point I was trying to make in my previous post was to 
leverage functionality that is available somewhere else as much as 
possible and focus on Tika core features.
True, mime type detection is important for Tika. However, as you pointed
out, mime type detection is a project by itself. If the idea of creating
a commons.xx project for mime detection was floating around earlier, why
not start an Apache commons.xxx project based on Tika detection
schema then? Now would be a good time, don't you think?
It would be a great addition to commons and would free Tika developers
from maintaining the code base for it.

All the best,

Stephane Bastian

Mattmann, Chris A wrote:
> Hi Stephane,
>
>
>   
>> This is definitely good news. Besides very good parsers, Aperture also
>> has strong support for mime type detection. I know we also have support
>> for detecting mime types, but at some point in time we may consider
>> using theirs and focus solely on writing parsers?
>>     
>
> I would be strongly against this mainly due to the fact that there is almost
> a 1-to-1 correspondence between having a good mime detection system, and
> parsing content. Tika has a fairly robust mime system based on
> freedesktop.org's system and I think there is value in Apache having a good
> mime detection system (in fact it was discussed, even before Tika's
> inception, to take the Nutch mime type code and turn it into a commons-*
> project).
>
> Thanks,
> Chris


Re: Aperture is available under the BSD

Posted by Antoni Myłka <an...@gmail.com>.
Mattmann, Chris A wrote:
> Hi Stephane,
> 
> 
>> This is definitely good news. Besides very good parsers, Aperture also
>> has strong support for mime type detection. I know we also have support
>> for detecting mime types, but at some point in time we may consider
>> using theirs and focus solely on writing parsers?
> 
> I would be strongly against this mainly due to the fact that there is almost
> a 1-to-1 correspondence between having a good mime detection system, and
> parsing content. Tika has a fairly robust mime system based on
> freedesktop.org's system and I think there is value in Apache having a good
> mime detection system (in fact it was discussed, even before Tika's
> inception, to take the Nutch mime type code and turn it into a commons-*
> project).
> 

Mime type identification would be the easiest thing to collaborate on,
since in general the interface is identical (put in a byte array and
return a string with the mime type). Both MimeTypeIdentifiers have been
actively maintained and used in production for years. I guess we could
all benefit from pooling our resources.
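
As a rough illustration of that shared "bytes in, mime type string out" contract, here is a minimal sketch. The interface and class names (`ByteArrayMimeDetector`, `MagicPrefixDetector`) are hypothetical and do not correspond to the actual Tika or Aperture classes; the three built-in magic prefixes are just well-known examples, and `addMagic` mimics the idea of registering patterns at runtime.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical common contract: leading bytes in, mime type string out.
interface ByteArrayMimeDetector {
    String identify(byte[] prefix);
}

// Hypothetical implementation based on fixed magic prefixes at offset 0.
class MagicPrefixDetector implements ByteArrayMimeDetector {

    private static final class Rule {
        final byte[] magic;
        final String mimeType;

        Rule(byte[] magic, String mimeType) {
            this.magic = magic.clone();
            this.mimeType = mimeType;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    MagicPrefixDetector() {
        addMagic("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        addMagic(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
        addMagic(new byte[] {'P', 'K', 3, 4}, "application/zip");
    }

    // New patterns can be registered at runtime.
    void addMagic(byte[] magic, String mimeType) {
        rules.add(new Rule(magic, mimeType));
    }

    @Override
    public String identify(byte[] prefix) {
        for (Rule rule : rules) {
            if (startsWith(prefix, rule.magic)) {
                return rule.mimeType;
            }
        }
        return "application/octet-stream"; // fallback when nothing matches
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With an interface this narrow, either project's identifier could in principle sit behind it, which is what makes side-by-side comparison feasible.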

Clearly Tika's MimeTypes and MimeType classes have a friendlier API, and
Tika allows new patterns to be added at runtime, which Aperture doesn't,
but I wonder if anyone has tried to assess the real advantages or
disadvantages of the Tika vs. Aperture mime type identifiers (number of
recognized types, speed, memory consumption, etc.)?

My dream is a project that maintains a single mime type identification
class, but where for every single identifiable mime type there is a test
document that confirms it. Our mimetypes.xml file lists patterns for 162
mime types, your tika-mimetypes.xml lists patterns for 78 mime types,
but how many we can really recognize is an open question.
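
The "one test document per claimed type" idea could be checked with a harness along these lines. This is only a sketch under assumptions: `DetectionCoverageSketch` and `unconfirmed` are hypothetical names, the detector is passed in as a plain function so that either project's identifier could be plugged in, and the sample documents would in practice come from files rather than an in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical harness: one sample document per claimed mime type.
class DetectionCoverageSketch {

    // Returns the claimed types whose sample was NOT identified as that type,
    // together with what the detector reported instead.
    static List<String> unconfirmed(Map<String, byte[]> samplesByType,
                                    Function<byte[], String> detector) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, byte[]> sample : samplesByType.entrySet()) {
            String detected = detector.apply(sample.getValue());
            if (!sample.getKey().equals(detected)) {
                failures.add(sample.getKey() + " (detected as " + detected + ")");
            }
        }
        return failures;
    }
}
```

An empty result would mean every listed type is backed by a document that is actually recognized, which is the number the pattern counts alone cannot give.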

Apart from that there are three ideas we could explore:
1. An issue of ASCII text files with headers that happen to be magic 
numbers for some other type, http://tinyurl.com/66tabh,
2. specific treatment of text/xml mime type, to detect xml-specific mime 
types (by DTD,XSD,namespace etc.) http://tinyurl.com/6xolsx
3. specific treatment of ZIP mime type to detect zip-specific mime types 
(office 2007, open office, jars etc), without resorting to extensions.
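
For the third idea, a minimal sketch of refining a generic ZIP detection by looking at entry names instead of file extensions might look like this. The class and method names are hypothetical, and the rules are deliberately simplistic assumptions: an OpenDocument package carries its type in a `mimetype` entry, an Office 2007 word-processing file is assumed to contain `[Content_Types].xml` plus the conventional `word/document.xml` part, and a JAR is assumed to contain `META-INF/MANIFEST.MF`. Neither Tika nor Aperture is claimed to work this way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of ZIP-specific mime type refinement.
class ZipContainerSketch {

    static String refine(byte[] zipBytes) {
        boolean contentTypes = false;
        boolean wordPart = false;
        boolean manifest = false;
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("mimetype")) {
                    // OpenDocument packages declare their own type in this entry.
                    ByteArrayOutputStream content = new ByteArrayOutputStream();
                    byte[] buffer = new byte[256];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        content.write(buffer, 0, n);
                    }
                    return new String(content.toByteArray(), StandardCharsets.US_ASCII).trim();
                }
                if (name.equals("[Content_Types].xml")) contentTypes = true;
                if (name.equals("word/document.xml")) wordPart = true;
                if (name.equals("META-INF/MANIFEST.MF")) manifest = true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (contentTypes && wordPart) {
            return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        }
        if (manifest) {
            return "application/java-archive";
        }
        return "application/zip";
    }

    // Helper for trying the sketch: builds a ZIP from (name, content) pairs.
    static byte[] zip(String... namesAndContents) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i + 1 < namesAndContents.length; i += 2) {
                zip.putNextEntry(new ZipEntry(namesAndContents[i]));
                zip.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The same shape would apply to the XML case in idea 2: detect the generic container first, then look one level inside (root element, namespace, DTD) to pick the more specific type.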

None of these managed to gain critical mass within Aperture itself.

I will take a closer look at the Tika MimeTypes class and will get back
to you with something more concrete, but I'd like to know what you
think about this in general.

Antoni Mylka
antoni.mylka@gmail.com


Re: Aperture is available under the BSD

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Stephane,


> This is definitely good news. Besides very good parsers, Aperture also
> has strong support for mime type detection. I know we also have support
> for detecting mime types, but at some point in time we may consider
> using theirs and focus solely on writing parsers?

I would be strongly against this mainly due to the fact that there is almost
a 1-to-1 correspondence between having a good mime detection system, and
parsing content. Tika has a fairly robust mime system based on
freedesktop.org's system and I think there is value in Apache having a good
mime detection system (in fact it was discussed, even before Tika's
inception, to take the Nutch mime type code and turn it into a commons-*
project).

Thanks,
Chris



Re: Aperture is available under the BSD

Posted by Stephane Bastian <st...@gmail.com>.
This is definitely good news. Besides very good parsers, Aperture also
has strong support for mime type detection. I know we also have support
for detecting mime types, but at some point in time we may consider
using theirs and focus solely on writing parsers?
One problem, though, is that Aperture's parsers return RDF data, which is
fine but seems more heavyweight than one would like.

just my 2 cents,

Stephane

Jukka Zitting wrote:
> Hi,
>
> The Aperture project (http://aperture.sourceforge.net/) has relicensed
> all their code to the BSD license, see
> http://sourceforge.net/forum/forum.php?forum_id=891966.
>
> They probably have some code that we could reuse, and perhaps we also
> have some valuable bits to contribute to them. The BSD license is
> better in line with the Apache License than the OSL 3.0 they used
> before, so we're in a much better position for reusing their code now.
>
> BR,
>
> Jukka Zitting
>