Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2008/12/09 11:19:02 UTC
Aperture is available under the BSD
Hi,
The Aperture project (http://aperture.sourceforge.net/) has relicensed
all their code to the BSD license, see
http://sourceforge.net/forum/forum.php?forum_id=891966.
They probably have some code that we could reuse, and perhaps we also
have some valuable bits to contribute to them. The BSD license is
better in line with the Apache License than the OSL 3.0 they used
before, so we're in a much better position for reusing their code now.
BR,
Jukka Zitting
Re: Aperture is available under the BSD
Posted by Grant Ingersoll <gs...@apache.org>.
I think I would let things shake out a little bit with the change to a
new license. IANAL, but I think I would at least wait for a
release. It does seem to make sense, though.
Personally, though, I really like Tika's SAX model for extraction and
the, um, lack of RDF.
2 more cents...
Grant
On Dec 9, 2008, at 6:24 AM, Stephane Bastian wrote:
> This is definitely good news. Besides very good parsers, Aperture
> also has strong support for mime types. I know we also have support
> for detecting mime types, but at some point in time we may consider
> using theirs and focus solely on writing parsers?
> One problem, though, is that Aperture's parsers return RDF data, which
> is fine but seems to be more heavyweight than one would like.
>
> just my 2 cents,
>
> Stephane
Re: Aperture is available under the BSD
Posted by Jérôme Charron <je...@gmail.com>.
>
> Yep, the mime type detection system in Tika is based on the one developed
> for Nutch primarily by Jerome Charron. Jerome worked on an update to this
> mime system, with the freedesktop.org-style interface, and then I worked to
> clean this up and get the functionality into Tika.
The basic idea was to easily update the Tika mime system simply by
downloading and installing the freedesktop XML mime descriptors.
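
As a rough illustration of that idea (a sketch only, not Tika's actual loader), the snippet below reads glob patterns from a freedesktop-style descriptor using only the JDK's XML parser. The class name MimeDescriptorLoader and the method loadGlobs are hypothetical, and the inline XML is a hand-written fragment rather than the real shared-mime-info database.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Tika or Aperture.
final class MimeDescriptorLoader {

    /** Maps each glob pattern in the descriptor to its mime type. */
    static Map<String, String> loadGlobs(byte[] xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
            Map<String, String> globToType = new LinkedHashMap<>();
            // "*" matches elements regardless of their namespace.
            NodeList types = doc.getElementsByTagNameNS("*", "mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                NodeList globs = type.getElementsByTagNameNS("*", "glob");
                for (int j = 0; j < globs.getLength(); j++) {
                    Element glob = (Element) globs.item(j);
                    globToType.put(glob.getAttribute("pattern"), type.getAttribute("type"));
                }
            }
            return globToType;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable mime descriptor", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample descriptor, assumed to follow the freedesktop layout.
        String xml =
            "<mime-info xmlns=\"http://www.freedesktop.org/standards/shared-mime-info\">"
            + "<mime-type type=\"application/pdf\"><glob pattern=\"*.pdf\"/></mime-type>"
            + "<mime-type type=\"image/png\"><glob pattern=\"*.png\"/></mime-type>"
            + "</mime-info>";
        System.out.println(loadGlobs(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Swapping in a newer descriptor file would then only mean feeding different bytes to the loader, which is the kind of update path described above.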
Best Regards
Jérôme
--
Jérôme Charron
Directeur Technique @ WebPulse
Tel: +33675742890 <= ** NEW **
eMail : jerome.charron@shopreflex.com
http://blog.shopreflex.com/
http://www.shopreflex.com/
http://www.staragora.com/
Re: Aperture is available under the BSD
Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Stephane,
Thanks for your email.
> I didn't know Tika mime type detection was based on freedesktop.org.
> I've also developed a mimeType detection system built on top of
> freedesktop, leveraging the shared-mime-info database to be accurate. Is
> this what you guys have done as well?
Yep, the mime type detection system in Tika is based on the one developed
for Nutch primarily by Jerome Charron. Jerome worked on an update to this
mime system, with the freedesktop.org-style interface, and then I worked to
clean this up and get the functionality into Tika.
> In any case, the point I was trying to make in my previous post was to
> leverage functionality that is available somewhere else as much as
> possible and focus on Tika core features.
Gotcha. My point is: mime detection _is_ one of Tika's core features :)
See: http://wiki.apache.org/incubator/TikaProposal
It's what we proposed as one of the core parts of the system to get Tika
approved as an Apache project and to get it to be more useful to the
community.
> True, mime type detection is important for Tika. However, as you pointed
> out, mime type detection is a project by itself. If the idea of creating
> a commons.xx project for mime detection was floating around earlier, why
> not start an Apache commons.xxx project based on Tika's detection
> schema then? Now would be a good time, don't you think?
> It would be a great addition to commons and would free Tika developers
> from maintaining the code base for it.
It was decided that, rather than going the commons-xxx route, we would
maintain the code as part of the core functionality of Tika. There are
developers in Tika (myself included) who see mime detection as moving in
natural lockstep with content analysis, and who are therefore very happy
to maintain the mime detection code in Tika.
Thanks!
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
Re: Aperture is available under the BSD
Posted by Stephane Bastian <st...@gmail.com>.
Hi Chris,
I didn't know Tika mime type detection was based on freedesktop.org.
I've also developed a mimeType detection system built on top of
freedesktop, leveraging the shared-mime-info database to be accurate. Is
this what you guys have done as well?
In any case, the point I was trying to make in my previous post was to
leverage functionality that is available somewhere else as much as
possible and focus on Tika core features.
True, mime type detection is important for Tika. However, as you pointed
out, mime type detection is a project by itself. If the idea of creating
a commons.xx project for mime detection was floating around earlier, why
not start an Apache commons.xxx project based on Tika's detection
schema then? Now would be a good time, don't you think?
It would be a great addition to commons and would free Tika developers
from maintaining the code base for it.
All the best,
Stephane Bastian
Re: Aperture is available under the BSD
Posted by Antoni Myłka <an...@gmail.com>.
Mattmann, Chris A wrote:
> Hi Stephane,
>
>
>> This is definitely good news. Besides very good parsers, Aperture also
>> has strong support for mime types. I know we also have support for
>> detecting mime types, but at some point in time we may consider using
>> theirs and focus solely on writing parsers?
>
> I would be strongly against this mainly due to the fact that there is almost
> a 1-to-1 correspondence between having a good mime detection system, and
> parsing content. Tika has a fairly robust mime system based on
> freedesktop.org's system and I think there is value in Apache having a good
> mime detection system (in fact it was discussed, even before Tika's
> inception, to take the Nutch mime type code and turn it into a commons-*
> project).
>
Mime type identification would be the easiest thing to collaborate on,
since in general the interface is identical (put in a byte array and
return a string with the mime type). Both MimeTypeIdentifiers have been
actively maintained and used in production for years. I guess we could
all benefit from pooling our resources.
Clearly Tika's MimeTypes and MimeType classes have a friendlier API, and
Tika allows new patterns to be added at runtime, which Aperture doesn't,
but I wonder if anyone has tried to assess the real advantages or
disadvantages of the Tika vs. Aperture mime type identifiers (number of
recognized types, speed, memory consumption, etc.)?
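
To make the shared contract concrete, here is a minimal sketch of a "bytes in, mime type string out" identifier that also accepts new patterns at runtime. MagicIdentifier, register, and identify are hypothetical names chosen for illustration; this is neither Tika's MimeTypes class nor Aperture's MimeTypeIdentifier, and it only does simple prefix matching.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration; neither Tika's nor Aperture's API.
final class MagicIdentifier {

    private static final class Rule {
        final byte[] magic;
        final String mimeType;

        Rule(byte[] magic, String mimeType) {
            this.magic = magic.clone();
            this.mimeType = mimeType;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Adds a magic-number pattern at runtime; earlier rules win on overlap. */
    void register(byte[] magic, String mimeType) {
        rules.add(new Rule(magic, mimeType));
    }

    /** The shared contract: a byte array goes in, a mime type string comes out. */
    String identify(byte[] data) {
        for (Rule rule : rules) {
            if (startsWith(data, rule.magic)) {
                return rule.mimeType;
            }
        }
        return "application/octet-stream"; // fallback when nothing matches
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MagicIdentifier identifier = new MagicIdentifier();
        identifier.register("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        identifier.register(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
        byte[] sample = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(identifier.identify(sample));
    }
}
```

With an interface this small, comparing two implementations on recognized types, speed, or memory mostly comes down to running both over the same corpus of byte arrays.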
My dream is a project that maintains a single mime type identification
class, where for every single identifiable mime type there is a test
document that confirms it. Our mimetypes.xml file lists patterns for 162
mime types, your tika-mimetypes.xml lists patterns for 78 mime types,
but how many we can really recognize is an open question.
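
One way such a check could look, sketched under assumptions: given the set of declared mime types, one sample document per type, and any identifier function, report the types that no document confirms. MimeCoverageCheck and unconfirmed are hypothetical names, and the toy identifier in main stands in for a real one.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical coverage check for illustration; not an existing test suite.
final class MimeCoverageCheck {

    /**
     * Returns the declared types that have no sample document, or whose
     * sample the identifier does not recognize as that type.
     */
    static Set<String> unconfirmed(Set<String> declared,
                                   Map<String, byte[]> samples,
                                   Function<byte[], String> identifier) {
        Set<String> missing = new TreeSet<>();
        for (String type : declared) {
            byte[] sample = samples.get(type);
            if (sample == null || !type.equals(identifier.apply(sample))) {
                missing.add(type);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> declared =
            new TreeSet<>(Arrays.asList("application/pdf", "image/png", "text/plain"));
        Map<String, byte[]> samples = new HashMap<>();
        samples.put("application/pdf", "%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        samples.put("image/png", "not really a png".getBytes(StandardCharsets.US_ASCII));
        // Toy identifier that only knows the PDF magic; an assumption for the demo.
        Function<byte[], String> identifier = data ->
            data.length > 0 && data[0] == '%' ? "application/pdf" : "application/octet-stream";
        System.out.println(unconfirmed(declared, samples, identifier));
    }
}
```

Counting the types left in the returned set would give a measured answer to "how many can we really recognize" instead of the number of patterns listed in the XML file.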
Apart from that there are three ideas we could explore:
1. The issue of ASCII text files with headers that happen to be magic
numbers for some other type, http://tinyurl.com/66tabh,
2. specific treatment of the text/xml mime type, to detect xml-specific mime
types (by DTD, XSD, namespace, etc.), http://tinyurl.com/6xolsx,
3. specific treatment of the ZIP mime type to detect zip-specific mime types
(Office 2007, OpenOffice, jars, etc.), without resorting to extensions.
None of these managed to gain critical mass within Aperture itself.
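
As a sketch of the third idea only, the snippet below refines a generic ZIP detection by looking at entry names instead of the file extension. ZipTypeRefiner and its methods are hypothetical, and the rules are deliberately few: OpenDocument packages carry their media type in an entry named "mimetype", a Word 2007 document contains "word/document.xml", and a jar usually contains "META-INF/MANIFEST.MF". The caller is assumed to have already established that the bytes are a ZIP archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical refiner for illustration; not code from Tika or Aperture.
final class ZipTypeRefiner {

    /** Returns a more specific type for ZIP bytes, or application/zip if unsure. */
    static String refine(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("mimetype")) {
                    // OpenDocument-style packages store the media type as entry content.
                    return new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
                }
                if (name.equals("word/document.xml")) {
                    return "application/vnd.openxmlformats-officedocument"
                        + ".wordprocessingml.document";
                }
                if (name.equals("META-INF/MANIFEST.MF")) {
                    return "application/java-archive";
                }
            }
            return "application/zip";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a one-entry archive in memory so the demo needs no files on disk. */
    static byte[] singleEntryZip(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] odt = singleEntryZip("mimetype", "application/vnd.oasis.opendocument.text");
        byte[] jar = singleEntryZip("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n");
        System.out.println(refine(odt));
        System.out.println(refine(jar));
    }
}
```

A real implementation would need more rules and a limit on how many entries it scans; the point here is only that the container can be inspected without trusting the file name.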
I will take a closer look at the Tika MimeTypes class and will get back
to you with something more concrete, but I'd like to know what you
think about this in general.
Antoni Mylka
antoni.mylka@gmail.com
Re: Aperture is available under the BSD
Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Stephane,
> This is definitely good news. Besides very good parsers, Aperture also
> has strong support for mime types. I know we also have support for
> detecting mime types, but at some point in time we may consider using
> theirs and focus solely on writing parsers?
I would be strongly against this mainly due to the fact that there is almost
a 1-to-1 correspondence between having a good mime detection system, and
parsing content. Tika has a fairly robust mime system based on
freedesktop.org's system and I think there is value in Apache having a good
mime detection system (in fact it was discussed, even before Tika's
inception, to take the Nutch mime type code and turn it into a commons-*
project).
Thanks,
Chris
Re: Aperture is available under the BSD
Posted by Stephane Bastian <st...@gmail.com>.
This is definitely good news. Besides very good parsers, Aperture also
has strong support for mime types. I know we also have support for
detecting mime types, but at some point in time we may consider using
theirs and focus solely on writing parsers?
One problem, though, is that Aperture's parsers return RDF data, which is
fine but seems to be more heavyweight than one would like.
just my 2 cents,
Stephane