Posted to dev@tika.apache.org by Carsten Ziegeler <cz...@apache.org> on 2007/07/10 09:18:33 UTC

Support for document libraries

AFAIK there is currently no central place at Apache where
libraries/frameworks for handling specific document formats are
developed. We have individual projects like POI, of course.

If you are searching for Java libraries which support a specific format,
like some image formats, you'll find many libraries of varying quality
and it's really hard (if not impossible) to choose the right one.

I'm wondering if something could be done about it by starting a project
at Apache which supports various file formats (like images, mp3 etc.) -
perhaps by incubating some existing stuff.

Although Tika is more the framework for plugging in such stuff, it perhaps
makes sense to try to start something like that as sub projects of Tika?

WDYT?

Carsten
-- 
Carsten Ziegeler
cziegeler@apache.org


Re: Support for document libraries

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
Adding document format libraries as subprojects of Tika still "hides"
them somewhat. So this wouldn't really solve the problem of easily
finding such libraries. If new libraries are to be developed, I would
think that a lab or Commons would be better suited.

There has been a lot of talk over the years about creating an image
library inside the ASF, but it has never developed into a real effort.
It's a lot of work, and with ImageIO built into the JDK only exotic
wishes remain open.

If we had a Tika Wiki we could at least list potential existing libraries
and libraries that we'd like but don't exist. We could list licenses,
candidates for incubation, quality/maturity indicators...

Inside the XML Graphics project, we have the following available (if
anyone is interested to know):
* XMP metadata framework in XML Graphics Commons, read/write, work in
progress
* PostScript DSC in XML Graphics Commons, read/write (no PS interpreter!)
* PNG and TIFF codecs in XML Graphics Commons, read/write
* PDF in FOP, write only
* RTF in FOP, write only
* SVG in Batik, read/write

Others:
PDF (PDFBox @SourceForge), read/write, signalled interest for incubation

personal wishlist:
ODF, read/write
Mars, read/write



Jeremias Maerki


Re: Support for document libraries

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 7/10/07, robert burrell donkin <ro...@gmail.com> wrote:
> IMHO it makes sense to start them in tika but possibly commons might
> be a good long term home for some at least. if these really are
> libraries then it would be best to isolate them from the start and
> then add adaption code to tika.

Another potential home would be POI if they are interested in widening
their scope beyond Microsoft formats.

> for example, there is talk of a couple of possible options for
> MIME-type discovery. perhaps it would make sense to factor both
> options as libraries and just have the adapters in tika.

+1

BR,

Jukka Zitting

Re: Support for document libraries

Posted by Carsten Ziegeler <cz...@apache.org>.
robert burrell donkin wrote:
> IMHO it makes sense to start them in tika but possibly commons might
> be a good long term home for some at least. if these really are
> libraries then it would be best to isolate them from the start and
> then add adaption code to tika.
> 
> for example, there is talk of a couple of possible options for
> MIME-type discovery. perhaps it would make sense to factor both
> options as libraries and just have the adapters in tika.
> 
Yes, that definitely makes sense - these libs could be independent of
the "core" and the core must definitely be independent of the libs.
And layering this with adapters is a good idea.

Carsten

-- 
Carsten Ziegeler
cziegeler@apache.org


Re: Support for document libraries

Posted by robert burrell donkin <ro...@gmail.com>.
On 7/10/07, Carsten Ziegeler <cz...@apache.org> wrote:
> Bertrand Delacretaz wrote:
> > On 7/10/07, Carsten Ziegeler <cz...@apache.org> wrote:
> >
> >> ... Although Tika is more the framework for plugging in such stuff,
> >> it perhaps makes sense to try to start something like that as sub
> >> projects of Tika?...
> >
> > I would agree, although IMHO Tika should reuse existing libraries as
> > much as possible.
> >
> Yes, it doesn't make sense to reinvent the wheel if there are
> good-enough libraries out there. But afaik for several formats there
> aren't suitable libs available, so these are the cases where I think
> that it makes sense to "drag them in".

IMHO it makes sense to start them in tika but possibly commons might
be a good long term home for some at least. if these really are
libraries then it would be best to isolate them from the start and
then add adaption code to tika.

for example, there is talk of a couple of possible options for
MIME-type discovery. perhaps it would make sense to factor both
options as libraries and just have the adapters in tika.
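
To make the layering concrete, here is a minimal sketch, not actual Tika
code. Every name in it (MimeDetector, ExtensionLib, MagicLib, the two
adapters, detectFirst) is hypothetical and made up for illustration. The
two "Lib" classes stand in for independent detection libraries that know
nothing about the core; the adapters are the only pieces that see both
sides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class MimeAdapterSketch {

    /** Hypothetical core-side interface; the core knows nothing about concrete libraries. */
    interface MimeDetector {
        Optional<String> detect(String fileName, byte[] header);
    }

    /** Stand-in for an independent library that guesses by file extension. */
    static final class ExtensionLib {
        static String guess(String fileName) {
            String lower = fileName.toLowerCase();
            if (lower.endsWith(".png")) return "image/png";
            if (lower.endsWith(".pdf")) return "application/pdf";
            return null; // unknown extension
        }
    }

    /** Stand-in for a second independent library that looks at leading bytes. */
    static final class MagicLib {
        static String sniff(byte[] header) {
            if (header.length >= 4 && header[0] == '%' && header[1] == 'P'
                    && header[2] == 'D' && header[3] == 'F') {
                return "application/pdf";
            }
            return null; // no known signature
        }
    }

    /** Adapter: translates the extension library's API to the core interface. */
    static final class ExtensionAdapter implements MimeDetector {
        public Optional<String> detect(String fileName, byte[] header) {
            return Optional.ofNullable(ExtensionLib.guess(fileName));
        }
    }

    /** Adapter: translates the magic-byte library's API to the core interface. */
    static final class MagicAdapter implements MimeDetector {
        public Optional<String> detect(String fileName, byte[] header) {
            return Optional.ofNullable(MagicLib.sniff(header));
        }
    }

    /** Core-side helper: ask each adapter in turn and take the first answer. */
    static Optional<String> detectFirst(List<MimeDetector> detectors,
                                        String fileName, byte[] header) {
        for (MimeDetector detector : detectors) {
            Optional<String> result = detector.detect(fileName, header);
            if (result.isPresent()) return result;
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<MimeDetector> detectors =
                Arrays.asList(new MagicAdapter(), new ExtensionAdapter());
        byte[] pdfHeader = {'%', 'P', 'D', 'F', '-'};
        System.out.println(detectFirst(detectors, "report.bin", pdfHeader).orElse("unknown"));
        System.out.println(detectFirst(detectors, "logo.png", new byte[0]).orElse("unknown"));
    }
}
```

With this split, either library could move to another home later and only
its adapter would need to follow the change.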

- robert

Re: Support for document libraries

Posted by Carsten Ziegeler <cz...@apache.org>.
Bertrand Delacretaz wrote:
> On 7/10/07, Carsten Ziegeler <cz...@apache.org> wrote:
> 
>> ... Although Tika is more the framework for plugging in such stuff,
>> it perhaps makes sense to try to start something like that as sub
>> projects of Tika?...
> 
> I would agree, although IMHO Tika should reuse existing libraries as
> much as possible.
> 
Yes, it doesn't make sense to reinvent the wheel if there are
good-enough libraries out there. But afaik for several formats there
aren't suitable libs available, so these are the cases where I think
that it makes sense to "drag them in".

> In some cases, the Tika part could just consist of automated tests for
> existing libraries, to help in selecting and validating them.
> 
> -Bertrand
> 


-- 
Carsten Ziegeler
cziegeler@apache.org


Re: Support for document libraries

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 7/10/07, Carsten Ziegeler <cz...@apache.org> wrote:

> ... Although Tika is more the framework for plugging in such stuff, it perhaps
> makes sense to try to start something like that as sub projects of Tika?...

I would agree, although IMHO Tika should reuse existing libraries as
much as possible.

In some cases, the Tika part could just consist of automated tests for
existing libraries, to help in selecting and validating them.
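
A rough sketch of what such a check could look like, assuming each
candidate library can be wrapped as a function from input to extracted
text. The class name, the score method, and the "libA"/"libB" candidates
are all hypothetical placeholders, not real libraries or Tika APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class LibraryCheckSketch {

    /**
     * Runs every candidate against the same samples (input -> expected output)
     * and returns how many samples each candidate handled correctly.
     */
    static Map<String, Integer> score(Map<String, Function<String, String>> candidates,
                                      Map<String, String> samples) {
        Map<String, Integer> passed = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, String>> candidate : candidates.entrySet()) {
            int ok = 0;
            for (Map.Entry<String, String> sample : samples.entrySet()) {
                try {
                    String actual = candidate.getValue().apply(sample.getKey());
                    if (sample.getValue().equals(actual)) ok++;
                } catch (RuntimeException e) {
                    // A crash on a sample counts as a failure for that sample.
                }
            }
            passed.put(candidate.getKey(), ok);
        }
        return passed;
    }

    public static void main(String[] args) {
        // Hypothetical candidates standing in for wrapped third-party libraries.
        Map<String, Function<String, String>> candidates = new LinkedHashMap<>();
        candidates.put("libA", String::trim);
        candidates.put("libB", s -> {
            if (s.isEmpty()) throw new IllegalArgumentException("empty input");
            return s;
        });

        // Shared samples: input mapped to the output we expect.
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("  hello ", "hello");
        samples.put("", "");
        samples.put("plain", "plain");

        System.out.println(score(candidates, samples));
    }
}
```

The per-library counts give a simple basis for comparing candidates on
the same inputs.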

-Bertrand