Posted to dev@commons.apache.org by Nick Burch <ni...@apache.org> on 2015/10/08 17:45:10 UTC

A handful of Java file format libraries + tools - would commons be a good home?

Hi All

TL;DR - There's a handful of Java mini-projects, one per file format, 
each with a library and command-line tools, in and around Apache Tika. 
Would Commons be a good Apache home for them?


Apache Tika, for those who don't know, is a toolkit for detecting file 
types, then extracting consistent structured metadata and content. It 
wraps a whole bunch of other Java libraries, and hides all the 
complexity from users.

In a few cases, there hasn't been a suitably licensed / available 
library for a format that Tika wanted to support, so we've ended up 
having to write our own. As part of an experiment, some of these are in 
the Tika codebase, and some are hosted externally. A few of them are 
generally useful, in particular the Ogg and the MP3 ones.

For the formats where the support code is in Tika, we're not seeing any 
re-use beyond Tika. The code is embedded in the Tika Parsers jar, and 
no-one would think to look in there for some generic file format code. 
Nor would you really expect to find it in Tika anyway, even if it had 
its own jar. For the Ogg code, which we've tried hosting on GitHub, 
there has been some re-use of the code. There hasn't been all that much 
visibility though, and releasing without the Apache infrastructure can 
be a bit of a pain, plus one single person needs to take charge of the 
project.

For Ogg, as well as the Java library code, there's the Tika plugin code, 
and command line tools. No audio encoding/decoding yet, but much of the 
work is there if someone wanted to finish it off. We're considering 
adding a SAS7BDAT library to this little grouping shortly too, which as 
well as being used by Apache Tika, would also be used by Apache 
MetaModel, possibly some others too, and would have command line tools.


Following some discussions last week at ApacheCon / Apache Big Data / 
ApacheCon BarCamp on this, it was suggested we try asking here if you 
think these could have a good home in Apache Commons? On the one hand, 
they are in Java, and are re-usable. On the other, they have command 
line tool packages as well, which doesn't seem that commons-like, ditto 
the multimedia encoding/decoding parts which are nearly there.

What do you all think? Could Commons be a suitable home for them? Or 
should we look elsewhere? (We do have a backup idea if needed)

Thanks
Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: A handful of Java file format libraries + tools - would commons be a good home?

Posted by Phil Steitz <ph...@gmail.com>.
On 10/13/15 4:15 PM, Nick Burch wrote:
> On Sat, 10 Oct 2015, Phil Steitz wrote:
>> On 10/8/15 8:45 AM, Nick Burch wrote:
>>> TL;DR - There's a handful of Java mini-projects, one per file
>>> format, each with a library and command-line tools, in and around
>>> Apache Tika. Would Commons be a good Apache home for them?
>>
>> This depends on what you mean by "home."  What does not work is just
>> parking code here and hoping someone else picks it up.  What can
>> work fine is moving some code here and working on it and building
>> community around it here.  There just needs to be a micro-community
>> interested and willing to generate interest in the code and maintain
>> it.  If this is the case, then you all are most welcome to join us.
>
> There's a small community for each, with a fair bit of overlap,
> and from Tika at least a strong desire that bugs get fixed and new
> releases pushed out after!
>
> Not enough to manage the 3 +1s, plus some spares, that a full PMC
> would need for each, and there probably never would be for each
> format individually. It would need others from Commons to help with
> the release checks and voting. It would also involve releases of
> tool jars, possibly tool wrapper scripts, along with the Java
> library jars that are (I think?) more normal for Commons.

No problem there, as long as it is all RAT happy etc.
>
> Does that still sound like something suitable for commons?

Seems suitable to me.  And of course, all are always welcome to
contribute to the components we already have :)


Phil
>
> Or do you think we'd be better bundling this handful of libraries
> and tools up in their own "a bit like commons" project with some
> volunteers from Tika and one or two others?
>
> Thanks
> Nick


Re: A handful of Java file format libraries + tools - would commons be a good home?

Posted by Nick Burch <ni...@apache.org>.
On Sat, 10 Oct 2015, Phil Steitz wrote:
> On 10/8/15 8:45 AM, Nick Burch wrote:
>> TL;DR - There's a handful of Java mini-projects, one per file
>> format, each with a library and command-line tools, in and around
>> Apache Tika. Would Commons be a good Apache home for them?
>
> This depends on what you mean by "home."  What does not work is just
> parking code here and hoping someone else picks it up.  What can
> work fine is moving some code here and working on it and building
> community around it here.  There just needs to be a micro-community
> interested and willing to generate interest in the code and maintain
> it.  If this is the case, then you all are most welcome to join us.

There's a small community for each, with a fair bit of overlap, and from 
Tika at least a strong desire that bugs get fixed and new releases pushed 
out after!

Not enough to manage the 3 +1s, plus some spares, that a full PMC would 
need for each, and there probably never would be for each format 
individually. It would need others from Commons to help with the release 
checks and voting. It would also involve releases of tool jars, possibly 
tool wrapper scripts, along with the Java library jars that are (I think?) 
more normal for Commons.

Does that still sound like something suitable for commons?

Or do you think we'd be better bundling this handful of libraries and 
tools up in their own "a bit like commons" project with some volunteers 
from Tika and one or two others?

Thanks
Nick



Re: A handful of Java file format libraries + tools - would commons be a good home?

Posted by Phil Steitz <ph...@gmail.com>.
On 10/8/15 8:45 AM, Nick Burch wrote:
> Hi All
>
> TL;DR - There's a handful of Java mini-projects, one per file
> format, each with a library and command-line tools, in and around
> Apache Tika. Would Commons be a good Apache home for them?
>
>
> Apache Tika, for those who don't know, is a toolkit for detecting
> file types, then extracting consistent structured metadata and
> content. It wraps a whole bunch of other Java libraries, and hides
> all the complexity from users.
>
> In a few cases, there hasn't been a suitably licensed / available
> library for a format that Tika wanted to support, so we've ended
> up having to write our own. As part of an experiment, some of
> these are in the Tika codebase, and some are hosted externally. A
> few of them are generally useful, in particular the Ogg and the
> MP3 ones.
>
> For the formats where the support code is in Tika, we're not
> seeing any re-use beyond Tika. The code is embedded in the Tika
> Parsers jar, and no-one would think to look in there for some
> generic file format code. Nor would you really expect to find it
> in Tika anyway, even if it had its own jar. For the Ogg code,
> which we've tried hosting on GitHub, there has been some re-use of
> the code. There hasn't been all that much visibility though, and
> releasing without the Apache infrastructure can be a bit of a
> pain, plus one single person needs to take charge of the project.
>
> For Ogg, as well as the Java library code, there's the Tika plugin
> code, and command line tools. No audio encoding/decoding yet, but
> much of the work is there if someone wanted to finish it off.
> We're considering adding a SAS7BDAT library to this little
> grouping shortly too, which as well as being used by Apache Tika,
> would also be used by Apache MetaModel, possibly some others too,
> and would have command line tools.
>
>
> Following some discussions last week at ApacheCon / Apache Big
> Data / ApacheCon BarCamp on this, it was suggested we try asking
> here if you think these could have a good home in Apache Commons?
> On the one hand, they are in Java, and are re-usable. On the
> other, they have command line tool packages as well, which doesn't
> seem that commons-like, ditto the multimedia encoding/decoding
> parts which are nearly there.
>
> What do you all think? Could Commons be a suitable home for them?
> Or should we look elsewhere? (We do have a backup idea if needed)

This depends on what you mean by "home."  What does not work is just
parking code here and hoping someone else picks it up.  What can
work fine is moving some code here and working on it and building
community around it here.  There just needs to be a micro-community
interested and willing to generate interest in the code and maintain
it.  If this is the case, then you all are most welcome to join us.  

Phil
>
> Thanks
> Nick