Posted to dev@tika.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2013/11/04 13:32:21 UTC

Re: [DISCUSS] Integrate Apache Any23 into Apache Tika

Hi Folks,

Now that the Any23 0.9.0 release is out of the way, I would like to come
back to this thread. Firstly, thank you for all of the useful (and
encouraging) comments.
As I am subscribed to the dev-digest@tika list I get all replies through in
bulk... which in this case has turned out to be quite handy. Ken, Julien
and Chris, please see my replies in line. I've posted them as I received
them within the digest email.
Thanks
Lewis

On Sat, Oct 19, 2013 at 8:31 PM, <de...@tika.apache.org> wrote:

>
> [DISCUSS] Integrate Apache Any23 into Apache Tika
>         10047 by: Lewis John Mcgibbney
>         10048 by: Ken Krugler
>         10049 by: Julien Nioche
>         10050 by: Chris Mattmann
>
>
Hi Ken,


>
> I haven't had much time to look into Any23, which includes reviewing
> Markus's patch for integrating some portions of that into Tika (see
> https://issues.apache.org/jira/browse/TIKA-980)
>
>
Thanks for the pointer. I was not even aware of this patch until you
dropped the link here.


> The main challenge I see is that Tika seems to do best as a wrapper for
> other parsers, versus outright ownership of parsers.
>

Actually Ken, this characteristic of Tika is also shared by some of the
Any23 implementations as well. An example would be our heavy reliance upon
several Sesame OpenRDF libraries for parsing of nquads, RDF, turtle, etc.
formats. The factory implementations we currently have would most likely
port pretty well to Tika.


>
> Which isn't to say that rolling Any23 into Tika wouldn't work, but without
> at least one active developer it would seem likely that it would languish,
> without active development.
>

Please don't take this the wrong way. The idea is not to *dump* Any23 into
Tika. There are some of us within the community that maintain various parts
of the code as we use those parts (modules) for our specific types of work
and interests. There has been a growing recognition (for me anyway) that
the parts the active committers (under 3 of us currently) seem to be using
work... and work pretty well actually. This has meant that other aspects of
Any23 (such as extraction of structured content from CSV within the csv
module for example) have gone untouched for a while. I for one would
continue to work on the Any23 wrappers and parsers should they be taken
over to Tika as I have a use case for them. Hopefully, some of the Tika
users and developers may also find that under the common Tika API, they
also can have a use for the Any23 wrappers and parsers.


>
> -- Ken
>

Thanks for taking the time to write Ken.
Best


>
> Hi,
>

Hi Julien,


>
> I had a look at Any23 some time ago and found that it overlapped with quite
> a few other projects indeed but could (should?) have either relied on those
> projects (e.g. parsing and mimetype stuff to Tika) or delegated the
> functionality altogether (e.g. crawling to Nutch) instead of reinventing
> the wheel and spreading itself thin.
>

Well we currently use and directly import Tika 1.4 (tika-core and
tika-parsers) as direct dependencies within any23-core, any23-encoding and
any23-mime modules, so we have little duplication of code for the mimetype
and encoding stuff. The crawling is merely an example plugin of how Any23
could be deployed within a crawler... this I suspect would NOT become part
of what we would wish to bring over to Tika. The same most likely goes for
some of the other plugins we have. I do agree with your points here... and
I think that this is why the project failed to attract and build the
community we so desperately require to keep moving forward.


>
> I am not familiar with the history of the project, where the code comes
> from and who was behind it but I am a bit surprised that the project was
> allowed to graduate from incubation without these points being addressed.
>

Please see here
http://any23.apache.org/acknowledgements.html
I was with the project through incubation. I actually thought we were doing
OK. It seemed like after we got our first incubating release out of the way
and we were moving on that things kind of went to the wall so to speak.


>
> Migrating the code to Tika as a whole would not be a good idea I think.
>

By 'as a whole' you mean migrating ALL of the Any23 code, or that OVERALL
it is not a good idea?
If your opinion is the first of the above then I share it. There are
certainly parts of the current Any23 code base that will not be of value to
Tika.


> However from a Tika point of view, it could be interesting to have the meta
> parsers to convert the semantic information into a neutral representation
> as a ContentHandler as in TIKA-980.


+1


> Most people would probably be
> interested in that more than the generation side of Any23 (what is referred
> to as output format) which I think is not so relevant for Tika. From an
> Any23 perspective, the project could then focus on the generation side and
> just rely on Tika for pretty much everything else.
>

+1. Any23 is a useful toolkit with many facets. It is our job however to
recognize what is useful for keeping in line with what Tika does best and
finding the common ground there.


>
> Julien
>

Thanks Julien


>
> Lewis,
>

Hi Chris,


>
> I for one am supportive of this measure somehow. The exact
> mechanism by which we can do this is something that could
> involve e.g., taking you, or anyone else from the Any23 community
> (at this point I think it's really just you by my own accord lurking
> on the lists over there) that is interested and bringing you into
> the fray for Tika perhaps working on a branch to integrate Any23 and
> Tika more closely together, in patch-wise, piecemeal fashion along
> the standard way we operate over here in Tika and in Apache.
>

Huge +1. We recently had some positive news from Giovanni Tummarello that
folks from FBK may be able to help the transition as well.
http://www.mail-archive.com/dev%40any23.apache.org/msg00962.html
I'll keep up with the developments. I for one am committed to making this
work. I'll continue the discussions over on dev@any23 in light of this
thread.


>
> This would also seem to be in tune with Julien's comment and feeling
> that parts of this would make a lot of sense to be part of Tika.
>

+1


>
> Lewis, why don't you get more opinions over the next week or so
> and if there are no strong objections, call a VOTE for the Tika
> PMC to VOTE on?
>
> Does that work?
>
Yeah it does work Chris. We just made the release for Any23 0.9.0 and maybe
we can drum up some Any23 committers to take it forward to Tika. As I said
above, I'll keep the momentum going and report back here in dribs and drabs
as we make progress.

Thank you everyone for the input.
Best
Lewis