Posted to dev@tika.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2013/10/18 16:30:29 UTC

[DISCUSS] Integrate Apache Any23 into Apache Tika

Hi Tika Devs/PMC,

This thread aims to identify the common ground shared by Any23 and Tika,
with a view to possibly integrating Any23 into Tika.
First, however, I would like to put this into context and provide some
rationale behind the initiative.

It is my understanding that the Tika PMC sponsored Any23 through the Apache
Incubator until we (the Any23 PMC) were ready to graduate, having made an
incubating release and having grown the community somewhat. After
graduation, we made a 0.8.0 release in July 2013.

It is also my understanding that the justification for the Tika PMC
sponsoring us was that numerous devs saw common ground between the aims
and objectives of both projects, e.g. MIME type detection, parsing,
extraction of metadata, serialization, etc. With a little positive thinking
and an understanding of both projects, one can clearly see the shared
interests.

I am speaking on behalf of the Any23 community here when I say that we
have, however, come to the realization that the community is not as vibrant
as we would like. In addition, the original project devs are not around
right now to keep the project moving forward.

It is therefore of interest to us to approach the Tika community with the
intention of discussing a proposal to integrate Any23 code into Apache Tika.

For those interested, the Any23 project URL is http://any23.apache.org. We
also have a live service which you can use to get a feel for what Any23
actually does. It can be found at http://any23.org.

Any feedback from this community would be really appreciated, as it looks
like the alternative would be for us to take the code into the Apache
Attic... which is always a last resort.

Thanks in advance.

Lewis

-- 
*Lewis*

Re: [DISCUSS] Integrate Apache Any23 into Apache Tika

Posted by Julien Nioche <li...@gmail.com>.
Hi,

I had a look at Any23 some time ago and found that it did indeed overlap
with quite a few other projects, but could (should?) have either relied on
those projects (e.g. parsing and mimetype stuff to Tika) or delegated the
functionality altogether (e.g. crawling to Nutch) instead of reinventing
the wheel and spreading itself thin.

I am not familiar with the history of the project, where the code comes
from and who was behind it, but I am a bit surprised that the project was
allowed to graduate from incubation without these points being addressed.

Migrating the code to Tika as a whole would not be a good idea, I think.
However, from a Tika point of view, it could be interesting to have the meta
parsers convert the semantic information into a neutral representation
as a ContentHandler, as in TIKA-980. Most people would probably be more
interested in that than in the generation side of Any23 (what is referred
to as output format), which I think is not so relevant for Tika. From an
Any23 perspective, the project could then focus on the generation side and
just rely on Tika for pretty much everything else.

I haven't looked into Any23 in great detail and there could be other
interesting things to take from it.

Julien



On 18 October 2013 15:46, Ken Krugler <kk...@transpac.com> wrote:

> Hi Lewis,
>
> I haven't had much time to look into Any23, which includes reviewing
> Markus's patch for integrating some portions of that into Tika (see
> https://issues.apache.org/jira/browse/TIKA-980).
>
> The main challenge I see is that Tika seems to do best as a wrapper for
> other parsers, versus outright ownership of parsers.
>
> Which isn't to say that rolling Any23 into Tika wouldn't work, but without
> at least one active developer it would seem likely that it would languish.
>
> But maybe that's OK…
>
> -- Ken
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: [DISCUSS] Integrate Apache Any23 into Apache Tika

Posted by Ken Krugler <kk...@transpac.com>.
Hi Lewis,

I haven't had much time to look into Any23, which includes reviewing Markus's patch for integrating some portions of that into Tika (see https://issues.apache.org/jira/browse/TIKA-980).

The main challenge I see is that Tika seems to do best as a wrapper for other parsers, versus outright ownership of parsers.

Which isn't to say that rolling Any23 into Tika wouldn't work, but without at least one active developer it would seem likely that it would languish.

But maybe that's OK…

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: [DISCUSS] Integrate Apache Any23 into Apache Tika

Posted by Chris Mattmann <ma...@apache.org>.
Lewis,

I for one am supportive of this measure in some form. The exact
mechanism could involve, e.g., taking you, or anyone else from the
Any23 community who is interested (at this point I think it's really
just you, judging by my own lurking on the lists over there), and
bringing you into the fray for Tika, perhaps working on a branch to
integrate Any23 and Tika more closely, in patch-wise, piecemeal
fashion, following the standard way we operate over here in Tika and
in Apache.

This would also seem to be in tune with Julien's comment and feeling
that parts of Any23 would make a lot of sense as part of Tika.

Lewis, why don't you get more opinions over the next week or so
and, if there are no strong objections, call a VOTE for the Tika
PMC to VOTE on?

Does that work?

Thanks!

Cheers,
Chris

