Posted to dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2009/04/07 23:29:06 UTC

Getting rid of jackrabbit-text-extractors

Hi,

JCR-1878 is now resolved and Jackrabbit trunk now depends on Apache
Tika for text extraction functionality. Thus there is little need left
for jackrabbit-text-extractors as a standalone component. Anyone who
needs that functionality separately from jackrabbit-core should just
go for Tika directly.

For backwards compatibility with existing configurations (and
potential extensions) we still need the current
org.apache.jackrabbit.extractor classes, but I'm thinking of simply
moving the entire package to jackrabbit-core and deprecating
everything except the new Tika-based extractor. In fact I'd even go as
far as changing the indexing code in jackrabbit-core to use the Tika
Parser interface directly and only provide a backwards-compatibility
layer for the TextExtractor classes we have.
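
As a rough illustration of what such a backwards-compatibility layer
could look like, here is a minimal sketch of an adapter that lets
parser-oriented indexing code call an existing extractor. Everything in
it is an assumption made for illustration: SimpleParser is a
hypothetical, heavily simplified stand-in for the Tika Parser interface
(the real one works with SAX ContentHandlers and Metadata),
LegacyTextExtractor only approximates the shape of the existing
TextExtractor contract, and PlainTextExtractor and ExtractorBackedParser
are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified stand-in for a Tika-style parser (not the real API).
interface SimpleParser {
    String parseToText(InputStream stream, String contentType);
}

// Assumed approximation of the legacy extractor contract.
interface LegacyTextExtractor {
    String[] getContentTypes();
    Reader extractText(InputStream stream, String type, String encoding) throws IOException;
}

// Example legacy extractor: decodes the stream as text (UTF-8 by default).
final class PlainTextExtractor implements LegacyTextExtractor {
    public String[] getContentTypes() {
        return new String[] {"text/plain"};
    }

    public Reader extractText(InputStream stream, String type, String encoding) throws IOException {
        return new InputStreamReader(stream, encoding != null ? encoding : "UTF-8");
    }
}

// Compatibility layer: exposes a legacy extractor through the parser interface.
final class ExtractorBackedParser implements SimpleParser {
    private final LegacyTextExtractor extractor;

    ExtractorBackedParser(LegacyTextExtractor extractor) {
        this.extractor = extractor;
    }

    public String parseToText(InputStream stream, String contentType) {
        for (String supported : extractor.getContentTypes()) {
            if (supported.equalsIgnoreCase(contentType)) {
                try (Reader reader = extractor.extractText(stream, contentType, null)) {
                    StringBuilder text = new StringBuilder();
                    char[] buffer = new char[4096];
                    int n;
                    while ((n = reader.read(buffer)) != -1) {
                        text.append(buffer, 0, n);
                    }
                    return text.toString();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        // Unsupported content type: contribute no text to the index.
        return "";
    }
}

final class AdapterDemo {
    public static void main(String[] args) {
        SimpleParser parser = new ExtractorBackedParser(new PlainTextExtractor());
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(parser.parseToText(in, "text/plain"));
    }
}
```

With a wrapper along these lines the indexing code would only ever see
the parser-style interface, while extractors named in existing
configurations keep working until they are dropped.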

Thus Jackrabbit 1.6 would no longer contain a separate text-extractors
jar, but all the existing TextExtractor classes would still be
included. In Jackrabbit 2.0 we'd drop all the TextExtractors and only
use Tika Parsers.

BR,

Jukka Zitting

Re: Getting rid of jackrabbit-text-extractors

Posted by Marcel Reutegger <ma...@gmx.net>.

On Wed, Apr 8, 2009 at 14:13, Jukka Zitting <ju...@gmail.com> wrote:
> (I assume you mean jackrabbit-text-extractors)

ah, right. sorry.

> The SearchIndex class currently has a hard dependency on TextExtractor
> that must also be present at runtime, so we can't make the
> text-extractors dependency optional without changing things. I'd
> prefer to replace that dependency with one to the Tika Parser
> interface, but then we need a hard Maven dependency on Tika.

ideally I'd like to have the parser/text-extractor dependencies
optional and whoever uses jackrabbit-core needs to decide which
parsers/text-extractors he wants to use (the same actually also applies
to persistence managers). however, that requires that we get rid of the
runtime dependencies on Tika and/or jackrabbit-text-extractors, but I
don't know how this can be done easily (without reflection hacks).
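
For reference, the kind of reflection-based lookup alluded to above
could look roughly like the sketch below. It is only an illustration
under stated assumptions: OptionalComponentLoader is a hypothetical
helper, not an existing Jackrabbit class, and the class name passed in
would come from configuration. It uses only standard java.lang
reflection calls and returns an empty Optional when the class, or
something it links against, is missing from the classpath.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper: instantiates a component by class name if it is available.
final class OptionalComponentLoader {

    private OptionalComponentLoader() {
    }

    static <T> Optional<T> load(String className, Class<T> type) {
        try {
            Class<?> cls = Class.forName(className);
            Object instance = cls.getDeclaredConstructor().newInstance();
            return Optional.of(type.cast(instance));
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Class (or one of its own dependencies) is absent, has no usable
            // no-argument constructor, or is not of the expected type.
            return Optional.empty();
        }
    }
}

final class LoaderDemo {
    public static void main(String[] args) {
        // A class that is always present in the JDK.
        System.out.println(OptionalComponentLoader.load("java.util.ArrayList", List.class).isPresent());
        // A made-up class name that is not on the classpath.
        System.out.println(OptionalComponentLoader.load("org.example.missing.NoSuchExtractor", Object.class).isPresent());
    }
}
```

The drawback is visible even in this small sketch: a missing jar only
shows up at runtime as an empty result rather than as a compile-time or
dependency-resolution error.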

> In either case I think it's best for everyone if the current
> TextExtractor classes will remain in the runtime classpath (in either
> the text-extractors or the core jar) so that there's no need to modify
> existing configurations.

OK, I agree. backward compatibility should be guaranteed for 1.6 and
there shouldn't be deployment surprises with an upgrade.

regards
 marcel

Re: Getting rid of jackrabbit-text-extractors

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Apr 8, 2009 at 1:43 PM, Marcel Reutegger
<ma...@gmx.net> wrote:
> On Tue, Apr 7, 2009 at 23:29, Jukka Zitting <ju...@gmail.com> wrote:
>> Thus Jackrabbit 1.6 would no longer contain a separate text-extractors
>> jar, but all the existing TextExtractor classes would still be
>> included. In Jackrabbit 2.0 we'd drop all the TextExtractors and only
>> use Tika Parsers.
>
> hmm, this adds quite some dependencies to jackrabbit-core.

Currently we already have quite a few parsing dependencies through
jackrabbit-text-extractors. Tika has even more, but with JCR-1878
we're already including them.

There's been some discussion in Tika about splitting Tika into a core
jar with no dependencies (or just a few like commons-io), and one or
more separate parser jars that contain the Parser implementations
that depend on the various parser libraries like POI. I could push
that idea forward in Tika if it would be useful in Jackrabbit.

> What if we kept the dependency from jackrabbit-core to
> jackrabbit-jcr-tests at version 1.5 but at the same time flag it
> optional? That would remove it from the dependency tree but you'd
> still have it in the pom (until we remove it in 2.0).

(I assume you mean jackrabbit-text-extractors)

The SearchIndex class currently has a hard dependency on TextExtractor
that must also be present at runtime, so we can't make the
text-extractors dependency optional without changing things. I'd
prefer to replace that dependency with one to the Tika Parser
interface, but then we need a hard Maven dependency on Tika.

In either case I think it's best for everyone if the current
TextExtractor classes will remain in the runtime classpath (in either
the text-extractors or the core jar) so that there's no need to modify
existing configurations.

BR,

Jukka Zitting

Re: Getting rid of jackrabbit-text-extractors

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi,

On Tue, Apr 7, 2009 at 23:29, Jukka Zitting <ju...@gmail.com> wrote:
> JCR-1878 is now resolved and Jackrabbit trunk now depends on Apache
> Tika for text extraction functionality. Thus there is little need left
> for jackrabbit-text-extractors as a standalone component. Anyone who
> needs that functionality separately from jackrabbit-core should just
> go for Tika directly.

+1

> For backwards compatibility with existing configurations (and
> potential extensions) we still need the current
> org.apache.jackrabbit.extractor classes, but I'm thinking of simply
> moving the entire package to jackrabbit-core and deprecating
> everything except the new Tika-based extractor. In fact I'd even go as
> far as changing the indexing code in jackrabbit-core to use the Tika
> Parser interface directly and only provide a backwards-compatibility
> layer for the TextExtractor classes we have.
>
> Thus Jackrabbit 1.6 would no longer contain a separate text-extractors
> jar, but all the existing TextExtractor classes would still be
> included. In Jackrabbit 2.0 we'd drop all the TextExtractors and only
> use Tika Parsers.

hmm, this adds quite some dependencies to jackrabbit-core.

What if we kept the dependency from jackrabbit-core to
jackrabbit-jcr-tests at version 1.5 but at the same time flag it
optional? That would remove it from the dependency tree but you'd
still have it in the pom (until we remove it in 2.0).

regards
 marcel