Posted to dev@jackrabbit.apache.org by Marcel Reutegger <ma...@gmx.net> on 2007/10/01 14:27:10 UTC

spellchecker

Hi,

I'm about to write a spellchecker extension for the lucene query handler in 
jackrabbit. I planned to use the lucene-spellchecker contrib, however I don't 
want to introduce another dependency in the jackrabbit-core. because the 
spellchecker contrib in lucene only includes a handful of classes I would prefer 
to copy the classes and refactor them into the jackrabbit package space.

does anyone have a better idea how to handle this?

regards
  marcel

Re: spellchecker

Posted by Marcel Reutegger <ma...@gmx.net>.

Jukka Zitting wrote:
> Some concerns though, as I figure the spell checker would use the
> search index as a dictionary. Can there be a case where this feature
> could be used to circumvent access controls to retrieve isolated
> pieces of content from read-protected documents?

yes, that would be possible. whatever is indexed will be used as a basis for 
spell checking. we should probably put a warning on the spell checker (at least 
on the one I'm implementing).

>> I planned to use the lucene-spellchecker contrib, however I don't
>> want to introduce another dependency in the jackrabbit-core. because the
>> spellchecker contrib in lucene only includes a handful of classes I would prefer
>> to copy the classes and refactor them into the jackrabbit package space.
>>
>> does anyone have a better idea how to handle this?
> 
> Would there be interest within the Lucene team to include the feature
> in a future release of lucene-core?

hmm, don't know. I haven't asked them. but if I had to decide I wouldn't include 
it, because it is clearly not a core feature.

> I see where Felix is going with extra modules, but there's always a
> cost in complexity with such modularity and I'm not sure if this
> feature is worth that overhead.

I think introducing the spell checker in the contrib is a good start. we can 
still promote it or integrate it later into an existing module.

regards
  marcel

Re: spellchecker

Posted by Felix Meschberger <fm...@gmail.com>.

Hi,

On Tuesday, 02.10.2007, at 11:04 +0300, Jukka Zitting wrote:
> I see where Felix is going with extra modules, but there's always a
> cost in complexity with such modularity and I'm not sure if this
> feature is worth that overhead.

I agree that it may be overhead for this isolated feature. In terms of
overall Jackrabbit extensibility, though, the added complexity adds to the
versatility of Jackrabbit.

And of course, there would be APIs to be obeyed and managed :-)

Regards
Felix

Re: spellchecker

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 10/1/07, Marcel Reutegger <ma...@gmx.net> wrote:
> I'm about to write a spellchecker extension for the lucene query handler in
> jackrabbit.

Cool!

Some concerns though, as I figure the spell checker would use the
search index as a dictionary. Can there be a case where this feature
could be used to circumvent access controls to retrieve isolated
pieces of content from read-protected documents? I guess the threat is
a bit theoretical, but how about a case where an attacker just wants
to know if a repository contains some specific material (a list of
specific names, etc.). The attacker could use the spellchecker as a
mechanism to find out if a workspace contains a document with a
specific name or keyword.
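
To make the concern concrete, here is a minimal sketch of such a probe. It is a toy stand-in, not the Lucene spellchecker contrib: the class name SpellcheckProbe, the in-memory term list, and the plain edit-distance matching are all assumptions made for illustration. The only point is that a suggestion drawn from the index reveals that a term exists somewhere in the workspace, whether or not the caller may read the document that contains it.

```java
import java.util.List;

public class SpellcheckProbe {

    /** Classic dynamic-programming Levenshtein edit distance. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the row just computed.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the closest indexed term within maxDistance edits, or null
     * if the word is itself indexed or no term is close enough.
     */
    public static String suggest(List<String> indexedTerms, String word, int maxDistance) {
        if (indexedTerms.contains(word)) {
            return null;
        }
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String term : indexedTerms) {
            int d = editDistance(word, term);
            if (d < bestDistance) {
                best = term;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical index terms, including terms from documents
        // that the caller is not allowed to read.
        List<String> index = List.of("budget", "merger", "roadmap");
        System.out.println(suggest(index, "mergar", 2));
        System.out.println(suggest(index, "zzzzzz", 2));
    }
}
```

In this sketch the first probe yields a suggestion and the second does not, so a caller with no read access learns that "merger" occurs somewhere in the indexed content.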

> I planned to use the lucene-spellchecker contrib, however I don't
> want to introduce another dependency in the jackrabbit-core. because the
> spellchecker contrib in lucene only includes a handful of classes I would prefer
> to copy the classes and refactor them into the jackrabbit package space.
>
> does anyone have a better idea how to handle this?

Would there be interest within the Lucene team to include the feature
in a future release of lucene-core?

I see where Felix is going with extra modules, but there's always a
cost in complexity with such modularity and I'm not sure if this
feature is worth that overhead.

BR,

Jukka Zitting

RE: spellchecker

Posted by Ard Schrijvers <a....@hippo.nl>.


> Marcel Reutegger wrote:
> the primary use case is in fact a 'did you mean' functionality.
> 
> e.g. if a query does not return the expected results (too few 
> results, or whatever the application thinks is not 
> appropriate) you would then execute the following query:
> 
> /jcr:root[rep:spellcheck('softeware')]/(rep:spellcheck())
> 
> this will always return the root node of the workspace 
> because rep:spellcheck() with a string always evaluates to 
> true. the string literal is the same as the one you can pass 
> to a jcr:contains() function. the function rep:spellcheck() 
> without arguments is a pseudo property which refers to the 
> previous function and returns a suggestion for the misspelled 
> statement or null if the statement is correctly spelled.

Thanks for the explanation, Marcel (I have been on holiday, hence my late
reply).

Regards Ard

> 
> regards
>   marcel
>

Re: spellchecker

Posted by Marcel Reutegger <ma...@gmx.net>.

Ard Schrijvers wrote:
> Just wondering what this interface would look like, and how you are
> planning for the spellchecker to be used. I really think it is nice to
> have, but I cannot picture how to use it from, for example, XPath or SQL
> (I suppose it will introduce some "did you mean" option as well when
> searching and you find only hits with a score < min ).

the primary use case is in fact a 'did you mean' functionality.

e.g. if a query does not return the expected results (too few results, or 
whatever the application thinks is not appropriate) you would then execute the 
following query:

/jcr:root[rep:spellcheck('softeware')]/(rep:spellcheck())

this will always return the root node of the workspace because rep:spellcheck() 
with a string always evaluates to true. the string literal is the same as the 
one you can pass to a jcr:contains() function. the function rep:spellcheck() 
without arguments is a pseudo property which refers to the previous function and 
returns a suggestion for the misspelled statement or null if the statement is 
correctly spelled.
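
A small sketch of how an application might assemble that statement for arbitrary user input. The class name SpellcheckQuery and the method build are made up for this example, and escaping an apostrophe by doubling it is an assumption about XPath string literals rather than something stated in this thread.

```java
public class SpellcheckQuery {

    /** Builds the rep:spellcheck statement shown above for the given text. */
    public static String build(String text) {
        // Assumption: an apostrophe inside an XPath string literal
        // is escaped by doubling it.
        String literal = text.replace("'", "''");
        return "/jcr:root[rep:spellcheck('" + literal + "')]/(rep:spellcheck())";
    }

    public static void main(String[] args) {
        System.out.println(build("softeware"));
        // With a JCR session at hand, an application would presumably pass
        // this statement to QueryManager.createQuery(statement, Query.XPATH),
        // execute it, and read the rep:spellcheck() value of the single
        // result row; a null value would mean there is no suggestion.
    }
}
```

Since the statement always matches the root node, the application only needs to look at the first row of the result.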

regards
  marcel

RE: spellchecker

Posted by Ard Schrijvers <a....@hippo.nl>.

Hello Marcel,

> Marcel Reutegger wrote:
> ok, after some re-thinking I will rather create a new contrib 
> and introduce a spellchecker interface in jackrabbit-core 
> (well, I intended to do that anyway...). the 
> jackrabbit-spellchecker (?) will then be similar to the 
> wordnet-synonyms contrib. if you want the spellchecker 
> functionality you need to configure it and deploy the 
> relevant jar files.
> 

Just wondering what this interface would look like, and how you are
planning for the spellchecker to be used. I really think it is nice to
have, but I cannot picture how to use it from, for example, XPath or SQL
(I suppose it will introduce some "did you mean" option as well when
searching and you find only hits with a score < min ).

Regards Ard

Re: spellchecker

Posted by Felix Meschberger <fm...@gmail.com>.

On Monday, 01.10.2007, at 16:01 +0200, Marcel Reutegger wrote:
> Felix Meschberger wrote:
> > Just another question: Could the spellchecker be instantiated and
> > injected into a running Jackrabbit instance ?
> 
> AFAIK jackrabbit is not that flexible. the configuration is rather fixed. but 

Yes, this is a long standing issue with Jackrabbit, actually ...

> maybe we can come up with a list of components that can be configured or 
> deployed at runtime, while some of the components like persistence managers are 
> probably only configurable at startup.

We could add some sort of Module API, which would allow injecting
additional Jackrabbit functionality. Of course, the internal Jackrabbit
API available to such modules must be well defined and not just be "all
Jackrabbit-core packages", because this would prevent effective
modularization.

Regarding PersistenceManagers: as these are used to back workspaces, the
question is whether we want to implement an API to shut down workspaces - not
just stopping idle workspaces, but a method to stop a workspace, which
may include forcing sessions (which is allowed as per the spec!). I
am very much in favor of such a flexible solution, where we may "inject"
persistence managers by just creating the workspaces (actually, creating
a workspace in this respect would be starting a workspace) and removing
them again by stopping the workspaces.

This would very much enhance the modularity and extensibility of
Jackrabbit.

Again, this turns out to be a question of an internal API. Such an API
is essential to modularize Jackrabbit as a whole. And to be sure, this
is not the same as the Jackrabbit API, which is intended to be used by
client applications. This is an API used solely to extend Jackrabbit
instances themselves.

> what are the requirements to be able to inject such a bundle? do we have to 
> implement certain interfaces or is a setter method sufficient?

There is no requirement other than the mentioned Jackrabbit extension
API. Whatever Jackrabbit provides to extend it may be used by bundles to
plug into Jackrabbit.

Regards
Felix

Re: spellchecker

Posted by Marcel Reutegger <ma...@gmx.net>.

Felix Meschberger wrote:
> Just another question: Could the spellchecker be instantiated and
> injected into a running Jackrabbit instance ?

AFAIK jackrabbit is not that flexible. the configuration is rather fixed. but 
maybe we can come up with a list of components that can be configured or 
deployed at runtime, while some of the components like persistence managers are 
probably only configurable at startup.

what are the requirements to be able to inject such a bundle? do we have to 
implement certain interfaces or is a setter method sufficient?

regards
  marcel

Re: spellchecker

Posted by Felix Meschberger <fm...@gmail.com>.

Cool !

Just another question: Could the spellchecker be instantiated and
injected into a running Jackrabbit instance ? My use case would be an
OSGi framework environment, where an administrator may just deploy the
spellchecker as a bundle and the spellchecker bundle would inject itself
into a Jackrabbit instance. And of course, when the bundle is removed,
the spellchecker should be removed from the instance again.
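
As a rough sketch of what such runtime injection could look like, assuming a purely hypothetical extension point: neither SpellCheckerRegistry nor the SpellChecker interface below is an existing Jackrabbit API, and the OSGi wiring itself is left out. A bundle would call register when it starts and unregister when it stops.

```java
import java.util.concurrent.atomic.AtomicReference;

public class SpellCheckerRegistry {

    /** Hypothetical extension interface; not an actual Jackrabbit API. */
    public interface SpellChecker {
        /** Returns a suggestion for the statement, or null if there is none. */
        String check(String statement);
    }

    // Holds the currently installed checker, or null if none is deployed.
    private final AtomicReference<SpellChecker> current = new AtomicReference<>();

    /** Called when the extension is deployed, e.g. when its bundle starts. */
    public void register(SpellChecker checker) {
        current.set(checker);
    }

    /** Called on removal; clears the slot only if it still holds this checker. */
    public void unregister(SpellChecker checker) {
        current.compareAndSet(checker, null);
    }

    /** Delegates to the installed checker, or returns null if none is installed. */
    public String suggest(String statement) {
        SpellChecker checker = current.get();
        return checker == null ? null : checker.check(statement);
    }

    public static void main(String[] args) {
        SpellCheckerRegistry registry = new SpellCheckerRegistry();
        SpellChecker fixed = s -> "softeware".equals(s) ? "software" : null;
        registry.register(fixed);
        System.out.println(registry.suggest("softeware"));
        registry.unregister(fixed);
        System.out.println(registry.suggest("softeware"));
    }
}
```

Keeping the checker behind a single replaceable reference means the query handler never has to be restarted when the extension comes or goes; it simply gets no suggestion while nothing is installed.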

Regards
Felix


On Monday, 01.10.2007, at 15:10 +0200, Marcel Reutegger wrote:
> ok, after some re-thinking I will rather create a new contrib and introduce a 
> spellchecker interface in jackrabbit-core (well, I intended to do that 
> anyway...). the jackrabbit-spellchecker (?) will then be similar to the 
> wordnet-synonyms contrib. if you want the spellchecker functionality you need to 
> configure it and deploy the relevant jar files.
> 
> regards
>   marcel
> 
> Marcel Reutegger wrote:
> > Felix Meschberger wrote:
> >> Why does this extension have to reside in the Jackrabbit core ? If at
> >> all possible, it should also be made an add-on/extension to the core.
> >> This way adding the extension would also require addition of the
> >> dependencies and therefore copying/refactoring is not required.
> > 
> > the spellchecker would be an integral part of the lucene query handler 
> > implementation, which is contained in jackrabbit-core. if the 
> > spellchecker is an optional add-on to the query handler you have to 
> > configure it when you want to use it. that seems overcomplicated and I'd 
> > like to avoid that.
> > 
> > we could of course separate the query handler implementation from the 
> > core but since we only have one implementation I don't see how that is 
> > useful.
> > 
> >> Otherwise, you might want to use the maven dependency plugin to copy the
> >> required classes into the destination location at build time.
> > 
> > sounds like an interesting option.
> > 
> >> BTW: What is the use of a spellchecker in an infrastructure component
> >> like a JCR repository ?
> > 
> > because that's the only reasonable way to implement spell checking of 
> > fulltext query statements. just using an external dictionary results in 
> > bad suggestions and it is not able to consider content that is actually 
> > present in the repository.
> > 
> > regards
> >  marcel
> > 
>

Re: spellchecker

Posted by Marcel Reutegger <ma...@gmx.net>.

ok, after some re-thinking I will rather create a new contrib and introduce a 
spellchecker interface in jackrabbit-core (well, I intended to do that 
anyway...). the jackrabbit-spellchecker (?) will then be similar to the 
wordnet-synonyms contrib. if you want the spellchecker functionality you need to 
configure it and deploy the relevant jar files.

regards
  marcel

Marcel Reutegger wrote:
> Felix Meschberger wrote:
>> Why does this extension have to reside in the Jackrabbit core ? If at
>> all possible, it should also be made an add-on/extension to the core.
>> This way adding the extension would also require addition of the
>> dependencies and therefore copying/refactoring is not required.
> 
> the spellchecker would be an integral part of the lucene query handler 
> implementation, which is contained in jackrabbit-core. if the 
> spellchecker is an optional add-on to the query handler you have to 
> configure it when you want to use it. that seems overcomplicated and I'd 
> like to avoid that.
> 
> we could of course separate the query handler implementation from the 
> core but since we only have one implementation I don't see how that is 
> useful.
> 
>> Otherwise, you might want to use the maven dependency plugin to copy the
>> required classes into the destination location at build time.
> 
> sounds like an interesting option.
> 
>> BTW: What is the use of a spellchecker in an infrastructure component
>> like a JCR repository ?
> 
> because that's the only reasonable way to implement spell checking of 
> fulltext query statements. just using an external dictionary results in 
> bad suggestions and it is not able to consider content that is actually 
> present in the repository.
> 
> regards
>  marcel
>

Re: spellchecker

Posted by Marcel Reutegger <ma...@gmx.net>.

Felix Meschberger wrote:
> Why does this extension have to reside in the Jackrabbit core ? If at
> all possible, it should also be made an add-on/extension to the core.
> This way adding the extension would also require addition of the
> dependencies and therefore copying/refactoring is not required.

the spellchecker would be an integral part of the lucene query handler 
implementation, which is contained in jackrabbit-core. if the spellchecker is an 
optional add-on to the query handler you have to configure it when you want to 
use it. that seems overcomplicated and I'd like to avoid that.

we could of course separate the query handler implementation from the core but 
since we only have one implementation I don't see how that is useful.

> Otherwise, you might want to use the maven dependency plugin to copy the
> required classes into the destination location at build time.

sounds like an interesting option.

> BTW: What is the use of a spellchecker in an infrastructure component
> like a JCR repository ?

because that's the only reasonable way to implement spell checking of fulltext 
query statements. just using an external dictionary results in bad suggestions 
and it is not able to consider content that is actually present in the repository.

regards
  marcel

Re: spellchecker

Posted by Felix Meschberger <fm...@gmail.com>.

Hi,

Why does this extension have to reside in the Jackrabbit core ? If at
all possible, it should also be made an add-on/extension to the core.
This way adding the extension would also require addition of the
dependencies and therefore copying/refactoring is not required.

Otherwise, you might want to use the maven dependency plugin to copy the
required classes into the destination location at build time.

BTW: What is the use of a spellchecker in an infrastructure component
like a JCR repository ?

Regards
Felix

On Monday, 01.10.2007, at 14:27 +0200, Marcel Reutegger wrote:
> Hi,
> 
> I'm about to write a spellchecker extension for the lucene query handler in 
> jackrabbit. I planned to use the lucene-spellchecker contrib, however I don't 
> want to introduce another dependency in the jackrabbit-core. because the 
> spellchecker contrib in lucene only includes a handful of classes I would prefer 
> to copy the classes and refactor them into the jackrabbit package space.
> 
> does anyone have a better idea how to handle this?
> 
> regards
>   marcel