Posted to dev@lucene.apache.org by Robert Muir <rc...@gmail.com> on 2010/05/19 15:51:00 UTC

solr and analyzers module

Hello,

I am doing some work to shuffle things around and consolidate
analyzers into what will hopefully be its own versioned module (such
that you could use an older version with a newer Lucene core and we
could remove "fake" Version and use real jar file versions).

For a while I have been thinking about how we might apply this to
Solr, so it gets the same benefit. There are also other "problems"
with analysis in Solr I would like to fix at the same time:

1. Solr, like Lucene, should be able to work with an older analyzers
module for backwards compatibility purposes.
2. Solr users should optionally be able to use analyzers that are not
in common (smartcn, stempel, icu, ...) easily. Currently this is a
tradeoff against the size of the solr war file (so they are not
included). At the same time it seems silly to make solr contribs for
'more analyzers'.

The current idea I have is that Solr would not include
analyzers-common.jar bundled into its war file at all. Instead, all
analyzers modules would also serve as plugins to Solr (you stick them
in solrhome/lib).  By default, Solr would just include
analyzers-common this way, instead of in the war file itself.

So with this idea, analyzers are just a Solr plugin, and the default
Solr install includes the ones it does today, so most users would not
see the difference. But if a user wants Polish, Smart Chinese, or
improved Unicode support, they would be able to drop in one of the
additional analyzer modules easily.

The factories for Solr serve as a buffer to hide the implementation
details, and I think they should be part of these analyzer modules, so
when you produce an analyzers artifact it is both a plugin to Lucene
and also a plugin to Solr. In my opinion, this factory interface is
very well defined and achieves for Solr <-> analyzers what we want to
achieve for Lucene <-> analyzers, a minimal interface.

Down the road, we could look at improving on this further, for example
any given release of analyzers artifacts could include additional
artifacts that "go with it":
1. example configuration files like stopwords lists for different languages
2. example schema definitions (even snippets) for Solr users as a
documentation artifact, so they know how to use this stuff.
...

Thoughts, alternative proposals?

-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: solr and analyzers module

Posted by Chris Hostetter <ho...@fucit.org>.
: > FWIW: the other thing you may not be aware of is that schema.xml has
: > always had a "version" attribute on the top level <schema/> declaration
	...
: You are right, I was unaware of this. But I'm confused that it's currently 1.3.

it's always been completely independent of the Solr version.  it's the 
"Schema version"

: To deal with improved defaults in this new modularized world, I feel
: that we just shouldn't have so many concrete Analyzers in java that
: really should be "examples"
	...
: examples of how to do analysis for Solr users. I'd like to be able to
: just have these, and get rid of the concrete Java implementations
: entirely.

+1 ... on the other hand, the best way to ensure that "examples" work is 
to compile & test them, so ... damned if we do, damned if we don't.


-Hoss




Re: solr and analyzers module

Posted by Robert Muir <rc...@gmail.com>.
On Thu, May 20, 2010 at 3:07 PM, Chris Hostetter
<ho...@fucit.org> wrote:
> FWIW: the other thing you may not be aware of is that schema.xml has
> always had a "version" attribute on the top level <schema/> declaration
> that also dictates some default behavior.  For example: initially all solr
> fields were multiValued by default, but if you have version="1.1" (or
> higher) then the "default" value for the multiValued property of fields
> changes.
>
> it's one more piece of flexibility that the factories have when deciding
> what tokenizer/filter to produce, with the goal of backwards compatibility
> (but sensible defaults)

You are right, I was unaware of this. But I'm confused that it's currently 1.3.

I guess what I am looking for overall is something more like this:
* Version is everywhere in analyzers but mostly for two reasons:
bug-back-compat, and defaults.
* The analyzers module has its own version numbers (so you drop in an
old one for the same bytecode=backcompat).
* This isn't a free-for-all: we try to implement backwards
compatibility, at least across minor releases, in the sense that we
just don't go hog-wild, to keep upgrades easy.
* At the same time, the concept of bug-back-compat goes away: if we
fix a bug, we fix a bug. If you want precise back compat you use the
old jar file.

The issue of improving defaults generally sits inside the actual Lucene
Analyzer (or the equivalent "schema example" for Solr), but usually
not in the TokenStreams themselves.

To deal with improved defaults in this new modularized world, I feel
that we just shouldn't have so many concrete Analyzers in Java that
really should be "examples".

Instead, an idea is that as we migrate the Solr analysis factories to
the analyzer module, we also produce "artifacts" which are starter
examples of how to do analysis for Solr users. I'd like to be able to
just have these, and get rid of the concrete Java implementations
entirely.

Perhaps for ease of use, a Java user could actually get an "Analyzer"
at runtime from one of these example definitions (and use this
declarative mechanism in a non-Solr app too).
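
As a rough illustration of getting an analysis chain at runtime from a
declarative definition, here is a minimal Java sketch. Everything in it
is hypothetical: SketchDeclarativeChain, the step names "lowercase" and
"dropSingleChars", and the whitespace splitting are stand-ins, not
Lucene or Solr APIs. A real version would look up factories shipped in
the analyzers module instead of a hard-coded map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names are Lucene or Solr classes.
final class SketchDeclarativeChain {
    // Registry of named filter steps; a real module would ship factories instead.
    private static final Map<String, UnaryOperator<List<String>>> STEPS = new HashMap<>();

    static {
        STEPS.put("lowercase", tokens -> tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList()));
        STEPS.put("dropSingleChars", tokens -> tokens.stream()
                .filter(t -> t.length() > 1)
                .collect(Collectors.toList()));
    }

    // Splits on whitespace, then applies the named steps in the declared order.
    static List<String> analyze(String text, List<String> stepNames) {
        List<String> tokens = new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
        for (String name : stepNames) {
            UnaryOperator<List<String>> step = STEPS.get(name);
            if (step == null) {
                throw new IllegalArgumentException("Unknown analysis step: " + name);
            }
            tokens = step.apply(tokens);
        }
        return tokens;
    }
}
```

The list of step names plays the role of the example definition, so the
same definition could be read by Solr or by a plain Java application.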

They should still be able to extend Analyzer themselves like they do
today, but I don't think we should provide examples in two different
languages: Java code and XML.

I feel this would give us enough flexibility to get rid of Version,
and at the same time keep the API consistent and make upgrading easy.

(This still isn't a concrete proposal at all, just some random thoughts)

-- 
Robert Muir
rcmuir@gmail.com



Re: solr and analyzers module

Posted by Chris Hostetter <ho...@fucit.org>.
: All tokenstreams and tokenfilters will of course have backwards
: compatibility across minor releases at least, but these factories give
: us some additional flexibility in how we preserve that backwards
: compatibility for Solr.
: 
: For example in a major release we might remove a deprecated
: TokenStream altogether, but by having the factories also, we can keep
: continuity for Solr schemas (just change the implementation).

right -- another thing the factories have given us in the past is the 
ability to add "optional" args to the factory declarations, that (when 
absent) default to the legacy behavior, but in all example configs we 
"suggest" a setting that provides "better" behavior -- since these are 
factory settings, they can even change the underlying 
Tokenizer/TokenFilter that is produced by the factory.
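
A minimal Java sketch of that pattern, under stated assumptions: the
SketchFilter interface, the SketchFoldingFactory class, and the
"foldCase" argument are all hypothetical names invented for
illustration, not existing Solr factories or settings.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a token filter; not a Lucene or Solr class.
interface SketchFilter {
    String apply(String token);
}

// Hypothetical factory; "foldCase" is an illustrative optional argument.
final class SketchFoldingFactory {
    private final boolean foldCase;

    SketchFoldingFactory(Map<String, String> args) {
        // Absent argument falls back to the legacy behavior (no folding),
        // so schemas written before the option existed behave as before.
        this.foldCase = Boolean.parseBoolean(args.getOrDefault("foldCase", "false"));
    }

    SketchFilter create() {
        // The factory is free to return a different implementation per setting.
        if (foldCase) {
            return token -> token.toLowerCase(Locale.ROOT);
        }
        return token -> token;
    }
}
```

An example config would then "suggest" foldCase="true", while a schema
that omits the argument keeps getting the old filter.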

FWIW: the other thing you may not be aware of is that schema.xml has 
always had a "version" attribute on the top level <schema/> declaration 
that also dictates some default behavior.  For example: initially all solr 
fields were multiValued by default, but if you have version="1.1" (or 
higher) then the "default" value for the multiValued property of fields 
changes.

it's one more piece of flexibility that the factories have when deciding 
what tokenizer/filter to produce, with the goal of backwards compatibility 
(but sensible defaults)
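
The version-dependent default could be sketched in Java roughly like
this. SketchSchemaDefaults is a hypothetical helper, not the actual
IndexSchema code, and it assumes the changed default for version 1.1
and higher is single-valued.

```java
// Hypothetical helper; the real IndexSchema logic is more involved.
final class SketchSchemaDefaults {
    private final float schemaVersion;

    SketchSchemaDefaults(String versionAttribute) {
        // The "version" attribute of the top level <schema/> element, e.g. "1.1".
        this.schemaVersion = Float.parseFloat(versionAttribute);
    }

    // explicitSetting is null when the field does not declare multiValued itself.
    boolean multiValued(Boolean explicitSetting) {
        if (explicitSetting != null) {
            return explicitSetting;
        }
        // Before schema version 1.1 every field was multiValued by default;
        // assumed here: from 1.1 on the default is single-valued.
        return schemaVersion < 1.1f;
    }
}
```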



-Hoss




Re: solr and analyzers module

Posted by Robert Muir <rc...@gmail.com>.
On Wed, May 19, 2010 at 4:57 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : 1. Solr, like Lucene, should be able to work with an older analyzers
> : module for backwards compatibility purposes.
>
> While I don't disagree with you, Solr "philosophy" has generally
> discouraged the use of "Analyzer" classes in favor of more discrete
> Tokenizer & TokenFilter pieces --

I am sorry for the confusing terminology I used. When I say "analyzer"
for Solr, I refer to the one created from the schema definition.

I should have said, "analysis components".

In this model, if the factories and the tokenstreams are in a versioned
jar file, it's going to produce the same results they had before (thus
backwards compat).

I didn't mean to refer to actually instantiating a definition from a
Lucene Analyzer proper, but this would also work, because it's in
the same versioned jar file... in short, all analysis components in
one place.

> Just to be clear: what you are suggesting is that module-analyzer-XXX.jar
> artifact of modules/analyzers/XXX should not only contain the Tokenizers &
> TokenFilters that relate to XXX, but also the Factories solr expects to
> initialize them -- so a user only needs to add that
> module-analyzer-XXX.jar to their Solr lib dir to get all the
> functionality, instead of needing module-analyzer-XXX.jar plus some
> solr-analyzer-XXX-glue.jar
>
>        ...am i understanding that correctly?

Yes, this is what I propose.

All tokenstreams and tokenfilters will of course have backwards
compatibility across minor releases at least, but these factories give
us some additional flexibility in how we preserve that backwards
compatibility for Solr.

For example in a major release we might remove a deprecated
TokenStream altogether, but by having the factories also, we can keep
continuity for Solr schemas (just change the implementation).

They also give us the capability to emit warnings and such about
deprecations, and other things that we need.
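
A small Java sketch of what I mean, with hypothetical names throughout:
SketchRetiredFilterFactory stands for a factory whose original
TokenStream was removed, and lowercasing stands in for whatever the
replacement implementation would do. Collecting warnings in a list is
only for illustration; a real factory would use the logging in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical factory that keeps its schema-facing name after the
// TokenStream it used to create has been removed.
final class SketchRetiredFilterFactory {
    private final List<String> warnings = new ArrayList<>();

    // Existing schemas still reference this factory, so it delegates to
    // the replacement implementation and records a deprecation warning.
    UnaryOperator<String> create() {
        warnings.add("SketchRetiredFilterFactory is deprecated; "
                + "it now delegates to the replacement filter");
        return token -> token.toLowerCase(Locale.ROOT);
    }

    // Defensive copy so callers cannot modify the recorded warnings.
    List<String> warnings() {
        return new ArrayList<>(warnings);
    }
}
```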

-- 
Robert Muir
rcmuir@gmail.com



Re: solr and analyzers module

Posted by Chris Hostetter <ho...@fucit.org>.
: 1. Solr, like Lucene, should be able to work with an older analyzers
: module for backwards compatibility purposes.

While I don't disagree with you, Solr "philosophy" has generally 
discouraged the use of "Analyzer" classes in favor of more discrete 
Tokenizer & TokenFilter pieces -- direct support for Analyzers is mainly a 
result of wanting to allow an easy transition for people that already 
have custom analyzers they wrote for direct use in Lucene.  The more fine 
grained analysis chain approach that Solr encourages makes it easier for 
people to debug what is going on, and allows for more customization of the 
individual stages of the "Analyzer" that gets built on the fly.

That said: if we can make it easier to use Analyzers, I'm all for it -- I 
just don't want to set things up in a way that people choose to use 
XyzAnalyzer from an analyzer module, when they could get the exact same 
behavior by chaining together XTokenizer, YTokenFilter, and ZTokenFilter 
(from the same module) and in the latter case have more transparent 
debugging and fine grained configuration controls.

: So with this idea, analyzers are just a Solr plugin, and the default
: Solr install includes the ones it does today, so most users would not
: see the difference. But if a user wants Polish, Smart Chinese, or
: improved Unicode support, they would be able to drop in one of the
: additional analyzer modules easily.
: 
: The factories for Solr serve as a buffer to hide the implementation
: details, and I think they should be part of these analyzer modules, so

Just to be clear: what you are suggesting is that module-analyzer-XXX.jar 
artifact of modules/analyzers/XXX should not only contain the Tokenizers & 
TokenFilters that relate to XXX, but also the Factories solr expects to 
initialize them -- so a user only needs to add that 
module-analyzer-XXX.jar to their Solr lib dir to get all the 
functionality, instead of needing module-analyzer-XXX.jar plus some 
solr-analyzer-XXX-glue.jar

	...am i understanding that correctly?

I'm all in favor of this -- anticipating that some of the stuff in 
IndexSchema might eventually get "promoted" up into a Lucene 
contrib/module is the key reason why we made sure a few years back to 
prevent letting FieldTypes/TokenizerFactories/TokenFilterFactories be 
"aware" of the SolrCore or the IndexSchema classes -- instead all they are 
allowed to know about is the concept of a "ResourceLoader" for accessing 
external file resources (ie: via a classpath and/or effective directory).

So refactoring the factory APIs + the ResourceLoader into a new module 
should be relatively straightforward (knock on wood)
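
To make the decoupling concrete, here is a rough Java sketch. The 
SketchResourceLoader interface and SketchStopFactory class are 
hypothetical simplifications written for this example (the "words" 
argument and the "stopwords.txt" default are assumptions too); they are 
not the actual Solr signatures.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified loader: the only thing a factory may depend on.
interface SketchResourceLoader {
    List<String> getLines(String resourceName);
}

// Hypothetical factory that knows nothing about SolrCore or IndexSchema.
final class SketchStopFactory {
    private final String wordsFile;
    private Set<String> stopWords = new HashSet<>();

    SketchStopFactory(Map<String, String> args) {
        // "words" names the external resource holding one stopword per line.
        this.wordsFile = args.getOrDefault("words", "stopwords.txt");
    }

    // Called once with whatever loader the host (Solr or plain Lucene) provides.
    void inform(SketchResourceLoader loader) {
        stopWords = new HashSet<>(loader.getLines(wordsFile));
    }

    // True when the token is not a stopword and should stay in the stream.
    boolean keeps(String token) {
        return !stopWords.contains(token);
    }
}
```

Because the factory only sees the loader, the same jar can be dropped 
into a Solr lib dir or used from a plain Java app that supplies its own 
loader.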

: 2. example schema definitions (even snippets) for Solr users as a
: documentation artifact, so they know how to use this stuff.

+1


-Hoss




RE: solr and analyzers module

Posted by ka...@nokia.com.
Nobody in their right mind can disagree with (1).  I should also point out that writing a custom analyzer is a very typical activity (as is a custom scorer), so this should be made as straightforward as possible.

Karl

