Posted to dev@lucene.apache.org by Hayden Muhl <ha...@gmail.com> on 2015/07/13 23:28:02 UTC

Refactor JapaneseTokenizer to take a dictionary as a parameter

I'd like to make the JapaneseTokenizer a little more flexible by allowing
Solr users to supply their own dictionary via the JapaneseTokenizerFactory.
We looked into using the existing User Dictionary functionality, but it
didn't suit our use case. We compiled our own dictionary, and had to do a
bit of a kludge to get the JapaneseTokenizer to recognize it.

Here's what I'd like to do.

* Publish the Kuromoji tools as a JAR to Maven
* Refactor the various Dictionary classes so they can optionally load a
dictionary from the file system instead of always loading it from the
classpath. If no file system location is provided, fall back to the
default dictionary on the classpath.
* Move instantiation of the dictionary to the JapaneseTokenizerFactory and
pass the dictionary in as a parameter to the JapaneseTokenizer constructor
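
The fallback behavior in the second bullet could look roughly like this. This is only a sketch using plain java.io; DictionaryLoader and both parameter names are placeholders, not the actual Kuromoji loader API:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the proposed loading logic: prefer an explicit
// file system path, and fall back to the bundled classpath resource.
public class DictionaryLoader {

    /**
     * Opens the dictionary data stream. If {@code fileSystemPath} is
     * non-null, the file at that path is used; otherwise the default
     * resource bundled on the classpath is opened.
     */
    public static InputStream openDictionary(String fileSystemPath,
                                             String classpathResource)
            throws IOException {
        if (fileSystemPath != null) {
            return new FileInputStream(fileSystemPath);
        }
        InputStream in = DictionaryLoader.class
                .getResourceAsStream(classpathResource);
        if (in == null) {
            throw new IOException("Missing classpath resource: "
                    + classpathResource);
        }
        return in;
    }
}
```

The point is just that the classpath stays the default, so existing users see no behavior change.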

I've looked into the code, and this seems like a manageable change, but I
want to make sure I'm not breaking anything.

Currently the Dictionary classes maintain their own singleton instances in
static variables. It seems to me it would be better for the
JapaneseTokenizerFactory to hold an instance of the dictionary and pass
it to any JapaneseTokenizer it creates. Was there a reason for using a
singleton pattern for the various Dictionary classes, or can this be
changed?
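
The shape of that refactor is roughly the following. All class and method names here are illustrative stand-ins, not the real Lucene API:

```java
// Illustrative only: a factory that loads one dictionary instance up
// front and shares it with every tokenizer it creates, instead of the
// tokenizer reaching for a static singleton itself.
public class TokenizerFactorySketch {

    /** Stand-in for the real dictionary type. */
    public static class Dictionary {
        private final String source;
        public Dictionary(String source) { this.source = source; }
        public String getSource() { return source; }
    }

    /** Stand-in for a JapaneseTokenizer taking a dictionary parameter. */
    public static class Tokenizer {
        private final Dictionary dict;
        public Tokenizer(Dictionary dict) { this.dict = dict; }
        public Dictionary getDictionary() { return dict; }
    }

    private final Dictionary dict;

    public TokenizerFactorySketch(String dictionarySource) {
        // Loaded once when the factory is initialized...
        this.dict = new Dictionary(dictionarySource);
    }

    public Tokenizer create() {
        // ...and passed to every tokenizer instead of a static lookup.
        return new Tokenizer(dict);
    }
}
```

The dictionary is still loaded only once per factory, so the memory footprint should match the singleton approach while allowing different factories to use different dictionaries.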

Is there any objection to publishing the Kuromoji tools to Maven? They were
very easy to compile and use. Packaging them up as a JAR file was simple,
but I will need a bit of direction as to how to do this within the current
conventions for Lucene's build.xml files.

- Hayden

Re: Refactor JapaneseTokenizer to take a dictionary as a parameter

Posted by Upayavira <uv...@odoko.co.uk>.
Where do you get the impression that the dictionary must come from the
classpath? ResourceLoader is a Lucene construct, and Solr provides its
own implementation that looks in the core's conf directory (on the file
system in non-cloud mode, or in ZooKeeper in cloud mode). Follow the
resource loading patterns that Lucene provides for maximum usability.
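
Put differently, a factory can stay agnostic about where the bytes live by going through a loader abstraction. Here is a simplified stand-in for that ResourceLoader idea (the real interface lives in Lucene's analysis module; the interface and classes below are just for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class ResourceLoadingSketch {

    /** Simplified stand-in for Lucene's ResourceLoader interface. */
    public interface Loader {
        InputStream openResource(String name) throws IOException;
    }

    /** A loader backed by an in-memory map, imitating e.g. a conf dir. */
    public static class MapLoader implements Loader {
        private final Map<String, byte[]> files = new HashMap<>();
        public void put(String name, byte[] data) { files.put(name, data); }
        @Override
        public InputStream openResource(String name) throws IOException {
            byte[] data = files.get(name);
            if (data == null) throw new IOException("not found: " + name);
            return new ByteArrayInputStream(data);
        }
    }

    /** The factory asks the loader for its dictionary; it never needs to
     *  know whether that resolves to a file, ZooKeeper, or the classpath. */
    public static byte[] loadDictionary(Loader loader, String name)
            throws IOException {
        try (InputStream in = loader.openResource(name)) {
            return in.readAllBytes();
        }
    }
}
```

With this indirection, the "file system vs. classpath" question disappears from the tokenizer entirely; it becomes a property of whichever loader the container hands to the factory.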

Are the Kuromoji tools something new that you want to include in Solr?
If so, could their maintainer push them to Maven for Lucene to consume
via Ant/Ivy?

Upayavira
