Posted to pylucene-dev@lucene.apache.org by Bill Janssen <ja...@parc.com> on 2010/09/25 21:45:41 UTC

Bring out SmartChineseAnalyzer in PyLucene?

I'd like to be able to use the HMM-based Chinese Tokenizer in PyLucene,
available in 3.x as
org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer, apparently.

I don't see this in PyLucene 3.0.2.  Is this because it ends up in a
separate jar file that isn't part of the PyLucene build?

Bill

Re: Bring out SmartChineseAnalyzer in PyLucene?

Posted by Andi Vajda <va...@apache.org>.
On Mon, 27 Sep 2010, Robert Muir wrote:

> yeah i just tested, i get the same error in java if i do this:
> 
> Class.forName("org.apache.lucene.analysis.cn.smart.AnalyzerProfile");
> 
> So really we just need to clean up this analyzer some, because the data
> files are packaged in the jar and i think this AnalyzerProfile stuff is not
> needed and confusing.

Even better. In the meantime, I just added
    --exclude org.apache.lucene.analysis.cn.smart.AnalyzerProfile

and that works around the issue.
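In context, the relevant part of the jcc invocation assembled by PyLucene's
Makefile ends up looking roughly like this (other flags omitted, variable
names illustrative):

```make
# Illustrative fragment only -- the real invocation passes many more
# flags. Excluding the class keeps JCC from loading (and thereby
# statically initializing) AnalyzerProfile while generating wrappers.
JCCFLAGS=--jar $(SMARTCNA_JAR) \
         --exclude org.apache.lucene.analysis.cn.smart.AnalyzerProfile
```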

Thanks!

Andi..

Re: Bring out SmartChineseAnalyzer in PyLucene?

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Sep 27, 2010 at 1:35 AM, Andi Vajda <va...@apache.org> wrote:

>
> On Mon, 27 Sep 2010, Robert Muir wrote:
>
>>
>> On Sun, Sep 26, 2010 at 10:05 PM, Andi Vajda <va...@apache.org> wrote:
>>
>> Doing so causes this complaint to be emitted while building the
>> wrappers:
>>
>> WARNING: Can not find lexical dictionary directory!
>> WARNING: This will cause unpredictable exceptions in your application!
>> WARNING: Please refer to the manual to download the dictionaries.
>>
>> What's the trick to have the lexical dictionary directory found?
>> A quick glance at the javadocs [1] doesn't seem to say.
>>
>> Andi..
>>
>> [1] http://lucene.apache.org/java/3_0_2/api/contrib-smartcn/index.html
>>
>>
>> the way this analyzer loads resources is a bit hairy. take a look at
>> WordDictionary.getInstance() for example.
>> so when this is called, it first checks, in this order:
>> * from inside the jar itself: resources/.../hmm/coredict.mem [this should
>> always succeed!]
>> * from the AnalyzerProfile thing, which is what's emitting the error.
>>
>>       try {
>>         singleInstance.load();
>>       } catch (IOException e) {
>>         String wordDictRoot = AnalyzerProfile.ANALYSIS_DATA_DIR;
>>         singleInstance.load(wordDictRoot);
>>
>> So when you are building the wrappers, is it just that you are causing
>> java
>> to load this AnalyzerProfile some other way manually? because as soon as
>> you try to load AnalyzerProfile it's going to emit these
>> warnings...
>> if pylucene is just loading this class itself (to make it accessible via
>> python), i think this is just harmless?
>>
>
> Yes, that's most likely it. JCC loads each and every public class (unless
> told not to) in order to generate wrappers for it. If this AnalyzerProfile
> class is not necessary from Python, it should be put on the --exclude list.
>

yeah i just tested, i get the same error in java if i do this:

Class.forName("org.apache.lucene.analysis.cn.smart.AnalyzerProfile");

So really we just need to clean up this analyzer some, because the data
files are packaged in the jar and i think this AnalyzerProfile stuff is not
needed and confusing.

-- 
Robert Muir
rcmuir@gmail.com

Re: Bring out SmartChineseAnalyzer in PyLucene?

Posted by Andi Vajda <va...@apache.org>.
On Mon, 27 Sep 2010, Robert Muir wrote:

> 
> 
> On Sun, Sep 26, 2010 at 10:05 PM, Andi Vajda <va...@apache.org> wrote:
> 
> Doing so causes this complaint to be emitted while building the
> wrappers:
> 
> WARNING: Can not find lexical dictionary directory!
> WARNING: This will cause unpredictable exceptions in your application!
> WARNING: Please refer to the manual to download the dictionaries.
> 
> What's the trick to have the lexical dictionary directory found?
> A quick glance at the javadocs [1] doesn't seem to say.
> 
> Andi..
> 
> [1] http://lucene.apache.org/java/3_0_2/api/contrib-smartcn/index.html
> 
> 
> the way this analyzer loads resources is a bit hairy. take a look at
> WordDictionary.getInstance() for example.
> so when this is called, it first checks, in this order:
> * from inside the jar itself: resources/.../hmm/coredict.mem [this should
> always succeed!]
> * from the AnalyzerProfile thing, which is what's emitting the error.
>
>       try {
>         singleInstance.load();
>       } catch (IOException e) {
>         String wordDictRoot = AnalyzerProfile.ANALYSIS_DATA_DIR;
>         singleInstance.load(wordDictRoot);
> 
> So when you are building the wrappers, is it just that you are causing java
> to load this AnalyzerProfile some other way manually? 
> because as soon as you try to load AnalyzerProfile it's going to emit these
> warnings...
> if pylucene is just loading this class itself (to make it accessible via
> python), i think this is just harmless?

Yes, that's most likely it. JCC loads each and every public class (unless 
told not to) in order to generate wrappers for it. If this AnalyzerProfile 
class is not necessary from Python, it should be put on the --exclude list.

Thanks for the tip !

Andi..

Re: Bring out SmartChineseAnalyzer in PyLucene?

Posted by Robert Muir <rc...@gmail.com>.
On Sun, Sep 26, 2010 at 10:05 PM, Andi Vajda <va...@apache.org> wrote:

>
> Doing so causes this complaint to be emitted while building the wrappers:
>
> WARNING: Can not find lexical dictionary directory!
> WARNING: This will cause unpredictable exceptions in your application!
> WARNING: Please refer to the manual to download the dictionaries.
>
> What's the trick to have the lexical dictionary directory found?
> A quick glance at the javadocs [1] doesn't seem to say.
>
> Andi..
>
> [1] http://lucene.apache.org/java/3_0_2/api/contrib-smartcn/index.html
>

the way this analyzer loads resources is a bit hairy. take a look at
WordDictionary.getInstance() for example.

so when this is called, it first checks, in this order:
* from inside the jar itself: resources/.../hmm/coredict.mem [this should
always succeed!]
* from the AnalyzerProfile thing, which is what's emitting the error.

      try {
        singleInstance.load();
      } catch (IOException e) {
        String wordDictRoot = AnalyzerProfile.ANALYSIS_DATA_DIR;
        singleInstance.load(wordDictRoot);

So when you are building the wrappers, is it just that you are causing java
to load this AnalyzerProfile some other way manually?
because as soon as you try to load AnalyzerProfile it's going to emit these
warnings...
if pylucene is just loading this class itself (to make it accessible via
python), i think this is just harmless?
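to make the order concrete, here's a stripped-down sketch of the same
fallback pattern (illustrative only, not the actual lucene code -- the
resource path, class, and method names here are made up):

```java
// Illustrative sketch of the two-step lookup: try the dictionary
// bundled in the jar first, and only on failure consult an external
// data directory (the role AnalyzerProfile plays in the real analyzer).
import java.io.IOException;
import java.io.InputStream;

public class DictLoader {

    static String load(String fallbackDir) throws IOException {
        // Step 1: the dictionary packaged inside the jar itself;
        // in the real analyzer this "should always succeed".
        InputStream in = DictLoader.class.getResourceAsStream("/hmm/coredict.mem");
        if (in != null) {
            in.close();
            return "from-jar";
        }
        // Step 2: only reached on failure -- this is the branch where
        // AnalyzerProfile's warnings come into play.
        if (fallbackDir == null || fallbackDir.isEmpty()) {
            throw new IOException("no lexical dictionary found");
        }
        return "from-dir:" + fallbackDir;
    }

    public static void main(String[] args) throws IOException {
        // No such resource exists on this classpath, so the call
        // falls through to the external-directory branch.
        System.out.println(load("/tmp/smartcn-data"));
    }
}
```

the point being: as long as the jar's bundled resource is found, the
fallback (and its warning) is never exercised at analysis time.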

-- 
Robert Muir
rcmuir@gmail.com

Re: Bring out SmartChineseAnalyzer in PyLucene?

Posted by Andi Vajda <va...@apache.org>.
On Sat, 25 Sep 2010, Bill Janssen wrote:

> Right now you've got this one:
>
> ANALYZERS_JAR=$(LUCENE)/build/contrib/analyzers/common/lucene-analyzers-$(LUCENE_VER).jar
>
> How about adding:
>
> SMARTCNA_JAR=$(LUCENE)/build/contrib/analyzers/smartcn/lucene-smartcn-$(LUCENE_VER).jar
>
> and then adding SMARTCNA_JAR to the list of jars?

Doing so causes this complaint to be emitted while building the wrappers:

WARNING: Can not find lexical dictionary directory!
WARNING: This will cause unpredictable exceptions in your application!
WARNING: Please refer to the manual to download the dictionaries.

What's the trick to have the lexical dictionary directory found?
A quick glance at the javadocs [1] doesn't seem to say.

Andi..

[1] http://lucene.apache.org/java/3_0_2/api/contrib-smartcn/index.html

Re: Bring out SmartChineseAnalyzer in PyLucene?

Posted by Bill Janssen <ja...@parc.com>.
Right now you've got this one:

ANALYZERS_JAR=$(LUCENE)/build/contrib/analyzers/common/lucene-analyzers-$(LUCENE_VER).jar

How about adding:

SMARTCNA_JAR=$(LUCENE)/build/contrib/analyzers/smartcn/lucene-smartcn-$(LUCENE_VER).jar

and then adding SMARTCNA_JAR to the list of jars?

While we're at it, I'd also like to get the SpatialLucene extensions in there:

SPATIAL_JAR=$(LUCENE)/build/contrib/spatial/lucene-spatial-$(LUCENE_VER).jar
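Concretely, after defining those variables, I imagine the hookup is just
something like this (I'm guessing at the name of the variable that collects
the jars handed to jcc):

```make
# Hypothetical -- the actual list variable in PyLucene's Makefile may
# be named differently.
JARS+=$(SMARTCNA_JAR) $(SPATIAL_JAR)
```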

Thanks!

Bill

Re: Bring out SmartChineseAnalyzer in PyLucene?

Posted by Andi Vajda <va...@apache.org>.
On Sep 25, 2010, at 12:45, Bill Janssen <ja...@parc.com> wrote:

> I'd like to be able to use the HMM-based Chinese Tokenizer in  
> PyLucene,
> available in 3.x as
> org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer, apparently.
>
> I don't see this in PyLucene 3.0.2.  Is this because it ends up in a
> separate jar file that isn't part of the PyLucene build?

Probably. If you have that jar file ready, it should be trivial to add  
it to the jcc call in PyLucene's Makefile.

Andi..

>
> Bill