Posted to dev@lucene.apache.org by Che Dong <ch...@hotmail.com> on 2002/09/08 07:48:22 UTC

fixed url and How to contribute code to lucene sandbox?

http://www.chedong.com/tech/lucene.html

I have fixed the reference URL; it now points to:
http://jakarta.apache.org/lucene/

BTW:
How do I contribute code to the Lucene sandbox?


Che, Dong

----- Original Message ----- 
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Sunday, September 08, 2002 12:01 AM
Subject: Re: Lucene introduction in Chinese


> Thank you for this.
> I think we should add this to the contribution page or some other place
> on the Lucene site (I'll take a look in a bit).
> I would like to just add a link to it.
> 
> Note: the link to Lucene's home page at the bottom of the page is
> wrong: http://jakarta.apache.org/Lucene/
>  should be
> http://jakarta.apache.org/lucene/
> 
> Thanks,
> Otis
> 
> 


Re: about bigram based word segment

Posted by Herman Chen <hc...@intumit.com>.
I think there's another flaw with the bigram approach when the query
consists of 3+ characters: a query of w1w2w3 would also match text such
as w1w2w4w2w3, which never contains w1w2w3.  Currently I do unigram
tokenization and run automatic phrase queries for CJK searches, but
performance could take a hit in large-scale situations.
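
A minimal sketch of the false match described above, in plain Java with no Lucene classes. The class and method names are made up for illustration, the letters A, B, C, D stand in for the characters w1, w2, w3, w4, and the matching is a simple "all query bigrams must occur" check that ignores positions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BigramFalsePositive {
    // Split text into overlapping two-character tokens: "ABC" -> [AB, BC].
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // AND semantics without positions: every query bigram must occur
    // somewhere in the document.
    static boolean matchesAllBigrams(String doc, String query) {
        Set<String> docTokens = new HashSet<>(bigrams(doc));
        return docTokens.containsAll(bigrams(query));
    }

    public static void main(String[] args) {
        // The document "ABDBC" corresponds to w1w2w4w2w3.
        System.out.println(matchesAllBigrams("ABDBC", "ABC")); // true
        System.out.println("ABDBC".contains("ABC"));           // false
    }
}
```

The document yields the bigrams AB, BD, DB, BC, so both query bigrams (AB and BC) are found even though the three characters never appear together.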

----- Original Message -----
From: "Che Dong" <ch...@hotmail.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Friday, September 13, 2002 9:43 AM
Subject: about bigram based word segment


> > I don't know any Asian languages but from earlier experiments, I
> > remember that bigram tokenization could sometimes hurt matching, e.g.:
> >
> > w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
> > miss a search for w2. w1 w2 w3 would work better.
> >
> If Chinese text is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
> then searches for "w1w2" and "w2w1" will return the same results, won't they?
>
>
> With bigram-based word segmentation, "w1w2w3" => "w1w2" "w2w3",
> or even trigram-based word segmentation, "w1w2w3w4" => "w1w2w3" "w2w3w4",
> this character-sequence problem is avoided.
>
> According to the stats, bigram-based word segmentation returned the best
> results, but it needs QueryParser to combine query terms with an "AND"
> relation by default.
>
> You can try bigram-based word segmentation at http://search.163.com in the
> category search and news search (the web page search is powered by Google).
> Google's Chinese language analysis is provided by basistech, using
> dictionary-based word segmentation:
> http://www.basistech.com/products/language-analysis/cma.html
>
>
>
> Che, Dong
>
>
>
>


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: about bigram based word segment

Posted by Alex Murzaku <mu...@yahoo.com>.
--- Che Dong <ch...@hotmail.com> wrote:
> If Chinese text is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
> then searches for "w1w2" and "w2w1" will return the same results, won't
> they?

That wouldn't be the case if you quote the two characters (thereby
submitting a "phrase query"). But this discussion would be more
appropriate in the user group... 
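
To illustrate the difference, here is a small sketch in plain Java (no Lucene classes; the names are hypothetical). It treats each character as one unigram token at its own position, and compares an unordered "all terms present" match with a phrase-style match that requires consecutive positions in order.

```java
class UnigramPhraseDemo {
    // Unordered AND: each query character occurs somewhere in the document.
    static boolean matchesAllChars(String doc, String query) {
        for (int i = 0; i < query.length(); i++) {
            if (doc.indexOf(query.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    // Phrase semantics: the query characters must occur at consecutive
    // token positions, in the same order as in the query.
    static boolean matchesPhrase(String doc, String query) {
        for (int start = 0; start + query.length() <= doc.length(); start++) {
            boolean allEqual = true;
            for (int k = 0; k < query.length(); k++) {
                if (doc.charAt(start + k) != query.charAt(k)) {
                    allEqual = false;
                    break;
                }
            }
            if (allEqual) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A, B, C stand in for w1, w2, w3.
        System.out.println(matchesAllChars("ABC", "BA")); // true: order is lost
        System.out.println(matchesPhrase("ABC", "BA"));   // false
        System.out.println(matchesPhrase("ABC", "AB"));   // true
    }
}
```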

=====
__________________________________
alex@lissus.com -- http://www.lissus.com



about bigram based word segment

Posted by Che Dong <ch...@hotmail.com>.
> I don't know any Asian languages but from earlier experiments, I
> remember that bigram tokenization could sometimes hurt matching, e.g.:
> 
> w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
> miss a search for w2. w1 w2 w3 would work better.
> 
If Chinese text is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
then searches for "w1w2" and "w2w1" will return the same results, won't they?


With bigram-based word segmentation, "w1w2w3" => "w1w2" "w2w3",
or even trigram-based word segmentation, "w1w2w3w4" => "w1w2w3" "w2w3w4",
this character-sequence problem is avoided.
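
A rough sketch of the comparison, in plain Java rather than an actual Lucene tokenizer. The class name and the position-free "all terms present" matching are assumptions made only for this illustration; A, B, C, D stand in for w1, w2, w3, w4.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSegmenter {
    // Overlapping n-character tokens: ngrams("ABCD", 3) -> [ABC, BCD].
    // Text shorter than n yields no tokens in this sketch.
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            tokens.add(text.substring(i, i + n));
        }
        return tokens;
    }

    // AND semantics over n-gram terms, ignoring token positions.
    static boolean matchesAll(String doc, String query, int n) {
        return ngrams(doc, n).containsAll(ngrams(query, n));
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("ABC", "AB", 1)); // true
        System.out.println(matchesAll("ABC", "BA", 1)); // true: unigrams lose the order
        System.out.println(matchesAll("ABC", "AB", 2)); // true
        System.out.println(matchesAll("ABC", "BA", 2)); // false: bigram BA is not indexed
        System.out.println(ngrams("ABCD", 3));          // [ABC, BCD]
    }
}
```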

According to the stats, bigram-based word segmentation returned the best results, but it needs QueryParser to combine query terms with an "AND" relation by default.

You can try bigram-based word segmentation at http://search.163.com in the category search and news search (the web page search is powered by Google).
Google's Chinese language analysis is provided by basistech, using dictionary-based word segmentation:
http://www.basistech.com/products/language-analysis/cma.html



Che, Dong





Re: fixed url and How to contribute code to lucene sandbox?

Posted by Alex Murzaku <mu...@yahoo.com>.
I don't know any Asian languages but from earlier experiments, I
remember that bigram tokenization could sometimes hurt matching, e.g.:

w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
miss a search for w2. w1 w2 w3 would work better.
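
The point can be checked with a tiny sketch in plain Java (hypothetical names, no Lucene classes). It builds the padded bigram form mentioned above, using "_" as the boundary marker, and looks up a single-character term in each token list; A, B, C stand in for w1, w2, w3.

```java
import java.util.ArrayList;
import java.util.List;

class BigramRecallDemo {
    // One token per character: "ABC" -> [A, B, C].
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Bigrams over the text padded with "_" on both sides:
    // "ABC" -> [_A, AB, BC, C_].
    static List<String> paddedBigrams(String text) {
        String padded = "_" + text + "_";
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= padded.length(); i++) {
            tokens.add(padded.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(paddedBigrams("ABC"));               // [_A, AB, BC, C_]
        System.out.println(paddedBigrams("ABC").contains("B")); // false: the term B was never indexed
        System.out.println(unigrams("ABC").contains("B"));      // true
    }
}
```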

--- Doug Cutting <cu...@lucene.com> wrote:
> Che Dong wrote:
> > 2. CJK support:
> >     2.1 unigram-based (no word segmentation; just use one character as a token): modified from StandardTokenizer.java
> >     http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
> >     CJKTokenizer for Asian languages (Chinese, Japanese, Korean) word segmentation
> >     http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=450266
> >     StandardTokenizer with unigram-based CJK support
> >
> >     2.2 bigram-based word segmentation: modified from SimpleTokenizer to CJKTokenizer.java
> >     http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01220.html
>
> I think it would be great to have some support for Asian languages built
> into Lucene.  Which of these approaches do you think is best?  I like
> the idea of a StandardTokenizer or SimpleTokenizer that automatically
> provides this via bigrams.  What do others think?
> 
> Doug


=====
__________________________________
alex@lissus.com -- http://www.lissus.com



Re: fixed url and How to contribute code to lucene sandbox?

Posted by Doug Cutting <cu...@lucene.com>.
Che Dong wrote:
> 1. Custom sorting besides the default score sorting: make docID an alias for the field you want the output sorted by.
> This is solved by sorting the data before indexing (for example, by the field PostDate), so that docID becomes an alias for the sort field. We can then make the HitCollector
> sort by docID, by 1/docID, or even by a more complex strategy (docID * score)...
> http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115469
> IndexOrderSearcher: sort data before indexing and use 1/docID instead of score 

That's an interesting approach.  I don't recall ever seeing this message 
when it was originally posted.  Sorry.

I had imagined instead adding this functionality to Hits.java.  Having 
a different Searcher implementation makes it possible for folks to use 
MultiSearcher to combine results from an IndexSearcher and an 
IndexOrderSearcher, which would not make sense.  If the functionality 
instead resides in Hits.java, then it could not be misused in this way.

So the way I was going to do it was to add something to Hits.java like:
   public static final int ORDER_BY_SCORE = 1;
   public static final int ORDER_BY_DOC_NUM = 2;
   public void setHitOrdering(int order);

If ORDER_BY_SCORE is specified then Hits would work as it does now.  This 
would be the default.  But when ORDER_BY_DOC_NUM is specified then 
Hits.java would use a HitCollector to implement this ordering.
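
As a rough, self-contained sketch of the idea (this is not the actual Hits.java; the class, the nested Hit type and orderedDocs() are hypothetical, and collect() merely plays the role of a HitCollector callback):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HitOrderingSketch {
    static final int ORDER_BY_SCORE = 1;
    static final int ORDER_BY_DOC_NUM = 2;

    // Hypothetical stand-in for one collected hit; not a Lucene class.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private int order = ORDER_BY_SCORE; // default: rank by score
    private final List<Hit> collected = new ArrayList<>();

    void setHitOrdering(int order) {
        if (order != ORDER_BY_SCORE && order != ORDER_BY_DOC_NUM) {
            throw new IllegalArgumentException("unknown ordering: " + order);
        }
        this.order = order;
    }

    // Called once per matching document.
    void collect(int doc, float score) {
        collected.add(new Hit(doc, score));
    }

    // Document numbers in the currently selected order.
    List<Integer> orderedDocs() {
        List<Hit> sorted = new ArrayList<>(collected);
        if (order == ORDER_BY_DOC_NUM) {
            sorted.sort(Comparator.comparingInt((Hit h) -> h.doc));
        } else {
            sorted.sort((a, b) -> Float.compare(b.score, a.score)); // best score first
        }
        List<Integer> docs = new ArrayList<>();
        for (Hit h : sorted) {
            docs.add(h.doc);
        }
        return docs;
    }
}
```

Because the ordering is a setting on one object rather than a separate Searcher class, there is nothing to combine incorrectly through MultiSearcher.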

> 2. CJK support:
>     2.1 unigram-based (no word segmentation; just use one character as a token): modified from StandardTokenizer.java
>     http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
>     CJKTokenizer for Asian languages (Chinese, Japanese, Korean) word segmentation
>     http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=450266
>     StandardTokenizer with unigram-based CJK support
>
>     2.2 bigram-based word segmentation: modified from SimpleTokenizer to CJKTokenizer.java
>     http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01220.html

I think it would be great to have some support for Asian languages built 
into Lucene.  Which of these approaches do you think is best?  I like 
the idea of a StandardTokenizer or SimpleTokenizer that automatically 
provides this via bigrams.  What do others think?

Doug





Re: fixed url and How to contribute code to lucene sandbox?

Posted by Che Dong <ch...@hotmail.com>.
I checked what I have posted before:
http://nagoya.apache.org/eyebrowse/SearchList?listId=&listName=lucene-dev@jakarta.apache.org&searchText=Che&defaultField=sender&Search=Search


It falls mainly into two areas:

1. Custom sorting besides the default score sorting: make docID an alias for the field you want the output sorted by.
This is solved by sorting the data before indexing (for example, by the field PostDate), so that docID becomes an alias for the sort field. We can then make the HitCollector
sort by docID, by 1/docID, or even by a more complex strategy (docID * score)...
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115469
IndexOrderSearcher: sort data before indexing and use 1/docID instead of score 

2. CJK support: 
    2.1 unigram-based (no word segmentation; just use one character as a token): modified from StandardTokenizer.java
    http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
    CJKTokenizer for Asian languages (Chinese, Japanese, Korean) word segmentation
    http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=450266
    StandardTokenizer with unigram-based CJK support

    2.2 bigram-based word segmentation: modified from SimpleTokenizer to CJKTokenizer.java
    http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01220.html
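
To make item 1 concrete, here is a small sketch in plain Java. It is not the IndexOrderSearcher code from the linked message; the Post type and method names are hypothetical, the position in the sorted list stands in for the docID, and the "+ 1" in the pseudo-score is my own assumption to avoid dividing by zero for the first document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class IndexOrderSketch {
    // Hypothetical record to be indexed; postDate is an ISO date such as
    // "2002-09-08", so string order equals date order.
    static final class Post {
        final String title;
        final String postDate;

        Post(String title, String postDate) {
            this.title = title;
            this.postDate = postDate;
        }
    }

    // Sort by PostDate before indexing, so that the position in the
    // returned list (the would-be docID) rises with the date.
    static List<Post> sortForIndexing(List<Post> posts) {
        List<Post> sorted = new ArrayList<>(posts);
        sorted.sort(Comparator.comparing((Post p) -> p.postDate));
        return sorted;
    }

    // Pseudo-score in the spirit of 1/docID: earlier documents score higher.
    static float indexOrderScore(int docId) {
        return 1.0f / (docId + 1);
    }

    public static void main(String[] args) {
        List<Post> posts = new ArrayList<>();
        posts.add(new Post("second", "2002-09-08"));
        posts.add(new Post("third", "2002-09-13"));
        posts.add(new Post("first", "2002-09-07"));
        List<Post> indexed = sortForIndexing(posts);
        for (int docId = 0; docId < indexed.size(); docId++) {
            System.out.println(docId + " " + indexed.get(docId).title
                    + " " + indexOrderScore(docId));
        }
    }
}
```

Ranking by this pseudo-score returns the oldest posts first; ranking by docID in descending order returns the newest first.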
    
Thank you

I also have some suggestions, and I am working on a binding from the Lucene structure (Document, Field, Index) to XML. If we made a standard lucene.dtd the default Lucene input format, it might be useful for integrating applications with Lucene.


Che, Dong
----- Original Message ----- 
From: "Peter Carlson" <ca...@bookandhammer.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Sunday, September 08, 2002 2:08 PM
Subject: Re: fixed url and How to contribute code to lucene sandbox?


> I will add this to the contributions page.
> 
> --Peter
> On Saturday, September 7, 2002, at 10:48 PM, Che Dong wrote:
> 
> > http://www.chedong.com/tech/lucene.html
> >
> > fixed  reference url with:
> > http://jakarta.apache.org/lucene/
> >
> > BTW:
> > How to contribute code to lucene sandbox?
> >
> >
> > Che, Dong
> >
> > ----- Original Message -----
> > From: "Otis Gospodnetic" <ot...@yahoo.com>
> > To: "Lucene Developers List" <lu...@jakarta.apache.org>
> > Sent: Sunday, September 08, 2002 12:01 AM
> > Subject: Re: Lucene introduction in Chinese
> >
> >
> >> Thank you for this.
> >> I think we should add this to the contribution page or some other 
> >> place
> >> on the Lucene site (I'll take a look in a bit).
> >> I would like to just add a link to it.
> >>
> >> Note: the link to Lucene's home page at the bottom of the page is
> >> wrong: http://jakarta.apache.org/Lucene/
> >>  should be
> >> http://jakarta.apache.org/lucene/
> >>
> >> Thanks,
> >> Otis
> >>
> >>
> >
> 
> 
> 

Re: fixed url and How to contribute code to lucene sandbox?

Posted by Peter Carlson <ca...@bookandhammer.com>.
I will add this to the contributions page.

--Peter
On Saturday, September 7, 2002, at 10:48 PM, Che Dong wrote:

> http://www.chedong.com/tech/lucene.html
>
> fixed  reference url with:
> http://jakarta.apache.org/lucene/
>
> BTW:
> How to contribute code to lucene sandbox?
>
>
> Che, Dong
>
> ----- Original Message -----
> From: "Otis Gospodnetic" <ot...@yahoo.com>
> To: "Lucene Developers List" <lu...@jakarta.apache.org>
> Sent: Sunday, September 08, 2002 12:01 AM
> Subject: Re: Lucene introduction in Chinese
>
>
>> Thank you for this.
>> I think we should add this to the contribution page or some other 
>> place
>> on the Lucene site (I'll take a look in a bit).
>> I would like to just add a link to it.
>>
>> Note: the link to Lucene's home page at the bottom of the page is
>> wrong: http://jakarta.apache.org/Lucene/
>>  should be
>> http://jakarta.apache.org/lucene/
>>
>> Thanks,
>> Otis
>>
>>
>




Re: fixed url and How to contribute code to lucene sandbox?

Posted by Kelvin Tan <ke...@apache.org>.
Peter,

I agree Sandbox code should be APL. One of the primary reasons Sandbox was 
created was to avoid having to use the mailing list archives as a 
repository of sorts for contributions, which IMHO, kinda sucks.

The questions here are:

1) What about bits of APL one-class contributions? Do they go into Sandbox? 
If so, do their contributors become Sandbox committers? If not, who'd be 
responsible for maintaining them?
2) What about non-APL contributions? Is there an alternative to using the 
mail archives to provide access to them?

Regards,
Kelvin


On Sun, 8 Sep 2002 23:07:27 -0700, Peter Carlson wrote:
>I think that all the code in the sandbox should be APL. If not, then I
>would suggest people email their contribution and we create a link to
>it in a mail archive.
>This way we don't have to maintain it, but it is available without a
>special web server from the contributor.
>
>--Peter
>
>
>
>
>On Sunday, September 8, 2002, at 08:10 PM, Kelvin Tan wrote:
>
>>For code to be added to Sandbox, it also has to be APL.
>>
>>Otis, I suggest creating a space on Lucene's website for these ad-hoc
>>contribs. I know Sandbox was meant for this, but it's not reasonable to
>>expect everyone to APL their code. I'm willing to maintain this section if
>>necessary. Attachments can be emailed to me or to the list, and I'll add
>>them in. The alternative is we relax the requirement for Sandbox code to be
>>APL, or create a SF project for this stuff (ugh).
>>
>>Regards,
>>Kelvin
>>
>>
>>On Sun, 8 Sep 2002 19:57:17 -0700 (PDT), Otis Gospodnetic wrote:
>>>>BTW:
>>>>How to contribute code to lucene sandbox?
>>>
>>>You can just mail lucene-dev and attach your code.
>>>Is this regarding your other contributions?
>>>I haven't had the chance to really look at them yet :(
>>>
>>>Stuff that goes into Sandbox is usually a software component or a
>>>project.  We haven't really put any code snippets (e.g. single classes)
>>>in there.
>>>Maybe we should have a place to use as a repository for various code
>>>snippets that people contribute, that would otherwise get lost in the
>>>mailing list archives, I don't know.
>>>
>>>Otis
>>>
>>>







Re: fixed url and How to contribute code to lucene sandbox?

Posted by Peter Carlson <ca...@bookandhammer.com>.
I think that all the code in the sandbox should be APL. If not, then I 
would suggest people email their contribution and we create a link to 
it in a mail archive.
This way we don't have to maintain it, but it is available without a 
special web server from the contributor.

--Peter




On Sunday, September 8, 2002, at 08:10 PM, Kelvin Tan wrote:

> For code to be added to Sandbox, it also has to be APL.
>
> Otis, I suggest creating a space on Lucene's website for these ad-hoc
> contribs. I know Sandbox was meant for this, but it's not reasonable to
> expect everyone to APL their code. I'm willing to maintain this 
> section if
> necessary. Attachments can be emailed to me or to the list, and I'll 
> add
> them in. The alternative is we relax the requirement for Sandbox code 
> to be
> APL, or create a SF project for this stuff (ugh).
>
> Regards,
> Kelvin
>
>
> On Sun, 8 Sep 2002 19:57:17 -0700 (PDT), Otis Gospodnetic wrote:
>>> BTW:
>>> How to contribute code to lucene sandbox?
>>
>> You can just mail lucene-dev and attach your code.
>> Is this regarding your other contributions?
>> I haven't had the chance to really look at them yet :(
>>
>> Stuff that goes into Sandbox is usually a software component or a
>> project.  We haven't really put any code snippets (e.g. single
>> classes)
>> in there.
>> Maybe we should have a place to use as a repository for various code
>> snippets that people contribute, that would otherwise get lost in the
>> mailing list archives, I don't know.
>>
>> Otis
>>
>>




Re: fixed url and How to contribute code to lucene sandbox?

Posted by Kelvin Tan <ke...@apache.org>.
For code to be added to Sandbox, it also has to be APL.

Otis, I suggest creating a space on Lucene's website for these ad-hoc 
contribs. I know Sandbox was meant for this, but it's not reasonable to 
expect everyone to APL their code. I'm willing to maintain this section if 
necessary. Attachments can be emailed to me or to the list, and I'll add 
them in. The alternative is we relax the requirement for Sandbox code to be 
APL, or create a SF project for this stuff (ugh).

Regards,
Kelvin


On Sun, 8 Sep 2002 19:57:17 -0700 (PDT), Otis Gospodnetic wrote:
>>BTW:
>>How to contribute code to lucene sandbox?
>
>You can just mail lucene-dev and attach your code.
>Is this regarding your other contributions?
>I haven't had the chance to really look at them yet :(
>
>Stuff that goes into Sandbox is usually a software component or a
>project.  We haven't really put any code snippets (e.g. single
>classes)
>in there.
>Maybe we should have a place to use as a repository for various code
>snippets that people contribute, that would otherwise get lost in the
>mailing list archives, I don't know.
>
>Otis
>
>







Re: fixed url and How to contribute code to lucene sandbox?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
> BTW:
> How to contribute code to lucene sandbox?

You can just mail lucene-dev and attach your code.
Is this regarding your other contributions?
I haven't had the chance to really look at them yet :(

Stuff that goes into Sandbox is usually a software component or a
project.  We haven't really put any code snippets (e.g. single classes)
in there.
Maybe we should have a place to use as a repository for various code
snippets that people contribute, that would otherwise get lost in the
mailing list archives, I don't know.

Otis

