Posted to solr-user@lucene.apache.org by James liu <li...@gmail.com> on 2006/09/20 08:07:12 UTC

wana use CJKAnalyzer

My steps to add CJK support:
1: add lucene-analyzers-2.0.0.jar to "C:\cygwin\tmp\solr-nightly\lib"
2: in cmd, run "cd C:\cygwin\tmp\solr-nightly", then "ant dist"
3: copy "C:\cygwin\tmp\solr-nightly\dist\solr-1.0.war" to
"C:\cygwin\tmp\solr-nightly\example\webapps\solr.war"

4: modify the schema (conf/schema.xml) like yours, using just
<analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
5: delete solr/data/index
6: start Jetty: java -jar start.jar
7: no errors.
8: at http://localhost:8983/solr/admin I click the analyzer link and try to
analyze a Chinese word, but nothing happens.

9: I use xml.php to add documents to the index (English works fine); it
reports OK.
10: I open Solr's index data with lukeall.jar, but it looks like what is
shown in my attachments.


xml.php may be at fault, although it shows no error.

So I wrote jl.xml into example/exampledocs

and ran "sh post.sh jl.xml" under cygwin: no error.

But when I look with lukeall.jar, nothing has changed.

I have failed.

Maybe someone can give me some advice on how to solve this.


-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by Mike Klaas <mi...@gmail.com>.
On 9/19/06, James liu <li...@gmail.com> wrote:

> 4:modify schema(conf/schema.conf), like yours,,just "<analyzer
> class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>"

Are you testing the same field to which you are adding the analyzer?
I noticed in another mail that you added this to the "text_lu" field
type--the solr example uses "text", as I recall.
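
One way to double-check this is to resolve which analyzer a given field actually ends up with. The Python sketch below is only an illustration: the helper name `analyzer_for_field` is made up here, and the embedded schema is a trimmed stand-in for a real schema.xml, using the element names from the schema posted elsewhere in this thread.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for conf/schema.xml; a real schema has many more entries.
SCHEMA = """<schema name="example" version="1.1">
  <types>
    <fieldtype name="text" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
    </fieldtype>
    <fieldtype name="integer" class="solr.IntField"/>
  </types>
  <fields>
    <field name="id" type="integer" indexed="true" stored="true"/>
    <field name="content" type="text" indexed="true" stored="true"/>
  </fields>
</schema>"""


def analyzer_for_field(schema_xml, field_name):
    """Return the analyzer class configured for a field, or None if there is none."""
    root = ET.fromstring(schema_xml)
    # Map each field type name to its <analyzer class="..."> attribute, if any.
    analyzers = {}
    for ftype in root.iter("fieldtype"):
        analyzer = ftype.find("analyzer")
        analyzers[ftype.get("name")] = analyzer.get("class") if analyzer is not None else None
    # Look up the field, then follow its type to the analyzer.
    for field in root.iter("field"):
        if field.get("name") == field_name:
            return analyzers.get(field.get("type"))
    return None


print(analyzer_for_field(SCHEMA, "content"))
```

If the field being tested resolves to None or to a different analyzer class, the CJK analyzer is attached to a field type that the test never touches.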

-Mike

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
Hoss, thanks for your help.

2006/9/21, Chris Hostetter <ho...@fucit.org>:
>
>
>
> : 6:start jetty,java -jar start.jar
> : 7:no error.
> : 8: http://localhost:8983/solr/admin,,,i click analyzer link,,,and try
> : test analyzer chinese word,but nothing happend.
>
> ...i don't know much about non latin characters but i tried making the
> same changes you did, and asked a coworker who speaks/types chinese to try
> outthe Analyziz page, and he said it worked fine for him.
>
> one comment he had was that it only works if your www browser is
> configured to use utf-8 or to auto-select the character encoding (in which
> case it uses utf-8 because that's what the HTML page itself specifies as
> the encoding).  if you browser is explicitly configured to use Simplified
> Chinese (or, i assume, Traditional Chinese) as the encoding, then it won't
> work (the page he got looks like it might be what you are seeing: no data
> returned under the form, as if you had provided no input)


I tried to avoid the browser by using post.sh (in example/exampledocs). I
posted this jl.xml:
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
      <field name="id">111</field>
    <field name="content">姓名是刘平</field>
  </doc>
  <doc>
      <field name="id">112</field>
    <field name="content">姓名是小王</field>
  </doc>
  <doc>
      <field name="id">113</field>
    <field name="content">老婆不在家</field>
  </doc>
</add>

Under cygwin I ran "sh post.sh jl.xml".

So I don't think the web browser is the problem.

What is different in my setup? System: Windows 2003. Java: the JDK in
"C:\Sun\AppServer\jdk". The tutorial works for me. Cygwin was installed from
the internet.

I tested with both Jetty (the one bundled with Solr) and Tomcat 5.5.


Could you zip up your setup and send it to me so I can try it?

If that fails too, I think only my environment can be causing the problem.

Can I contact your Chinese-speaking coworker?

Traditional Chinese and Simplified Chinese are not the same,

and I use Simplified Chinese.


-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by Chris Hostetter <ho...@fucit.org>.

: 6:start jetty,java -jar start.jar
: 7:no error.
: 8: http://localhost:8983/solr/admin,,,i click analyzer link,,,and try
: test analyzer chinese word,but nothing happend.

...i don't know much about non latin characters but i tried making the
same changes you did, and asked a coworker who speaks/types chinese to try
out the Analysis page, and he said it worked fine for him.

one comment he had was that it only works if your www browser is
configured to use utf-8 or to auto-select the character encoding (in which
case it uses utf-8 because that's what the HTML page itself specifies as
the encoding).  if your browser is explicitly configured to use Simplified
Chinese (or, i assume, Traditional Chinese) as the encoding, then it won't
work (the page he got looks like it might be what you are seeing: no data
returned under the form, as if you had provided no input)

can you double check what encoding your browser is using when you
submit the form?
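
To illustrate why the browser setting matters, here is a small Python sketch. It assumes the server decodes submitted form values as UTF-8 and uses GBK as a stand-in for a browser forced to a Simplified Chinese encoding; `decode_form_value` is a hypothetical helper, not anything from Solr.

```python
from urllib.parse import quote, unquote


def decode_form_value(raw, encoding="utf-8"):
    """Decode a percent-encoded form value; return None if the bytes are not valid."""
    try:
        return unquote(raw, encoding=encoding, errors="strict")
    except UnicodeDecodeError:
        return None


text = "姓名是刘平"
sent_as_utf8 = quote(text, encoding="utf-8")  # what a UTF-8 browser would submit
sent_as_gbk = quote(text, encoding="gbk")     # what a GBK-configured browser would submit

print(decode_form_value(sent_as_utf8))  # round-trips to the original text
print(decode_form_value(sent_as_gbk))   # None: these GBK bytes are not valid UTF-8
```

A server that cannot decode the value may well behave as if the form had been left empty, which would match the symptom described above.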



-Hoss


Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
attachment: schema.xml

2006/9/20, James liu <li...@gmail.com>:
>
> i m java newer. so i print these steps.
>
> solr tutorial i test is ok.
>
> anything you wanna know, mail me.
>



-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
I'm new to Java, so I wrote out these steps.

I tested the Solr tutorial and it works.

If there is anything else you want to know, mail me.

Re: wana use CJKAnalyzer

Posted by Yonik Seeley <yo...@apache.org>.
On 9/21/06, Chris Hostetter <ho...@fucit.org> wrote:
>
> : i just wanna say: no your help,maybe i will give up.....thk u again.
> :
> : http://www.flickr.com/photos/93031839@N00/248815068/
>
> : > thk Hoss,Nick Snels,Koji,Mike and  everybody who helped me and wanna help
> : > me..
> : >
> : > i can use solr with Chinese Word.
>
> I'm sorry, i'm really confused now ... it seems like you got things
> working, but you also say "maybe i will give up" ... ?

I read that as "without your help, maybe I would have given up".

-Yonik

Re: wana use CJKAnalyzer

Posted by Walter Underwood <wu...@netflix.com>.
On 9/22/06 10:22 AM, "Yonik Seeley" <yo...@apache.org> wrote:

> What I think might be ideal: If there is a charset definition, then
> let the servlet handle it by requesting a Writer.  If there isn't
> a charset definition, request a byte-oriented InputStream from the
> container and let the XML parser try and figure out the encoding.

RFC 3023 is precise about this, so there is no need to guess.
The only question is how to implement the required behavior.
Here is a summary.

If there is a charset spec, use it and ignore any encoding
spec in the content.

If there is no charset spec for text/xml, use ASCII.

If there is no charset spec for application/xml, follow the
XML spec to determine encoding.

The safest way to send XML over HTTP is:

* use a standard XML encoding: UTF-8 or UTF-16 with BOM
* include an encoding in the <?xml?> line in the document
* use a content-type of application/xml without a charset param

Details are at http://www.ietf.org/rfc/rfc3023.txt
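
The three rules summarized above can be sketched as a small Python function. This is only an illustration of the summary, not a full RFC 3023 implementation; the name `charset_for_xml` and the convention that None means "detect it from the XML itself" are assumptions made for this example.

```python
from email.message import Message


def charset_for_xml(content_type):
    """Pick the charset for an XML request body from its Content-Type header.

    Returns a charset name, or None meaning "detect it from the XML itself".
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is not None:
        return charset.lower()   # an explicit charset always wins
    if msg.get_content_type() == "text/xml":
        return "us-ascii"        # text/xml without a charset defaults to ASCII
    return None                  # application/xml: follow the XML spec


print(charset_for_xml("text/xml; charset=UTF-8"))
print(charset_for_xml("text/xml"))
print(charset_for_xml("application/xml"))
```

The middle case is the surprising one: a UTF-8 document sent as plain text/xml with no charset is, by this rule, to be read as ASCII.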

wunder
--
Walter Underwood
Search Guru, Netflix


Re: wana use CJKAnalyzer

Posted by Yonik Seeley <yo...@apache.org>.
On 9/22/06, Walter Underwood <wu...@netflix.com> wrote:
> This might be a Solr bug. Solr should be able to accept XML in any
> of the required encodings (ASCII, Latin 1, UTF-8, and UTF-16).
> Getting XML content types exactly right is tricky, see RFC 3023.

Right now Solr pays attention to Content-type in the HTTP-headers (it
lets the servlet container handle charset conversions), and ignores
any charset declaration in the XML itself.

What I think might be ideal: If there is a charset definition, then let
the servlet handle it by requesting a Writer.  If there isn't a charset
definition, request a byte-oriented InputStream from the container and
let the XML parser try and figure out the encoding.
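
A rough Python stand-in for that idea (not Solr's servlet code, and `parse_update` is a name invented for this sketch): when the header supplies a charset it is forced onto the parser, otherwise the raw bytes go to the XML parser, which detects the encoding from the BOM or the XML declaration.

```python
import xml.etree.ElementTree as ET


def parse_update(body, charset=None):
    """Parse an XML request body given as bytes.

    charset: the value from the Content-Type header, or None if it had none.
    """
    if charset:
        # Header charset present: it overrides any encoding declared in the XML.
        parser = ET.XMLParser(encoding=charset)
    else:
        # No header charset: the parser works out the encoding from the bytes.
        parser = ET.XMLParser()
    return ET.fromstring(body, parser=parser)


doc = ('<?xml version="1.0" encoding="UTF-16"?>'
       '<add><doc><field name="content">姓名是刘平</field></doc></add>')
root = parse_update(doc.encode("utf-16"))  # no charset: detected from the document
print(root.find("doc/field").text)
```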

-Yonik

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
2006/9/25, Walter Underwood <wu...@netflix.com>:
>
> This document has two problems. First, the document is not well-formed
> XML.
> Open it  in Firefox and you will see this error:
>
>    XML Parsing Error: mismatched tag. Expected: </doc>.
>    Location: file:///Users/wunderwood/Desktop/jl.xml
>    Line Number 15, Column 3:
>
> After I fix that, it still is not legal UTF-8.


I'm sorry about the extra <doc>. I had been testing with more data in Solr,
and I trimmed jl.xml down to send it as an attachment without checking it
afterwards. That is why you found this problem.
Yes, it is not legal UTF-8.
By "UTF-8 encoding" I meant the encoding the file is saved in.
When you create a new XML file in EditPlus and save it, a dialog appears
that lets you select the encoding (you can see it in the attachments).
That is how I saved jl.xml, and I indexed it with post.sh.

If you use a scripting language, e.g. solrphp (my solrphp is not the one from
Solr's wiki; I modified it), you must send your XML encoded as UTF-8.
For instance, when I send my.xml to http://localhost:8983/solr/update, the
request headers should include "Content-Type: text/xml;charset=utf-8".
Solr works well once that header is present.
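
For readers not using PHP, the same request can be sketched in Python. This is a minimal illustration that assumes the update URL from this thread; `build_update_request` is a hypothetical helper, and the request is only built here, not sent.

```python
import urllib.request


def build_update_request(xml_text, url="http://localhost:8983/solr/update"):
    """Build (but do not send) a POST request carrying UTF-8 encoded XML."""
    body = xml_text.encode("utf-8")  # bytes on the wire must match the declared charset
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


req = build_update_request('<add><doc><field name="id">111</field></doc></add>')
print(req.get_method(), req.get_header("Content-type"))
# To actually send it: urllib.request.urlopen(req)
```

The point is that the header and the bytes have to agree: declaring utf-8 while sending bytes saved in another encoding reproduces the original problem.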


> Does Solr report parsing errors? It really should. Maybe a 400 Bad Request
> response with a text/plain body showing the error message.


After I fixed the extra <doc> problem, Solr works well.

> wunder
>
>
> On 9/22/06 6:24 PM, "James liu" <li...@gmail.com> wrote:
> >
> > 2006/9/23, Walter Underwood <wu...@netflix.com>:
> >> On 9/21/06 5:37 PM, "James liu" <li...@gmail.com> wrote:
> >>
> >>> > Yes,it working. the root of my problem is xml muse be encoded by
> utf-8.
> >>> > if use php,it not about www browser. just notice that
> >>> > curl header information must be utf-8.
> >>> > if use post.sh,xml muse be encoded by utf-8.(my editplus default
> encode
> >>> > style is ansi)
> >>
> >> This might be a Solr bug. Solr should be able to accept XML in any
> >> of the required encodings (ASCII, Latin 1, UTF-8, and UTF-16).
> >> Getting XML content types exactly right is tricky, see RFC 3023.
> >>
> >> What curl command line was used?
> >
> > No sepcial curl command i use.just solr-nightly/example/exampledocs
> post.sh.
> > but my jl.xml encoded  utf-8(i use editplus, i tried to use  xml
> encoding utf
> > 8, but it is not effect).
> > solrphp i use curl "$header=array("Content-Type:
> > text/xml;charset=utf-8");curl_setopt($ch, CURLOPT_HTTPHEADER,
> $header);", this
> > is php.
> >
> >> What encoding is the XML?
> >>
> >> Can you give a sample XML file?
> >
> > see attachments, anything you need mail me.
> >
> >> wunder
> >> --
> >> Walter Underwood
> >> Search Guru, Netflix
> >>
> >
> >
>
>
>
>


-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by Walter Underwood <wu...@netflix.com>.
This document has two problems. First, the document is not well-formed XML.
Open it  in Firefox and you will see this error:

   XML Parsing Error: mismatched tag. Expected: </doc>.
   Location: file:///Users/wunderwood/Desktop/jl.xml
   Line Number 15, Column 3:

After I fix that, it still is not legal UTF-8.

Does Solr report parsing errors? It really should. Maybe a 400 Bad Request
response with a text/plain body showing the error message.
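
As a sketch of what such error reporting could look like (written in Python for brevity; this is not how Solr behaves or is implemented, and `check_update_body` is a made-up name):

```python
import xml.etree.ElementTree as ET


def check_update_body(body):
    """Return (status, content_type, message) for an XML update body."""
    try:
        ET.fromstring(body)
    except ET.ParseError as err:
        # Malformed XML: answer 400 with the parser's message as plain text.
        return 400, "text/plain", "XML parsing error: %s" % err
    return 200, "text/plain", "OK"


print(check_update_body("<add><doc><field>x</field></add>"))  # missing </doc>
```

Both problems mentioned here, a mismatched tag and bytes that are not legal UTF-8, surface as parse errors and so would end up in the 400 branch.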

wunder


On 9/22/06 6:24 PM, "James liu" <li...@gmail.com> wrote:
> 
> 2006/9/23, Walter Underwood <wu...@netflix.com>:
>> On 9/21/06 5:37 PM, "James liu" <li...@gmail.com> wrote:
>> 
>>> > Yes,it working. the root of my problem is xml muse be encoded by utf-8.
>>> > if use php,it not about www browser. just notice that
>>> > curl header information must be utf-8.
>>> > if use post.sh,xml muse be encoded by utf-8.(my editplus default encode
>>> > style is ansi)
>> 
>> This might be a Solr bug. Solr should be able to accept XML in any
>> of the required encodings (ASCII, Latin 1, UTF-8, and UTF-16).
>> Getting XML content types exactly right is tricky, see RFC 3023.
>> 
>> What curl command line was used?
> 
> No sepcial curl command i use.just solr-nightly/example/exampledocs post.sh.
> but my jl.xml encoded  utf-8(i use editplus, i tried to use  xml encoding utf
> 8, but it is not effect).
> solrphp i use curl "$header=array("Content-Type:
> text/xml;charset=utf-8");curl_setopt($ch, CURLOPT_HTTPHEADER, $header);", this
> is php. 
> 
>> What encoding is the XML?
>> 
>> Can you give a sample XML file?
> 
> see attachments, anything you need mail me.
> 
>> wunder
>> --
>> Walter Underwood
>> Search Guru, Netflix
>> 
> 
> 



Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
2006/9/23, Walter Underwood <wu...@netflix.com>:
>
> On 9/21/06 5:37 PM, "James liu" <li...@gmail.com> wrote:
>
> > Yes,it working. the root of my problem is xml muse be encoded by utf-8.
> > if use php,it not about www browser. just notice that
> > curl header information must be utf-8.
> > if use post.sh,xml muse be encoded by utf-8.(my editplus default encode
> > style is ansi)
>
> This might be a Solr bug. Solr should be able to accept XML in any
> of the required encodings (ASCII, Latin 1, UTF-8, and UTF-16).
> Getting XML content types exactly right is tricky, see RFC 3023.
>
> What curl command line was used?


I didn't use any special curl command, just post.sh from
solr-nightly/example/exampledocs.
But my jl.xml is saved as UTF-8 (I use EditPlus; I also tried declaring the
encoding as UTF-8 in the XML, but that alone had no effect).
In solrphp I call curl like this (PHP):
$header = array("Content-Type: text/xml;charset=utf-8");
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);

> What encoding is the XML?
>
> Can you give a sample XML file?


See the attachments. If you need anything else, mail me.

> wunder
> --
> Walter Underwood
> Search Guru, Netflix
>
>


-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by Walter Underwood <wu...@netflix.com>.
On 9/21/06 5:37 PM, "James liu" <li...@gmail.com> wrote:

> Yes,it working. the root of my problem is xml muse be encoded by utf-8.
> if use php,it not about www browser. just notice that
> curl header information must be utf-8.
> if use post.sh,xml muse be encoded by utf-8.(my editplus default encode
> style is ansi)

This might be a Solr bug. Solr should be able to accept XML in any
of the required encodings (ASCII, Latin 1, UTF-8, and UTF-16).
Getting XML content types exactly right is tricky, see RFC 3023.

What curl command line was used?

What encoding is the XML?

Can you give a sample XML file?

wunder
--
Walter Underwood
Search Guru, Netflix


Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
2006/9/22, Chris Hostetter <ho...@fucit.org>:
>
>
> : i just wanna say: no your help,maybe i will give up.....thk u again.
> :
> : http://www.flickr.com/photos/93031839@N00/248815068/
>
> : > thk Hoss,Nick Snels,Koji,Mike and  everybody who helped me and wanna
> help
> : > me..
> : >
> : > i can use solr with Chinese Word.
>
> I'm sorry, i'm really confused now ... it seems like you got things
> working, but you also say "maybe i will give up" ... ?



I expressed myself badly. I meant "maybe I would have given up"; in fact I
did not give up. My English is poor.
Thanks, Yonik.

> 1) if you did get things working, what was the root of your problem, was
> it the utf-8 issue when using the forms in your browser or adding docs?


Yes, it is working. The root of my problem was that the XML must be encoded
as UTF-8.
With PHP it has nothing to do with the web browser; just make sure the
curl header declares charset utf-8.
With post.sh, the XML file itself must be saved as UTF-8 (my EditPlus default
encoding is ANSI).
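
A quick way to check a file for this before posting it, sketched in Python. It assumes that "ANSI" on a Simplified Chinese Windows system means the GBK code page, which may not hold on other setups; both helper names are invented for this example.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data, source_encoding="gbk"):
    """Re-encode bytes from an assumed legacy encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


ansi_bytes = "姓名是刘平".encode("gbk")    # what an "ANSI" save might produce
print(is_valid_utf8(ansi_bytes))           # False
print(is_valid_utf8(to_utf8(ansi_bytes)))  # True
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to `is_valid_utf8`.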


> 2) if things aren't working right, what is the current state of things?
> ... from the picture "solr_chinese" on your flicker page, Luke seems to be
> showing you Chinese characters in a Solr index ... are they not being
> tokenized properly or something?
>
>
>
>
> -Hoss
>
>


-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by Chris Hostetter <ho...@fucit.org>.
: i just wanna say: no your help,maybe i will give up.....thk u again.
:
: http://www.flickr.com/photos/93031839@N00/248815068/

: > thk Hoss,Nick Snels,Koji,Mike and  everybody who helped me and wanna help
: > me..
: >
: > i can use solr with Chinese Word.

I'm sorry, i'm really confused now ... it seems like you got things
working, but you also say "maybe i will give up" ... ?

1) if you did get things working, what was the root of your problem, was
it the utf-8 issue when using the forms in your browser or adding docs?

2) if things aren't working right, what is the current state of things?
... from the picture "solr_chinese" on your Flickr page, Luke seems to be
showing you Chinese characters in a Solr index ... are they not being
tokenized properly or something?




-Hoss


Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
i just wanna say: no your help,maybe i will give up.....thk u again.

http://www.flickr.com/photos/93031839@N00/248815068/

2006/9/21, James liu <li...@gmail.com>:
>
> thk Hoss,Nick Snels,Koji,Mike and  everybody who helped me and wanna help
> me..
>
> i can use solr with Chinese Word.
>
>
>
>
>
>


-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
Thanks Hoss, Nick Snels, Koji, Mike, and everybody who helped or wanted to
help me.

I can now use Solr with Chinese words.

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
I recompiled it.

When I run "ant dist", the output says that some APIs are deprecated or
unchecked.

Is that a problem?


You can find my Java version at
http://www.flickr.com/photos/93031839@N00/?saved=1



2006/9/21, James liu <li...@gmail.com>:
>
> i dont know it is import i add junit....
>
> when i use ant dist,,,it show me error information : not found junit,,,so
> i download and add it.
>
> Is it problem about CJKAnalyzer?
>
>
>


-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
I don't know whether it matters that I added JUnit.

When I ran "ant dist", it failed with an error saying junit was not found, so
I downloaded it and added it.

Could that be a problem for CJKAnalyzer?

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
I used lukeall.jar to check the index data.

You can find the pictures at http://www.flickr.com/photos/93031839@N00/?saved=1


solr.jpg shows Solr's index data viewed in lukeall.jar.

lucene.jpg shows Lucene's index data viewed in lukeall.jar.


With plain Lucene it works fine now.

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
Sorry, that was wrong...

This is my schema.xml:

<?xml version="1.0" ?>
<schema name="example" version="1.1">
  <types>
    <fieldtype name="text" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
    </fieldtype>
    <fieldtype name="integer" class="solr.IntField"/>
  </types>
  <fields>
    <field name="id" type="integer" indexed="true" stored="true"/>
    <field name="content" type="text" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
</schema>



On 06-9-21, James liu <li...@gmail.com> wrote:
>
> to mike:
>
> " Are you testing the same field to which you are adding the analyzer?
> I noticed in another mail that you added this to the "text_lu" field
> type--the solr example uses "text", as I recall."
>
> now my schema.xml:
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
>   <doc>
>       <field name="id">111</field>
>     <field name="content">姓名是刘平</field>
>   </doc>
>   <doc>
>       <field name="id">112</field>
>     <field name="content">姓名是小王</field>
>   </doc>
>   <doc>
>       <field name="id">113</field>
>     <field name="content">老婆不在家</field>
>   </doc>
> </add>
>
> but i m failed..
>



-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
to mike:
" Are you testing the same field to which you are adding the analyzer?
I noticed in another mail that you added this to the "text_lu" field
type--the solr example uses "text", as I recall."

now my schema.xml:
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
      <field name="id">111</field>
    <field name="content">姓名是刘平</field>
  </doc>
  <doc>
      <field name="id">112</field>
    <field name="content">姓名是小王</field>
  </doc>
  <doc>
      <field name="id">113</field>
    <field name="content">老婆不在家</field>
  </doc>
</add>

but it still fails for me.

Re: wana use CJKAnalyzer

Posted by Chris Hostetter <ho...@fucit.org>.
: you find index data from my attachements. its name is solr.jpg and lucene
: breaking well, its name is lucene.jpg

FYI: the mailing list only allows text attachments, if you want to refer
to images you have to send a URL to an image online somewhere instead.

-Hoss


Re: wana use CJKAnalyzer

Posted by James liu <li...@gmail.com>.
2006/9/20, Yonik Seeley <yo...@apache.org>:
>
> On 9/20/06, James liu <li...@gmail.com> wrote:
> > My step to support CJK...:
> > 1:add lucene-analyzers-2.0.0.jar to
> > "C:\cygwin\tmp\solr-nightly\lib"
> > 2:use cmd, "cd C:\cygwin\tmp\solr-nightly","ant dist"
> > 3:copy "C:\cygwin\tmp\solr-nightly\dist\solr- 1.0.war" to
> > "C:\cygwin\tmp\solr-nightly\example\webapps\solr.war"
> >
> > 4:modify schema(conf/schema.conf), like yours,,just "<analyzer
> > class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>"
> > 5:delete solr/data/index;
> > 6:start jetty,java -jar start.jar
> > 7:no error.
> > 8: http://localhost:8983/solr/admin,,,i click analyzer
> > link,,,and try test analyzer chinese word,but nothing happend.
>
> When you say nothing happened, do you mean the analyzer didn't change
> the text at all, or you didn't see any output at all?  Did you type
> some text into the input fields?  Does it work for you with english
> text?


I thought it was clear. Step 8: I used the admin page's analyzer; the
analyzer didn't change the text at all, and there was no output. I'm sure I
typed Chinese words into the input fields.
It works with English text.


> 9: i use xml.php to add index(english is well),it show me ok
> > 10: i try lukeall.jar to see solr's index data. but it show me like my
> > attachements.
>
> Please be explicit on what the problem is... not many people on this
> list can look at CJK and see what is wrong.

Yes, I know. I'm sorry for not being explicit.

> Do you mean that the
> analyzer isn't breaking up your text into words?
>
> -Yonik
>

I mean that when I follow these steps, it does break my text into words, but
I can't recognize the resulting words.

You can see the index data in my attachments: solr.jpg shows the Solr index,
and lucene.jpg shows Lucene, where the text is broken up correctly.


-- 
regards
jl

Re: wana use CJKAnalyzer

Posted by Yonik Seeley <yo...@apache.org>.
On 9/20/06, James liu <li...@gmail.com> wrote:
> My step to support CJK...:
> 1:add lucene-analyzers-2.0.0.jar to
> "C:\cygwin\tmp\solr-nightly\lib"
> 2:use cmd, "cd C:\cygwin\tmp\solr-nightly","ant dist"
> 3:copy "C:\cygwin\tmp\solr-nightly\dist\solr- 1.0.war" to
> "C:\cygwin\tmp\solr-nightly\example\webapps\solr.war"
>
> 4:modify schema(conf/schema.conf), like yours,,just "<analyzer
> class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>"
> 5:delete solr/data/index;
> 6:start jetty,java -jar start.jar
> 7:no error.
> 8: http://localhost:8983/solr/admin,,,i click analyzer
> link,,,and try test analyzer chinese word,but nothing happend.

When you say nothing happened, do you mean the analyzer didn't change
the text at all, or you didn't see any output at all?  Did you type
some text into the input fields?  Does it work for you with english
text?

> 9: i use xml.php to add index(english is well),it show me ok
> 10: i try lukeall.jar to see solr's index data. but it show me like my
> attachements.

Please be explicit on what the problem is... not many people on this
list can look at CJK and see what is wrong.  Do you mean that the
analyzer isn't breaking up your text into words?
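
For list members who cannot read CJK, it may help to know roughly what output to expect. As far as I know, CJKAnalyzer does not segment text into dictionary words; it emits overlapping two-character tokens (bigrams). The Python sketch below only imitates that idea: it is not the Lucene code, it ignores Latin text, and both function names are made up for illustration.

```python
def is_cjk(ch):
    """Rough test for a CJK Unified Ideograph (basic block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text):
    """Emit overlapping two-character tokens for each run of CJK characters."""
    tokens, run = [], ""
    for ch in text + " ":       # the trailing space flushes the last run
        if is_cjk(ch):
            run += ch
            continue
        if len(run) == 1:
            tokens.append(run)  # a lone character stays a single token
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        run = ""
    return tokens


print(cjk_bigrams("姓名是刘平"))  # ['姓名', '名是', '是刘', '刘平']
```

So if the indexed terms are two-character pairs like these, the analyzer is probably doing its job; unreadable terms point at an encoding problem instead.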

-Yonik