Posted to java-user@lucene.apache.org by Bernd Fehling <be...@uni-bielefeld.de> on 2011/02/25 11:23:27 UTC

which unicode version is supported with lucene

Dear list,

A very basic question about Lucene: which version of
Unicode can be handled (indexed and searched) with Lucene?

It looks like Lucene can only handle the very old Unicode 2.0
but not the newer Unicode 3.1 (4-byte UTF-8 code points).

Is that true?

Regards,
Bernd



RE: which unicode version is supported with lucene

Posted by Uwe Schindler <uw...@thetaphi.de>.
What APIs are you using to communicate with Solr? If you are using XML it may be limited by the XML parser used... If you are using SolrJ with the binary request handler it should go through in all cases.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> Sent: Friday, February 25, 2011 2:48 PM
> To: java-user@lucene.apache.org
> Subject: Re: which unicode version is supported with lucene
> 
> 
> So Solr trunk should already handle Unicode above BMP for field type string?
> Strange...
> 
> Regards,
> Bernd
> 
> Am 25.02.2011 14:40, schrieb Uwe Schindler:
> > Solr trunk is using Lucene trunk since Lucene and Solr are merged.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> >> Sent: Friday, February 25, 2011 2:19 PM
> >> To: simon.willnauer@gmail.com
> >> Cc: java-user@lucene.apache.org
> >> Subject: Re: which unicode version is supported with lucene
> >>
> >> Hi Simon,
> >>
> >> actually I'm working with Solr from trunk but followed the problem
> >> all the way down to Lucene. I think Solr trunk is build with Lucene 3.0.3.
> >>
> >> My field is:
> >> <field name="dcdescription" type="string" indexed="false"
> >> stored="true" />
> >>
> >> No analysis done at all, just stored the content for result display.
> >> But the result is unpredictable and can end in invalid utf-8 code.
> >>
> >> Regards,
> >> Bernd
> >>
> >>
> >> Am 25.02.2011 13:43, schrieb Simon Willnauer:
> >>> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> >>> <be...@uni-bielefeld.de> wrote:
> >>>> Hi Simon,
> >>>>
> >>>> thanks for the details.
> >>>>
> >>>> My platform supports and uses code above BMP (0x10000 and up).
> >>>> So the limit is Lucene.
> >>>> Don't know how to handle this problem.
> >>>> May be deleting all code above BMP...???
> >>>
> >>> the code will work fine even if they are in you text. It will just
> >>> not respect them maybe throw them away during tokenization etc. so
> >>> it really depends what you are using on the analyzer side. maybe you
> >>> can give us little more details on what you use for analysis. One
> >>> option would be to build 3.1 from the source and use the analyzers
> >>> from there?!
> >>>
> >>>>
> >>>> Good to hear that Lucene 3.1 will come soon.
> >>>> Any rough estimation when Lucene 3.1 will be available?
> >>>
> >>> I hope it will happen within the next 4 weeks
> >>>
> >>> simon
> >>>
> >>>>
> >>>> Regards,
> >>>> Bernd
> >>>>
> >>>> Am 25.02.2011 12:04, schrieb Simon Willnauer:
> >>>>> Hey Bernd,
> >>>>>
> >>>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
> >>>>> <be...@uni-bielefeld.de> wrote:
> >>>>>> Dear list,
> >>>>>>
> >>>>>> a very basic question about lucene, which version of unicode can
> >>>>>> be handled (indexed and searched) with lucene?
> >>>>>
> >>>>> if you ask for what the indexer / query can handle then it is
> >>>>> really what UTF-8 can handle. Strings passed to the writer /
> >>>>> reader are converted to UTF-8 internally (rough picture). On Trunk
> >>>>> we are indexing bytes only (UTF-8 bytes by default). so the
> >>>>> question is really what you platform supports in terms of
> >>>>> utilities / operations on characters and strings. Since Lucene 3.0
> >>>>> we are on Java 1.5 and have the possibility to respect code points
> which are above the BMP.
> >>>>> Lucene 2.9 still has Java 1.4 System Requirements that prevented
> >>>>> us from moving forward to Unicode 4.0. If you look at
> >>>>> Character.java all methods have been converted to operate on
> >>>>> UTF-32 code points instead of UTF-16 code points in Java 1.4.
> >>>>>
> >>>>> Since 3.0 is a Java Generics / move to Java 1.5 only release these
> >>>>> APIs are not in use yet in the latest released version. Lucene 3.1
> >>>>> holds a largely converted Analyzer / TokenFilter / Tokenizer
> >>>>> codebase (I think there are one or two which still have problems,
> >>>>> I should check... Robert did we fix all NGram stuff?).
> >>>>>
> >>>>> So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
> >>>>> support characters within the BMP <= 0xFFFF. 3.1 (to be released
> >>>>> soon I hope) will fix most of the problems and includes ICU based
> >>>>> analysis for full Unicode 5 support.
> >>>>>
> >>>>> hope that helps
> >>>>>
> >>>>> simon
> >>>>>>
> >>>>>> It looks like lucene can only handle the very old Unicode 2.0 but
> >>>>>> not the newer 3.1 version (4 byte utf-8 unicode).
> >>>>>>
> >>>>>> Is that true?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Bernd
> >>>>>>
> >>>>
> 





Re: which unicode version is supported with lucene

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Sun, Feb 27, 2011 at 2:15 PM, Bernd Fehling
<be...@uni-bielefeld.de> wrote:
> Jepp, its back online.
> Just did a short test and reported my results to jira, but is the
> error from the xml output still a jetty problem or is it from XMLwriter?

The patch has been committed, so you should just be able to try trunk (or 3x).

I also just committed a char beyond the BMP to utf8-example.xml,
and indexing and XML output work fine for me.

Index the example docs, then do a query for "BMP" to bring up that document.

-Yonik
http://lucidimagination.com



RE: which unicode version is supported with lucene

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Yep, it's back online.
I just did a short test and reported my results to JIRA, but is the
error in the XML output still a Jetty problem, or is it from the XMLWriter?

Regards, Bernd

> It's back online! It would be good, if you could confirm, we did 
> hard work
> to fix this and report the bugs in Jetty to Jetty itself
> 
> Thanks,
> Uwe!
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> > -----Original Message-----
> > From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> > Sent: Sunday, February 27, 2011 3:04 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: which unicode version is supported with lucene
> > 
> > Hi Robert,
> > 
> > thanks to you and Yonik for looking into this.
> > As soon as Apache jira is back online I will try your jetty version and
> > give feedback.
> > 
> > Regards,
> > Bernd
> > 
> > > On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
> > > <be...@uni-bielefeld.de> wrote:
> > > > Hi Yonik,
> > > >
> > > > good point, yes we are using Jetty.
> > > > Do you know if Tomcat has this limitation?
> > > >
> > >
> > > Hi Bernd, I placed some patched Jetty jar files on
> > > https://issues.apache.org/jira/browse/SOLR-2381 for the meantime.
> > >
> > > Maybe then you can get past your problem with Jetty.
> > >
> > >
> > 
> 
> 
> 
> 



RE: which unicode version is supported with lucene

Posted by Uwe Schindler <uw...@thetaphi.de>.
It's back online! It would be good if you could confirm; we did hard
work to fix this and reported the bugs in Jetty to the Jetty project itself.

Thanks,
Uwe!

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> Sent: Sunday, February 27, 2011 3:04 PM
> To: java-user@lucene.apache.org
> Subject: Re: which unicode version is supported with lucene
> 
> Hi Robert,
> 
> thanks to you and Yonik for looking into this.
> As soon as Apache jira is back online I will try your jetty version and
> give feedback.
> 
> Regards,
> Bernd
> 
> > On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
> > <be...@uni-bielefeld.de> wrote:
> > > Hi Yonik,
> > >
> > > good point, yes we are using Jetty.
> > > Do you know if Tomcat has this limitation?
> > >
> >
> > Hi Bernd, I placed some patched Jetty jar files on
> > https://issues.apache.org/jira/browse/SOLR-2381 for the meantime.
> >
> > Maybe then you can get past your problem with Jetty.
> >
> >
> 





Re: which unicode version is supported with lucene

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Hi Robert,

Thanks to you and Yonik for looking into this.
As soon as Apache JIRA is back online I will try your Jetty version
and give feedback.

Regards,
Bernd

> On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
> <be...@uni-bielefeld.de> wrote:
> > Hi Yonik,
> >
> > good point, yes we are using Jetty.
> > Do you know if Tomcat has this limitation?
> >
> 
> Hi Bernd, I placed some patched Jetty jar files on
> https://issues.apache.org/jira/browse/SOLR-2381 for the meantime.
> 
> Maybe then you can get past your problem with Jetty.
> 
> 



Re: which unicode version is supported with lucene

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
<be...@uni-bielefeld.de> wrote:
> Hi Yonik,
>
> good point, yes we are using Jetty.
> Do you know if Tomcat has this limitation?
>

Hi Bernd, I placed some patched Jetty jar files on
https://issues.apache.org/jira/browse/SOLR-2381 for the meantime.

Maybe then you can get past your problem with Jetty.



Re: which unicode version is supported with lucene

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
I just tried vim as an editor; it seems to work.

- start vim
- enter i (for insert)
- enter <ctrl>+v and then <shift>+U (for uppercase U)
- enter the supplementary code point as 8 hex digits
  (e.g. 0001D5A0 for U+1D5A0 [MATHEMATICAL SANS-SERIF CAPITAL A])
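
If it is easier to produce such a test character programmatically, the
same thing can be done in plain Java (just an illustration; the class
name is made up for this sketch):

    public class MakeSupplementaryChar {
        public static void main(String[] args) {
            // U+1D5A0 MATHEMATICAL SANS-SERIF CAPITAL A, stored as a surrogate pair
            String a = new String(Character.toChars(0x1D5A0));
            // prints: <char> length=2 codePoints=1
            System.out.println(a + " length=" + a.length()
                + " codePoints=" + a.codePointCount(0, a.length()));
        }
    }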


Am 25.02.2011 15:16, schrieb Yonik Seeley:
> On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
> <be...@uni-bielefeld.de> wrote:
>> Hi Yonik,
>>
>> good point, yes we are using Jetty.
>> Do you know if Tomcat has this limitation?
> 
> Tomcat's defaults are worse - you need to configure it to use UTF-8 by
> default for URLs.
> Once you do, it passes all those tests (last I checked).  Those tests
> are really about UTF-8 working in GET/POST query arguments.  Solr may
> still be able to handle indexing and returning full UTF-8, but you
> wouldn't be able to query for it w/o using surrogates if you're using
> Jetty.
> 
> It would be good to test though - does anyone know how to add a char
> above the BMP to utf8-example.xml?
> 
> -Yonik
> http://lucidimagination.com
> 
> 
>> Regards,
>> Bernd
>>
>> Am 25.02.2011 14:54, schrieb Yonik Seeley:
>>> On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
>>> <be...@uni-bielefeld.de> wrote:
>>>> So Solr trunk should already handle Unicode above BMP for field type string?
>>>> Strange...
>>>
>>> One issue is that jetty doesn't support UTF-8 beyond the BMP:
>>>
>>> /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
>>> Solr server is up.
>>> HTTP GET is accepting UTF-8
>>> HTTP POST is accepting UTF-8
>>> HTTP POST defaults to UTF-8
>>> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
>>> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
>>> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
>>> multilingual plane
>>>
>>> -Yonik
>>> http://lucidimagination.com
>>



Re: which unicode version is supported with lucene

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Feb 25, 2011 at 10:04 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> But firefox complains on XML output, and any other output like JSON it
> looks mangled.
> My bet is Jetty's UTF8 encoding for the response also doesn't handle
> the full range.
>

I created a JIRA issue on jetty's issue tracker with a tentative fix:
http://jira.codehaus.org/browse/JETTY-1340

Our test_utf8.sh passes with this.



Re: which unicode version is supported with lucene

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Feb 25, 2011 at 9:31 AM, Robert Muir <rc...@gmail.com> wrote:
> Then i searched on 'range' via the admin gui to retrieve this
> document, and chrome blew up with "This page contains the following
> errors: error on line 17 at column 306: Encoding error"

I got an error in Firefox too.
I added the following example (commented out for now):
    <field name="features">Outside the BMP:𐌈 codepoint=10308, a circle with an x inside. UTF8=f0908c88 UTF16=d800 df08</field>

I can verify it got into Solr OK by querying with the Python response
format (which escapes everything outside the ASCII range for each 16-bit char):
http://localhost:8983/solr/select?q=BMP&wt=python&indent=true

[...]
          u'Outside the BMP:\ud800\udf08 codepoint=10308, a circle with an x inside. UTF8=f0908c88 UTF16=d800 df08']}]

But Firefox complains about the XML output, and in any other format
like JSON it looks mangled.
My bet is that Jetty's UTF-8 encoding for the response also doesn't
handle the full range.
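
For reference, the byte values in that example line can be reproduced
with a few lines of plain Java (just a sanity-check sketch, nothing
Solr-specific; the class name is made up):

    public class EncodeCheck {
        public static void main(String[] args) throws Exception {
            String s = new String(Character.toChars(0x10308));
            // UTF-8 bytes: f0 90 8c 88
            for (byte b : s.getBytes("UTF-8")) {
                System.out.printf("%02x ", b);
            }
            System.out.println();
            // UTF-16 code units (surrogate pair): d800 df08
            for (int i = 0; i < s.length(); i++) {
                System.out.printf("%04x ", (int) s.charAt(i));
            }
            System.out.println();
        }
    }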

-Yonik
http://lucidimagination.com



Re: which unicode version is supported with lucene

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Feb 25, 2011 at 9:16 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
>
> On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
> <be...@uni-bielefeld.de> wrote:
> > Hi Yonik,
> >
> > good point, yes we are using Jetty.
> > Do you know if Tomcat has this limitation?
>
> Tomcat's defaults are worse - you need to configure it to use UTF-8 by
> default for URLs.
> Once you do, it passes all those tests (last I checked).  Those tests
> are really about UTF-8 working in GET/POST query arguments.  Solr may
> still be able to handle indexing and returning full UTF-8, but you
> wouldn't be able to query for it w/o using surrogates if you're using
> Jetty.
>
> It would be good to test though - does anyone know how to add a char
> above the BMP to utf8-example.xml?
>

I tried the following, then tried to search on this character (U+29B05
/ UTF-8: [f0 a9 ac 85]) with Jetty and got no results.
I also went to analysis.jsp as a quick test, and noted that Jetty
treats it as if it were U+9B05 / UTF-8: [e9 ac 85].

Then I searched on 'range' via the admin GUI to retrieve this
document, and Chrome blew up with "This page contains the following
errors: error on line 17 at column 306: Encoding error".

Didn't try Tomcat.

Index: utf8-example.xml
===================================================================
--- utf8-example.xml (revision 1074125)
+++ utf8-example.xml (working copy)
@@ -34,6 +34,7 @@
     <field name="features">eaiou with umlauts: ëäïöü</field>
     <field name="features">tag with escaped chars: &lt;nicetag/&gt;</field>
     <field name="features">escaped ampersand: Bonnie &amp; Clyde</field>
+    <field name="features">full unicode range (supplementary char): 𩬅</field>
     <field name="price">0</field>
     <!-- no popularity, get the default from schema.xml -->
     <field name="inStock">true</field>



Re: which unicode version is supported with lucene

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
<be...@uni-bielefeld.de> wrote:
> Hi Yonik,
>
> good point, yes we are using Jetty.
> Do you know if Tomcat has this limitation?

Tomcat's defaults are worse - you need to configure it to use UTF-8 by
default for URLs.
Once you do, it passes all those tests (last I checked).  Those tests
are really about UTF-8 working in GET/POST query arguments.  Solr may
still be able to handle indexing and returning full UTF-8, but you
wouldn't be able to query for it w/o using surrogates if you're using
Jetty.
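
For anyone who needs the Tomcat side of this: the usual setting is the
URIEncoding attribute on the HTTP connector in conf/server.xml, roughly
like the following (port and protocol here are only placeholders for
whatever your installation uses):

    <!-- conf/server.xml: decode URL query parameters as UTF-8 -->
    <Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8" />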

It would be good to test though - does anyone know how to add a char
above the BMP to utf8-example.xml?

-Yonik
http://lucidimagination.com


> Regards,
> Bernd
>
> Am 25.02.2011 14:54, schrieb Yonik Seeley:
>> On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
>> <be...@uni-bielefeld.de> wrote:
>>> So Solr trunk should already handle Unicode above BMP for field type string?
>>> Strange...
>>
>> One issue is that jetty doesn't support UTF-8 beyond the BMP:
>>
>> /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
>> Solr server is up.
>> HTTP GET is accepting UTF-8
>> HTTP POST is accepting UTF-8
>> HTTP POST defaults to UTF-8
>> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
>> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
>> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
>> multilingual plane
>>
>> -Yonik
>> http://lucidimagination.com
>



Re: which unicode version is supported with lucene

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Hi Yonik,

good point, yes we are using Jetty.
Do you know if Tomcat has this limitation?

Regards,
Bernd

Am 25.02.2011 14:54, schrieb Yonik Seeley:
> On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
> <be...@uni-bielefeld.de> wrote:
>> So Solr trunk should already handle Unicode above BMP for field type string?
>> Strange...
> 
> One issue is that jetty doesn't support UTF-8 beyond the BMP:
> 
> /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
> Solr server is up.
> HTTP GET is accepting UTF-8
> HTTP POST is accepting UTF-8
> HTTP POST defaults to UTF-8
> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
> multilingual plane
> 
> -Yonik
> http://lucidimagination.com



Re: which unicode version is supported with lucene

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
<be...@uni-bielefeld.de> wrote:
> So Solr trunk should already handle Unicode above BMP for field type string?
> Strange...

One issue is that jetty doesn't support UTF-8 beyond the BMP:

/opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
Solr server is up.
HTTP GET is accepting UTF-8
HTTP POST is accepting UTF-8
HTTP POST defaults to UTF-8
ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
multilingual plane

-Yonik
http://lucidimagination.com



Re: which unicode version is supported with lucene

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
So Solr trunk should already handle Unicode above BMP for field type string?
Strange...

Regards,
Bernd

Am 25.02.2011 14:40, schrieb Uwe Schindler:
> Solr trunk is using Lucene trunk since Lucene and Solr are merged.
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
>> -----Original Message-----
>> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
>> Sent: Friday, February 25, 2011 2:19 PM
>> To: simon.willnauer@gmail.com
>> Cc: java-user@lucene.apache.org
>> Subject: Re: which unicode version is supported with lucene
>>
>> Hi Simon,
>>
>> actually I'm working with Solr from trunk but followed the problem all the
>> way down to Lucene. I think Solr trunk is build with Lucene 3.0.3.
>>
>> My field is:
>> <field name="dcdescription" type="string" indexed="false" stored="true" />
>>
>> No analysis done at all, just stored the content for result display.
>> But the result is unpredictable and can end in invalid utf-8 code.
>>
>> Regards,
>> Bernd
>>
>>
>> Am 25.02.2011 13:43, schrieb Simon Willnauer:
>>> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
>>> <be...@uni-bielefeld.de> wrote:
>>>> Hi Simon,
>>>>
>>>> thanks for the details.
>>>>
>>>> My platform supports and uses code above BMP (0x10000 and up).
>>>> So the limit is Lucene.
>>>> Don't know how to handle this problem.
>>>> May be deleting all code above BMP...???
>>>
>>> the code will work fine even if they are in you text. It will just not
>>> respect them maybe throw them away during tokenization etc. so it
>>> really depends what you are using on the analyzer side. maybe you can
>>> give us little more details on what you use for analysis. One option
>>> would be to build 3.1 from the source and use the analyzers from
>>> there?!
>>>
>>>>
>>>> Good to hear that Lucene 3.1 will come soon.
>>>> Any rough estimation when Lucene 3.1 will be available?
>>>
>>> I hope it will happen within the next 4 weeks
>>>
>>> simon
>>>
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>>> Am 25.02.2011 12:04, schrieb Simon Willnauer:
>>>>> Hey Bernd,
>>>>>
>>>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
>>>>> <be...@uni-bielefeld.de> wrote:
>>>>>> Dear list,
>>>>>>
>>>>>> a very basic question about lucene, which version of unicode can be
>>>>>> handled (indexed and searched) with lucene?
>>>>>
>>>>> if you ask for what the indexer / query can handle then it is really
>>>>> what UTF-8 can handle. Strings passed to the writer / reader are
>>>>> converted to UTF-8 internally (rough picture). On Trunk we are
>>>>> indexing bytes only (UTF-8 bytes by default). so the question is
>>>>> really what you platform supports in terms of utilities / operations
>>>>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
>>>>> have the possibility to respect code points which are above the BMP.
>>>>> Lucene 2.9 still has Java 1.4 System Requirements that prevented us
>>>>> from moving forward to Unicode 4.0. If you look at Character.java
>>>>> all methods have been converted to operate on UTF-32 code points
>>>>> instead of UTF-16 code points in Java 1.4.
>>>>>
>>>>> Since 3.0 is a Java Generics / move to Java 1.5 only release these
>>>>> APIs are not in use yet in the latest released version. Lucene 3.1
>>>>> holds a largely converted Analyzer / TokenFilter / Tokenizer
>>>>> codebase (I think there are one or two which still have problems, I
>>>>> should check... Robert did we fix all NGram stuff?).
>>>>>
>>>>> So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
>>>>> support characters within the BMP <= 0xFFFF. 3.1 (to be released
>>>>> soon I hope) will fix most of the problems and includes ICU based
>>>>> analysis for full Unicode 5 support.
>>>>>
>>>>> hope that helps
>>>>>
>>>>> simon
>>>>>>
>>>>>> It looks like lucene can only handle the very old Unicode 2.0 but
>>>>>> not the newer 3.1 version (4 byte utf-8 unicode).
>>>>>>
>>>>>> Is that true?
>>>>>>
>>>>>> Regards,
>>>>>> Bernd
>>>>>>
>>>>



RE: which unicode version is supported with lucene

Posted by Uwe Schindler <uw...@thetaphi.de>.
Solr trunk is using Lucene trunk since Lucene and Solr are merged.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
> Sent: Friday, February 25, 2011 2:19 PM
> To: simon.willnauer@gmail.com
> Cc: java-user@lucene.apache.org
> Subject: Re: which unicode version is supported with lucene
> 
> Hi Simon,
> 
> actually I'm working with Solr from trunk but followed the problem all the
> way down to Lucene. I think Solr trunk is build with Lucene 3.0.3.
> 
> My field is:
> <field name="dcdescription" type="string" indexed="false" stored="true" />
> 
> No analysis done at all, just stored the content for result display.
> But the result is unpredictable and can end in invalid utf-8 code.
> 
> Regards,
> Bernd
> 
> 
> Am 25.02.2011 13:43, schrieb Simon Willnauer:
> > On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> > <be...@uni-bielefeld.de> wrote:
> >> Hi Simon,
> >>
> >> thanks for the details.
> >>
> >> My platform supports and uses code above BMP (0x10000 and up).
> >> So the limit is Lucene.
> >> Don't know how to handle this problem.
> >> May be deleting all code above BMP...???
> >
> > the code will work fine even if they are in you text. It will just not
> > respect them maybe throw them away during tokenization etc. so it
> > really depends what you are using on the analyzer side. maybe you can
> > give us little more details on what you use for analysis. One option
> > would be to build 3.1 from the source and use the analyzers from
> > there?!
> >
> >>
> >> Good to hear that Lucene 3.1 will come soon.
> >> Any rough estimation when Lucene 3.1 will be available?
> >
> > I hope it will happen within the next 4 weeks
> >
> > simon
> >
> >>
> >> Regards,
> >> Bernd
> >>
> >> Am 25.02.2011 12:04, schrieb Simon Willnauer:
> >>> Hey Bernd,
> >>>
> >>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
> >>> <be...@uni-bielefeld.de> wrote:
> >>>> Dear list,
> >>>>
> >>>> a very basic question about lucene, which version of unicode can be
> >>>> handled (indexed and searched) with lucene?
> >>>
> >>> if you ask for what the indexer / query can handle then it is really
> >>> what UTF-8 can handle. Strings passed to the writer / reader are
> >>> converted to UTF-8 internally (rough picture). On Trunk we are
> >>> indexing bytes only (UTF-8 bytes by default). so the question is
> >>> really what you platform supports in terms of utilities / operations
> >>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
> >>> have the possibility to respect code points which are above the BMP.
> >>> Lucene 2.9 still has Java 1.4 System Requirements that prevented us
> >>> from moving forward to Unicode 4.0. If you look at Character.java
> >>> all methods have been converted to operate on UTF-32 code points
> >>> instead of UTF-16 code points in Java 1.4.
> >>>
> >>> Since 3.0 is a Java Generics / move to Java 1.5 only release these
> >>> APIs are not in use yet in the latest released version. Lucene 3.1
> >>> holds a largely converted Analyzer / TokenFilter / Tokenizer
> >>> codebase (I think there are one or two which still have problems, I
> >>> should check... Robert did we fix all NGram stuff?).
> >>>
> >>> So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
> >>> support characters within the BMP <= 0xFFFF. 3.1 (to be released
> >>> soon I hope) will fix most of the problems and includes ICU based
> >>> analysis for full Unicode 5 support.
> >>>
> >>> hope that helps
> >>>
> >>> simon
> >>>>
> >>>> It looks like lucene can only handle the very old Unicode 2.0 but
> >>>> not the newer 3.1 version (4 byte utf-8 unicode).
> >>>>
> >>>> Is that true?
> >>>>
> >>>> Regards,
> >>>> Bernd
> >>>>
> >>
> 
> --
> *************************************************************
> Bernd Fehling                Universitätsbibliothek Bielefeld
> Dipl.-Inform. (FH)                        Universitätsstr. 25
> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
> bernd.fehling@uni-bielefeld.de                33615 Bielefeld
> 
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
> 





Re: which unicode version is supported with lucene

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Hi Simon,

Actually I'm working with Solr from trunk but followed the problem
all the way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.

My field is:
<field name="dcdescription" type="string" indexed="false" stored="true" />

No analysis is done at all; the content is just stored for result display.
But the result is unpredictable and can end up as invalid UTF-8.
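
One way to narrow down where the corruption happens is to check the
value right before sending the document and right after reading the
stored field back, and see whether it survives a UTF-8 round trip.
A small sketch with plain JDK classes (nothing Solr-specific; the class
and method names are made up for this example):

    public class Utf8RoundTrip {
        // Returns true if the string survives encoding to UTF-8 and back
        // unchanged, i.e. it contains no unpaired surrogates.
        static boolean isValidUtf8(String value) throws Exception {
            String roundTripped = new String(value.getBytes("UTF-8"), "UTF-8");
            return roundTripped.equals(value);
        }

        public static void main(String[] args) throws Exception {
            String good = new String(Character.toChars(0x10308)); // proper surrogate pair
            String bad  = "\uD800";                               // lone high surrogate
            System.out.println(isValidUtf8(good)); // true
            System.out.println(isValidUtf8(bad));  // false
        }
    }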

Regards,
Bernd


Am 25.02.2011 13:43, schrieb Simon Willnauer:
> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> <be...@uni-bielefeld.de> wrote:
>> Hi Simon,
>>
>> thanks for the details.
>>
>> My platform supports and uses code above BMP (0x10000 and up).
>> So the limit is Lucene.
>> Don't know how to handle this problem.
>> May be deleting all code above BMP...???
> 
> the code will work fine even if they are in you text. It will just not
> respect them maybe throw them away during tokenization etc. so it
> really depends what you are using on the analyzer side. maybe you can
> give us little more details on what you use for analysis. One option
> would be to build 3.1 from the source and use the analyzers from
> there?!
> 
>>
>> Good to hear that Lucene 3.1 will come soon.
>> Any rough estimation when Lucene 3.1 will be available?
> 
> I hope it will happen within the next 4 weeks
> 
> simon
> 
>>
>> Regards,
>> Bernd
>>
>> Am 25.02.2011 12:04, schrieb Simon Willnauer:
>>> Hey Bernd,
>>>
>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
>>> <be...@uni-bielefeld.de> wrote:
>>>> Dear list,
>>>>
>>>> a very basic question about lucene, which version of
>>>> unicode can be handled (indexed and searched) with lucene?
>>>
>>> if you ask for what the indexer / query can handle then it is really
>>> what UTF-8 can handle. Strings passed to the writer / reader are
>>> converted to UTF-8 internally (rough picture). On Trunk we are
>>> indexing bytes only (UTF-8 bytes by default). so the question is
>>> really what you platform supports in terms of utilities / operations
>>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
>>> have the possibility to respect code points which are above the BMP.
>>> Lucene 2.9 still has Java 1.4 System Requirements that prevented us
>>> from moving forward to Unicode 4.0. If you look at Character.java all
>>> methods have been converted to operate on UTF-32 code points instead
>>> of UTF-16 code points in Java 1.4.
>>>
>>> Since 3.0 is a Java Generics / move to Java 1.5 only release these
>>> APIs are not in use yet in the latest released version. Lucene 3.1
>>> holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
>>> (I think there are one or two which still have problems, I should
>>> check... Robert did we fix all NGram stuff?).
>>>
>>> So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
>>> support characters within the BMP <= 0xFFFF. 3.1 (to be released soon
>>> I hope) will fix most of the problems and includes ICU based analysis
>>> for full Unicode 5 support.
>>>
>>> hope that helps
>>>
>>> simon
>>>>
>>>> It looks like lucene can only handle the very old Unicode 2.0
>>>> but not the newer 3.1 version (4 byte utf-8 unicode).
>>>>
>>>> Is that true?
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>

-- 
*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                        Universitätsstr. 25
Tel. +49 521 106-4060                   Fax. +49 521 106-4052
bernd.fehling@uni-bielefeld.de                33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************



Re: which unicode version is supported with lucene

Posted by Simon Willnauer <si...@googlemail.com>.
On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
<be...@uni-bielefeld.de> wrote:
> Hi Simon,
>
> thanks for the details.
>
> My platform supports and uses code above BMP (0x10000 and up).
> So the limit is Lucene.
> Don't know how to handle this problem.
> May be deleting all code above BMP...???

The code will work fine even if they are in your text. It will just
not respect them, and may throw them away during tokenization etc., so
it really depends on what you are using on the analyzer side. Maybe you
can give us a little more detail on what you use for analysis. One
option would be to build 3.1 from source and use the analyzers from
there?!
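
A quick way to see what a particular analyzer actually does with a
supplementary character is to run it by hand. A minimal sketch against
the 3.0-era analysis API (class and attribute names as in Lucene 3.0;
adjust for your version, and the class name is made up):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalyzeCheck {
        public static void main(String[] args) throws Exception {
            // a token containing U+10308, a supplementary character
            String text = "foo" + new String(Character.toChars(0x10308)) + "bar";
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            while (ts.incrementToken()) {
                String t = term.term();
                // print each term and its code point count to see whether
                // the supplementary character survived tokenization
                System.out.println(t + " " + t.codePointCount(0, t.length()));
            }
        }
    }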

>
> Good to hear that Lucene 3.1 will come soon.
> Any rough estimation when Lucene 3.1 will be available?

I hope it will happen within the next 4 weeks

simon

>
> Regards,
> Bernd
>
> Am 25.02.2011 12:04, schrieb Simon Willnauer:
>> Hey Bernd,
>>
>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
>> <be...@uni-bielefeld.de> wrote:
>>> Dear list,
>>>
>>> a very basic question about lucene, which version of
>>> unicode can be handled (indexed and searched) with lucene?
>>
>> if you ask for what the indexer / query can handle then it is really
>> what UTF-8 can handle. Strings passed to the writer / reader are
>> converted to UTF-8 internally (rough picture). On Trunk we are
>> indexing bytes only (UTF-8 bytes by default). so the question is
>> really what you platform supports in terms of utilities / operations
>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
>> have the possibility to respect code points which are above the BMP.
>> Lucene 2.9 still has Java 1.4 System Requirements that prevented us
>> from moving forward to Unicode 4.0. If you look at Character.java all
>> methods have been converted to operate on UTF-32 code points instead
>> of UTF-16 code points in Java 1.4.
>>
>> Since 3.0 is a Java Generics / move to Java 1.5 only release these
>> APIs are not in use yet in the latest released version. Lucene 3.1
>> holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
>> (I think there are one or two which still have problems, I should
>> check... Robert did we fix all NGram stuff?).
>>
>> So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
>> support characters within the BMP <= 0xFFFF. 3.1 (to be released soon
>> I hope) will fix most of the problems and includes ICU based analysis
>> for full Unicode 5 support.
>>
>> hope that helps
>>
>> simon
>>>
>>> It looks like lucene can only handle the very old Unicode 2.0
>>> but not the newer 3.1 version (4 byte utf-8 unicode).
>>>
>>> Is that true?
>>>
>>> Regards,
>>> Bernd
>>>
>



Re: which unicode version is supported with lucene

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Hi Simon,

thanks for the details.

My platform supports and uses code points above the BMP (0x10000 and up).
So the limit is Lucene.
I don't know how to handle this problem.
Maybe delete all code points above the BMP...???

Good to hear that Lucene 3.1 will come soon.
Any rough estimation when Lucene 3.1 will be available?

Regards,
Bernd

Am 25.02.2011 12:04, schrieb Simon Willnauer:
> Hey Bernd,
> 
> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
> <be...@uni-bielefeld.de> wrote:
>> Dear list,
>>
>> a very basic question about lucene, which version of
>> unicode can be handled (indexed and searched) with lucene?
> 
> if you ask for what the indexer / query can handle then it is really
> what UTF-8 can handle. Strings passed to the writer / reader are
> converted to UTF-8 internally (rough picture). On Trunk we are
> indexing bytes only (UTF-8 bytes by default). so the question is
> really what you platform supports in terms of utilities / operations
> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
> have the possibility to respect code points which are above the BMP.
> Lucene 2.9 still has Java 1.4 System Requirements that prevented us
> from moving forward to Unicode 4.0. If you look at Character.java all
> methods have been converted to operate on UTF-32 code points instead
> of UTF-16 code points in Java 1.4.
> 
> Since 3.0 is a Java Generics / move to Java 1.5 only release these
> APIs are not in use yet in the latest released version. Lucene 3.1
> holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
> (I think there are one or two which still have problems, I should
> check... Robert did we fix all NGram stuff?).
> 
> So long story short 3.0 / 2.9 Tokenizer and TokenFilter will only
> support characters within the BMP <= 0xFFFF. 3.1 (to be released soon
> I hope) will fix most of the problems and includes ICU based analysis
> for full Unicode 5 support.
> 
> hope that helps
> 
> simon
>>
>> It looks like lucene can only handle the very old Unicode 2.0
>> but not the newer 3.1 version (4 byte utf-8 unicode).
>>
>> Is that true?
>>
>> Regards,
>> Bernd
>>



Re: which unicode version is supported with lucene

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Feb 25, 2011 at 6:04 AM, Simon Willnauer <
simon.willnauer@googlemail.com> wrote:

> Since 3.0 is a Java Generics / move to Java 1.5 only release these
> APIs are not in use yet in the latest released version. Lucene 3.1
> holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
> (I think there are one or two which still have problems, I should
> check... Robert did we fix all NGram stuff?).
>
>
No... and honestly they have other serious problems (such as only looking at
the first 1024 chars of input in the document; see the JIRA issues). I
recommend against using them in general, but definitely avoid them if you have
code points outside of the BMP...

Re: which unicode version is supported with lucene

Posted by Simon Willnauer <si...@googlemail.com>.
Hey Bernd,

On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
<be...@uni-bielefeld.de> wrote:
> Dear list,
>
> a very basic question about lucene, which version of
> unicode can be handled (indexed and searched) with lucene?

If you ask what the indexer / query can handle, then it is really
what UTF-8 can handle. Strings passed to the writer / reader are
converted to UTF-8 internally (rough picture). On trunk we are
indexing bytes only (UTF-8 bytes by default). So the question is
really what your platform supports in terms of utilities / operations
on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
have the possibility to respect code points which are above the BMP.
Lucene 2.9 still had Java 1.4 system requirements, which prevented us
from moving forward to Unicode 4.0. If you look at Character.java, in
Java 1.5 all methods have been converted to operate on full UTF-32
code points instead of the 16-bit code units of Java 1.4.
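
To make the char vs. code point difference concrete, a minimal sketch
using only plain JDK methods (no Lucene APIs involved; the class name
is made up for the example):

    public class CodePointDemo {
        public static void main(String[] args) {
            // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP
            String s = "A" + new String(Character.toChars(0x1D11E));
            System.out.println(s.length());                       // 3: 'A' plus a surrogate pair
            System.out.println(s.codePointCount(0, s.length()));  // 2 code points
            System.out.println(Integer.toHexString(s.charAt(1))); // d834, only the high surrogate
            System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e, the full code point
        }
    }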

Since 3.0 was mainly a Java Generics / move-to-Java-1.5 release, these
APIs are not yet in use in the latest released version. Lucene 3.1
holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
(I think there are one or two which still have problems, I should
check... Robert, did we fix all the NGram stuff?).

So, long story short: the 3.0 / 2.9 Tokenizers and TokenFilters will only
support characters within the BMP (<= 0xFFFF). 3.1 (to be released soon,
I hope) will fix most of the problems and includes ICU-based analysis
for full Unicode 5 support.

hope that helps

simon
>
> It looks like lucene can only handle the very old Unicode 2.0
> but not the newer 3.1 version (4 byte utf-8 unicode).
>
> Is that true?
>
> Regards,
> Bernd
>
>
>
