Posted to dev@lucene.apache.org by "DM Smith (JIRA)" <ji...@apache.org> on 2009/08/11 05:54:15 UTC

[jira] Created: (LUCENE-1799) Unicode compression

Unicode compression
-------------------

                 Key: LUCENE-1799
                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Store
    Affects Versions: 2.4.1
            Reporter: DM Smith
            Priority: Minor


In LUCENE-1793, there is the off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer. The original supposition was that it provided a more compact index.

This led to the comment that a different or compressed encoding would be a generally useful feature. 

BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provides its own implementation, a freely available, royalty-free license would need to be obtained.

SCSU is another Unicode compression algorithm that could be used. 

An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the input) could be used.
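
As a rough illustration of the last point, the sketch below uses only the JDK's java.nio.charset classes to test whether a single-byte charset covers a given input and to compare encoded sizes. The class and method names (EncodingSizeSketch, covers, encodedSize, choose) are hypothetical and are not part of Lucene; BOCU-1 and SCSU are not included because the standard JDK does not ship them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Lucene. */
final class EncodingSizeSketch {

    /** True if every character of text can be represented in cs. */
    static boolean covers(Charset cs, String text) {
        // A fresh encoder per call keeps this thread-safe; canEncode checks the whole sequence.
        return cs.newEncoder().canEncode(text);
    }

    /** Number of bytes text occupies when encoded with cs. */
    static int encodedSize(Charset cs, String text) {
        return text.getBytes(cs).length;
    }

    /** ISO-8859-1 if it covers the input, otherwise UTF-8 as a whole-of-Unicode fallback. */
    static Charset choose(String text) {
        return covers(StandardCharsets.ISO_8859_1, text)
                ? StandardCharsets.ISO_8859_1
                : StandardCharsets.UTF_8;
    }
}
```

For Latin-1 text the single-byte charset is never larger than UTF-8, while Devanagari text is not covered at all and costs three bytes per character in UTF-8, which is the kind of case the compression schemes above are aimed at.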



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-1799) Unicode compression

Posted by Robert Muir <rc...@gmail.com>.

doh! well if you have it, that will be very handy for verification.
I'll create a separate issue for this shortly, maybe you can review the
patch

Thanks,
Robert

On Thu, Nov 19, 2009 at 1:06 PM, Steven A Rowe <sa...@syr.edu> wrote:

> Hi Robert,
>
> Ack, actually two days ago I updated my Lucene trunk checkout and removed
> that code, thinking its utility had evaporated!
>
> But maybe IntelliJ will save my bacon in its local history cache.  (Praise
> IntelliJ!)  I'll check tonight when I get home.
>
> Steve
>
> On 11/19/2009 at 10:16 AM, Robert Muir wrote:
> > Steven, do you still have a test setup to measure collation key
> > generation performance with Lucene?
> >
> > On Thu, Nov 19, 2009 at 9:38 AM, Steven A Rowe <sa...@syr.edu> wrote:
> >
> > > Hi Robert,
> > >
> > > On 11/18/2009 at 7:16 PM, Robert Muir wrote:
> > > > Looking at the collation support, we could maybe improve
> > > > IndexableBinaryStringTools by using char[]/byte[] with offset and
> > > > length. The existing ByteBuffer/CharBuffer methods could stay, they are
> > > > consistent with Charset api and are not wrong imo, but instead defer to
> > > > the new char[]/byte[] ones... the current buffer-based ones require the
> > > > buffer to have a backing array anyway or will throw an exception.
> > >
> > > +1
> > >
> > > I used *Buffers because I thought it simplified method
> > > prototypes, no other reason.
> > >
> > > Steve
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com


-- 
Robert Muir
rcmuir@gmail.com

RE: [jira] Commented: (LUCENE-1799) Unicode compression

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Robert,

Ack, actually two days ago I updated my Lucene trunk checkout and removed that code, thinking its utility had evaporated!

But maybe IntelliJ will save my bacon in its local history cache.  (Praise IntelliJ!)  I'll check tonight when I get home.

Steve

On 11/19/2009 at 10:16 AM, Robert Muir wrote:
> Steven, do you still have a test setup to measure collation key
> generation performance with Lucene?
>
> On Thu, Nov 19, 2009 at 9:38 AM, Steven A Rowe <sa...@syr.edu> wrote:
>
> > Hi Robert,
> >
> > On 11/18/2009 at 7:16 PM, Robert Muir wrote:
> > > Looking at the collation support, we could maybe improve
> > > IndexableBinaryStringTools by using char[]/byte[] with offset and
> > > length. The existing ByteBuffer/CharBuffer methods could stay, they are
> > > consistent with Charset api and are not wrong imo, but instead defer to
> > > the new char[]/byte[] ones... the current buffer-based ones require the
> > > buffer to have a backing array anyway or will throw an exception.
> >
> > +1
> >
> > I used *Buffers because I thought it simplified method
> > prototypes, no other reason.
> >
> > Steve
>
> --
> Robert Muir
> rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1799) Unicode compression

Posted by Robert Muir <rc...@gmail.com>.

Steven, do you still have a test setup to measure collation key generation
performance with Lucene?

On Thu, Nov 19, 2009 at 9:38 AM, Steven A Rowe <sa...@syr.edu> wrote:

> Hi Robert,
>
> On 11/18/2009 at 7:16 PM, Robert Muir wrote:
> > Looking at the collation support, we could maybe improve
> > IndexableBinaryStringTools by using char[]/byte[] with offset and
> > length. The existing ByteBuffer/CharBuffer methods could stay, they are
> > consistent with Charset api and are not wrong imo, but instead defer to
> > the new char[]/byte[] ones... the current buffer-based ones require the
> > buffer to have a backing array anyway or will throw an exception.
>
> +1
>
> I used *Buffers because I thought it simplified method prototypes, no other
> reason.
>
> Steve
>
>


-- 
Robert Muir
rcmuir@gmail.com

RE: [jira] Commented: (LUCENE-1799) Unicode compression

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Robert,

On 11/18/2009 at 7:16 PM, Robert Muir wrote:
> Looking at the collation support, we could maybe improve
> IndexableBinaryStringTools by using char[]/byte[] with offset and
> length. The existing ByteBuffer/CharBuffer methods could stay, they are
> consistent with Charset api and are not wrong imo, but instead defer to
> the new char[]/byte[] ones... the current buffer-based ones require the
> buffer to have a backing array anyway or will throw an exception.

+1

I used *Buffers because I thought it simplified method prototypes, no other reason.

Steve

Re: [jira] Commented: (LUCENE-1799) Unicode compression

Posted by Robert Muir <rc...@gmail.com>.

btw, does anyone have a guess at how expensive this
ByteBuffer/CharBuffer.wrap() is?

Looking at the collation support, we could maybe improve
IndexableBinaryStringTools by using char[]/byte[] with offset and length.
The existing ByteBuffer/CharBuffer methods could stay, they are consistent
with Charset api and are not wrong imo,
but instead defer to the new char[]/byte[] ones... the current buffer-based
ones require the buffer to have a backing array anyway or will throw an
exception.
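
A minimal sketch of the delegation idea, under stated assumptions: the class name BufferDelegationSketch and the toy hex encoding are hypothetical stand-ins, not the real IndexableBinaryStringTools code or its encoding. Only the shape matters here: an array-based core method taking offset and length, plus a buffer-based overload that checks for a backing array and defers to it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

/** Hypothetical sketch; the hex encoding is only a placeholder for a real binary-to-char scheme. */
final class BufferDelegationSketch {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Array-based core: writes two hex chars per input byte and returns the number of chars written. */
    static int encode(byte[] in, int inOffset, int inLength, char[] out, int outOffset) {
        for (int i = 0; i < inLength; i++) {
            int b = in[inOffset + i] & 0xFF;
            out[outOffset + 2 * i] = HEX[b >>> 4];
            out[outOffset + 2 * i + 1] = HEX[b & 0x0F];
        }
        return 2 * inLength;
    }

    /** Buffer-based overload: requires backing arrays, then defers to the array-based method. */
    static void encode(ByteBuffer in, CharBuffer out) {
        if (!in.hasArray() || !out.hasArray()) {
            throw new IllegalArgumentException("both buffers must have backing arrays");
        }
        if (out.remaining() < 2 * in.remaining()) {
            throw new IllegalArgumentException("output buffer too small");
        }
        // arrayOffset() + position() maps the buffer's current window onto the backing array.
        int written = encode(in.array(), in.arrayOffset() + in.position(), in.remaining(),
                             out.array(), out.arrayOffset() + out.position());
        // Leave the output buffer's window covering exactly the encoded chars.
        out.limit(out.position() + written);
    }
}
```

With this split, callers that already hold a byte[] or char[] skip the wrap() call entirely, while existing buffer-based callers keep working unchanged.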

On Wed, Nov 18, 2009 at 2:12 PM, Earwin Burrfoot (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779602#action_12779602]
>
> Earwin Burrfoot commented on LUCENE-1799:
> -----------------------------------------
>
> bq. as far as the encoding itself, BOCU-1 is available in the ICU library
> ICU's API requires to use ByteBuffer and CharBuffer for input/output. And
> even if I missed some nice method, encoder/decoder operates internally on
> said buffers. Thus, a wrap/unwrap for each String is inevitable.


-- 
Robert Muir
rcmuir@gmail.com

[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779577#action_12779577 ] 

Robert Muir commented on LUCENE-1799:
-------------------------------------

bq. The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors. 

Thanks Mike! as far as the encoding itself, BOCU-1 is available in the ICU library, so we do not need to implement it and deal with the conformance/patent stuff
(To get the royalty-free patent you must be "fully compliant", they have already done this).

If this feature is desired, I think something like a Codec in contrib that encodes the index with BOCU-1 from ICU would be the best.



[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779629#action_12779629 ] 

Robert Muir commented on LUCENE-1799:
-------------------------------------

Earwin, i do not really like this implementation either.

So it would of course be better to have something more suitable, similar to UnicodeUtil; plus you could ditch the lib dependency.
But then I guess we have to deal with this patent thing... I do not really know what is involved with that.


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779621#action_12779621 ] 

Robert Muir commented on LUCENE-1799:
-------------------------------------

bq. ICU's API requires to use ByteBuffer and CharBuffer for input/output. And even if I missed some nice method, encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.
Earwin, at least in ICU trunk you have the following (public class) in com.ibm.icu.impl.BOCU: 

{code}
public static int compress(String source, byte buffer[], int offset)
public static int getCompressionLength(String source) 
...
{code}

But I think this class only supports encoding, not decoding (it is only used by the Collation API for so-called BOCSU).
For decoding, we might have to use the registered charset and ByteBuffer... unless there's another way.
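
To make the "registered charset and ByteBuffer" route concrete, here is a hedged sketch using only java.nio.charset from the JDK. It assumes that some charset provider on the classpath registers the name "BOCU-1" (ICU4J's charset module is expected to, but that is not verified here) and falls back to UTF-8 otherwise. The class CharsetRoundTripSketch and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of per-String wrap/encode and wrap/decode through the Charset API. */
final class CharsetRoundTripSketch {

    /** Assumption: "BOCU-1" is only available if a provider (e.g. ICU4J) registers it. */
    static Charset bocu1OrUtf8() {
        return Charset.isSupported("BOCU-1") ? Charset.forName("BOCU-1") : StandardCharsets.UTF_8;
    }

    /** Wraps the String in a CharBuffer, encodes it, and copies the result out of the ByteBuffer. */
    static byte[] encode(Charset cs, String s) {
        ByteBuffer bb = cs.encode(CharBuffer.wrap(s));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    /** Wraps a slice of the byte[] in a ByteBuffer and decodes it back to a String. */
    static String decode(Charset cs, byte[] bytes, int offset, int length) {
        return cs.decode(ByteBuffer.wrap(bytes, offset, length)).toString();
    }
}
```

Every call allocates at least one wrapper buffer and one result buffer per String, which is the per-term overhead being questioned in this thread.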


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779682#action_12779682 ] 

Earwin Burrfoot commented on LUCENE-1799:
-----------------------------------------

bq. but then i guess we have to deal with this patent thing... i do not really know what is involved with that.
CPAN holds a BOCU-1 implementation, derived from the "Sample C code", with all necessary copyrights and the patent mentioned, but there's no word of them formally obtaining a license. I'm not sure if this is okay, or just overlooked.


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779571#action_12779571 ] 

Michael McCandless commented on LUCENE-1799:
--------------------------------------------

The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors.


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780129#action_12780129 ] 

DM Smith commented on LUCENE-1799:
----------------------------------

The sample code is probably what is on this page, here:
    http://unicode.org/notes/tn6/#Sample_Code

From what I gather reading the whole page:
If we port the sample code and the test case and then provide a demonstration that all tests pass, then we will be granted a license.

There's contact info at the bottom of the page for getting the license. Maybe, contact them for clarification?

As the code is fairly small, I don't think it would be too hard to port. The trick is that the sample code appears to deal in 32-bit arrays and we'd probably want a byte[].


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779580#action_12779580 ] 

Robert Muir commented on LUCENE-1799:
-------------------------------------

by the way, here are even more details on BOCU, including more in-depth size and performance, at least compared to the UTN:
http://icu-project.org/repos/icu/icuhtml/trunk/design/conversion/bocu1/bocu1.html


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779576#action_12779576 ] 

Mark Miller commented on LUCENE-1799:
-------------------------------------

pretty simple though, isn't it? Just pull the vector reader/writer from the codec as well?


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779602#action_12779602 ] 

Earwin Burrfoot commented on LUCENE-1799:
-----------------------------------------

bq. as far as the encoding itself, BOCU-1 is available in the ICU library
ICU's API requires using ByteBuffer and CharBuffer for input/output. And even if I missed some nice method, the encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779510#action_12779510 ] 

Earwin Burrfoot commented on LUCENE-1799:
-----------------------------------------

> Earwin, if implemented as a directory, we lose many of the advantages.
Damn. I believed all strings pass through read/writeString() on IndexInput/Output. Naive. Well, one can patch UnicodeUtil, but the solution is no longer elegant.
Waiting for flexible indexing, hoping it's gonna be flexible..


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741868#action_12741868 ] 

Earwin Burrfoot commented on LUCENE-1799:
-----------------------------------------

I think right now this can be implemented as a delegating Directory.


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779442#action_12779442 ] 

Robert Muir commented on LUCENE-1799:
-------------------------------------

Earwin, if implemented as a directory, we lose many of the advantages.

For example, if you are using BOCU-1, let's say with Hindi, then according to the stats here: http://unicode.org/notes/tn6/#Performance
* you can encode/decode BOCU-1 to/from UTF-16 more than twice as fast as you can encode/decode UTF-8 to/from UTF-16 (for this language)
* also, the resulting bytes are less than half the size of UTF-8 (for this language), yet sort order is still preserved, so it should work for the term dictionary, etc.

Note: I have never measured BOCU-1 performance in practice.
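
To make the size and sort-order points concrete, here is a minimal JDK-only sketch. It does not run BOCU-1 itself (that would need ICU); it only shows the UTF-8 baseline BOCU-1 is being compared against, and what "sort order is preserved" means for a byte-sorted structure. The class name, helper names, and sample strings are made up for illustration and are not from Lucene or ICU.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Baseline {
    // Hypothetical sample text: the Hindi word "namaste", six code points, all Devanagari.
    static final String HINDI = "\u0928\u092E\u0938\u094D\u0924\u0947";

    // Number of bytes the string occupies in the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Compares byte arrays as unsigned bytes, the way a byte-sorted term list would.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Devanagari needs 3 bytes per character in UTF-8 but only 2 in UTF-16.
        // UTF_16BE is used so that no byte-order mark is counted.
        System.out.println("UTF-8:  " + encodedLength(HINDI, StandardCharsets.UTF_8) + " bytes");
        System.out.println("UTF-16: " + encodedLength(HINDI, StandardCharsets.UTF_16BE) + " bytes");

        String bmp = "\uFF5E";                 // U+FF5E, one UTF-16 code unit
        String supplementary = "\uD800\uDC00"; // U+10000, a surrogate pair

        // Code point order puts U+FF5E before U+10000.
        // UTF-8 bytes agree with that order; raw UTF-16 code units do not.
        int byUtf8Bytes = compareUnsigned(
                bmp.getBytes(StandardCharsets.UTF_8),
                supplementary.getBytes(StandardCharsets.UTF_8));
        int byUtf16Units = bmp.compareTo(supplementary);
        System.out.println("UTF-8 byte order:       " + (byUtf8Bytes < 0 ? "U+FF5E first" : "U+10000 first"));
        System.out.println("UTF-16 code unit order: " + (byUtf16Units < 0 ? "U+FF5E first" : "U+10000 first"));
    }
}
```

For this sample, UTF-8 takes 18 bytes against 12 for UTF-16, which is the overhead a compact encoding would be attacking. The second half shows why code point order matters: an encoding whose byte order matches code point order, as UTF-8 does and as BOCU-1 is designed to, can replace UTF-8 in a byte-sorted term list without changing term order.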

I took a look at the flex indexing branch, and it looks like this might be possible in the future through a codec...


[jira] Commented: (LUCENE-1799) Unicode compression

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779513#action_12779513 ] 

Robert Muir commented on LUCENE-1799:
-------------------------------------

> Waiting for flexible indexing, hoping it's gonna be flexible..

It looked to me, at a glance, that some things would still be weird. For example, TermVectors aren't "flexible" yet, so they wouldn't be BOCU-1.
I do not know whether, in flexible indexing, it will be possible for a codec to change behavior like this...
Maybe someone knows whether this is planned eventually?

