Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2019/12/26 13:17:37 UTC

[GitHub] [lucene-solr] jpountz opened a new pull request #1126: LUCENE-4702: Terms dictionary compression.

jpountz opened a new pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126
 
 
   Compress blocks of suffixes in order to make the terms dictionary more
   space-efficient. Two compression algorithms are used depending on which one is
   more space-efficient:
    - LowercaseAsciiCompression, which applies when all bytes are in the
      `[0x1F,0x3F)` or `[0x5F,0x7F)` ranges, which notably include all digits,
      lowercase ASCII characters, '.', '-' and '_', and encodes 4 chars on 3 bytes.
      It is very often applicable on analyzed content and decompresses very quickly
      thanks to auto-vectorization support in the JVM.
    - LZ4, when the compression ratio is less than 0.75.
   
   I was a bit unhappy with the complexity of the high-compression LZ4 option, so
   I simplified it in order to only keep the logic that detects duplicate strings.
   The logic about what to do in case overlapping matches are found, which was
   responsible for most of the complexity while only yielding tiny benefits, has
   been removed.
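
   The "4 chars on 3 bytes" arithmetic follows from the two ranges above: each
   covers 32 byte values, so any qualifying byte fits in 6 bits, and four 6-bit
   values fill exactly 24 bits. Below is a minimal sketch of that idea only. The
   class and method names are hypothetical, and the bit layout is an assumption
   chosen for readability; it is not the layout LowercaseAsciiCompression
   actually writes, which is organized so that the JVM can auto-vectorize it.

```java
// Hypothetical illustration; NOT the on-disk format used by Lucene.
final class SixBitPackingSketch {

  /** True if the byte falls in [0x1F,0x3F) or [0x5F,0x7F). */
  public static boolean isCompressible(byte b) {
    return (b >= 0x1F && b < 0x3F) || (b >= 0x5F && b < 0x7F);
  }

  /** Maps [0x1F,0x3F) to 0..31 and [0x5F,0x7F) to 32..63. */
  public static int toSixBits(byte b) {
    if (!isCompressible(b)) {
      throw new IllegalArgumentException("byte out of range: " + b);
    }
    return b < 0x3F ? b - 0x1F : b - 0x3F;
  }

  /** Inverse of {@link #toSixBits(byte)}. */
  public static byte fromSixBits(int v) {
    return (byte) (v < 32 ? v + 0x1F : v + 0x3F);
  }

  /** Packs groups of 4 input bytes into 3 output bytes. */
  public static byte[] pack(byte[] in) {
    if (in.length % 4 != 0) {
      throw new IllegalArgumentException("sketch only handles multiples of 4");
    }
    byte[] out = new byte[in.length / 4 * 3];
    for (int i = 0, o = 0; i < in.length; i += 4, o += 3) {
      int bits = (toSixBits(in[i]) << 18)
          | (toSixBits(in[i + 1]) << 12)
          | (toSixBits(in[i + 2]) << 6)
          | toSixBits(in[i + 3]);
      out[o] = (byte) (bits >>> 16);
      out[o + 1] = (byte) (bits >>> 8);
      out[o + 2] = (byte) bits;
    }
    return out;
  }

  /** Restores the original bytes from the packed form. */
  public static byte[] unpack(byte[] packed) {
    if (packed.length % 3 != 0) {
      throw new IllegalArgumentException("packed length must be a multiple of 3");
    }
    byte[] out = new byte[packed.length / 3 * 4];
    for (int i = 0, o = 0; i < packed.length; i += 3, o += 4) {
      int bits = ((packed[i] & 0xFF) << 16)
          | ((packed[i + 1] & 0xFF) << 8)
          | (packed[i + 2] & 0xFF);
      out[o] = fromSixBits((bits >>> 18) & 0x3F);
      out[o + 1] = fromSixBits((bits >>> 12) & 0x3F);
      out[o + 2] = fromSixBits((bits >>> 6) & 0x3F);
      out[o + 3] = fromSixBits(bits & 0x3F);
    }
    return out;
  }
}
```

   A 12-byte input such as `lucene-8.4_0` packs into 9 bytes, which is the 0.75
   ratio implied by 3 bytes per 4 chars. Uppercase letters fall outside both
   ranges, so a block containing them would not qualify under this sketch.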

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362484551
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   I don't view the public methods of an Enum as "internals".



[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362477434
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
 
 Review comment:
   What's wrong with the default enum toString?



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362740669
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/compress/LZ4.java
 ##########
 @@ -0,0 +1,397 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.compress;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Objects;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.packed.PackedInts;
+
+/**
+ * LZ4 compression and decompression routines.
+ *
+ * http://code.google.com/p/lz4/
+ * http://fastcompression.blogspot.fr/p/lz4.html
 
 Review comment:
   Not ASL2 but an ASL2-compatible license: BSD 2-clause.



[GitHub] [lucene-solr] dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362462638
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnumFrame.java
 ##########
 @@ -163,15 +179,48 @@ void loadBlock() throws IOException {
     // instead of linear scan to find target term; eg
     // we could have simple array of offsets
 
+    final long startSuffixFP = ste.in.getFilePointer();
     // term suffixes:
-    code = ste.in.readVInt();
-    isLeafBlock = (code & 1) != 0;
-    int numBytes = code >>> 1;
-    if (suffixBytes.length < numBytes) {
-      suffixBytes = new byte[ArrayUtil.oversize(numBytes, 1)];
+    if (version >= BlockTreeTermsReader.VERSION_COMPRESSED_SUFFIXES) {
+      final long codeL = ste.in.readVLong();
+      isLeafBlock = (codeL & 0x04) != 0;
+      final int numSuffixBytes = (int) (codeL >>> 3);
+      if (suffixBytes.length < numSuffixBytes) {
+        suffixBytes = new byte[ArrayUtil.oversize(numSuffixBytes, 1)];
+      }
+      compressionAlg = (int) codeL & 0x03;
 
 Review comment:
   Even fancier than I thought it'd have to be! Looks great, I think.
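
   For readers following the diff, the header token read above carries three
   fields in one variable-length long: the two low bits hold the compression
   algorithm code, bit 2 holds the leaf-block flag, and the remaining bits hold
   the suffix byte count. The sketch below mirrors the masks and shifts from the
   quoted reader code; the class name and the `encode` method are hypothetical,
   since the writer side is not shown in this thread.

```java
// Hypothetical helper; the decode side mirrors the quoted diff, the encode side is assumed.
final class SuffixHeaderSketch {

  /** Builds the token: numSuffixBytes << 3 | leaf flag (0x04) | algorithm code (0x03 mask). */
  public static long encode(int numSuffixBytes, boolean isLeafBlock, int compressionCode) {
    if (numSuffixBytes < 0 || compressionCode < 0 || compressionCode > 3) {
      throw new IllegalArgumentException("value does not fit in the header");
    }
    // Widen to long before shifting so a large byte count cannot overflow.
    long token = ((long) numSuffixBytes) << 3;
    if (isLeafBlock) {
      token |= 0x04;
    }
    return token | compressionCode;
  }

  public static int numSuffixBytes(long token) {
    return (int) (token >>> 3);
  }

  public static boolean isLeafBlock(long token) {
    return (token & 0x04) != 0;
  }

  public static int compressionCode(long token) {
    return (int) token & 0x03;
  }
}
```

   Shifting by 3 instead of 1 is presumably why the new code reads a VLong rather
   than a VInt: a byte count close to `Integer.MAX_VALUE` no longer fits in 32
   bits once three flag bits are added.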



[GitHub] [lucene-solr] dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r361624773
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnumFrame.java
 ##########
 @@ -163,15 +179,48 @@ void loadBlock() throws IOException {
     // instead of linear scan to find target term; eg
     // we could have simple array of offsets
 
+    final long startSuffixFP = ste.in.getFilePointer();
     // term suffixes:
-    code = ste.in.readVInt();
-    isLeafBlock = (code & 1) != 0;
-    int numBytes = code >>> 1;
-    if (suffixBytes.length < numBytes) {
-      suffixBytes = new byte[ArrayUtil.oversize(numBytes, 1)];
+    if (version >= BlockTreeTermsReader.VERSION_COMPRESSED_SUFFIXES) {
+      final long codeL = ste.in.readVLong();
+      isLeafBlock = (codeL & 0x04) != 0;
+      final int numSuffixBytes = (int) (codeL >>> 3);
+      if (suffixBytes.length < numSuffixBytes) {
+        suffixBytes = new byte[ArrayUtil.oversize(numSuffixBytes, 1)];
+      }
+      compressionAlg = (int) codeL & 0x03;
 
 Review comment:
   An enum with final code field for clarity in this and other places, maybe? Should be effectively the same at runtime.



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362519930
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   That doesn't address the concern that something as innocuous-looking as reordering constants, or adding one anywhere but at the end, would break serialization. I have a slight preference for being explicit, like Dawid, but I've been in the middle of this controversy many times already, and in the end it's a matter of robustness vs. conciseness. If you feel strongly about it, @dsmiley, I'll switch back to using `#ordinal()`.
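
   To make the trade-off concrete, here is a minimal sketch of the explicit-code
   approach. The constant names and codes come from the quoted diff; the
   enclosing class, the `byCode` method name, and the error messages are
   assumptions, because the diff is cut off right after the `BY_CODE`
   declaration.

```java
// Sketch only: everything after the BY_CODE declaration is assumed, not taken from the PR.
final class ExplicitCodeSketch {

  enum Algorithm {
    NO_COMPRESSION(0x00),
    LOWERCASE_ASCII(0x01),
    LZ4(0x02);

    /** Value written to the index; independent of declaration order. */
    final int code;

    Algorithm(int code) {
      this.code = code;
    }

    private static final Algorithm[] BY_CODE = new Algorithm[3];

    static {
      for (Algorithm alg : values()) {
        // Fail fast if two constants are ever given the same code.
        if (BY_CODE[alg.code] != null) {
          throw new AssertionError("duplicate code " + alg.code);
        }
        BY_CODE[alg.code] = alg;
      }
    }

    /** Looks up the algorithm for a code read from the index. */
    static Algorithm byCode(int code) {
      if (code < 0 || code >= BY_CODE.length || BY_CODE[code] == null) {
        throw new IllegalArgumentException("illegal compression code: " + code);
      }
      return BY_CODE[code];
    }
  }
}
```

   With this shape, reordering the constants changes `ordinal()` but not `code`,
   so previously written data still decodes to the same algorithm. The cost is
   the extra field, the lookup table, and the static initializer.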



[GitHub] [lucene-solr] jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362514660
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
 
 Review comment:
   I switched back to the default implementation.



[GitHub] [lucene-solr] dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362483152
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   Exactly. I prefer the explicit code. Gives you control and context. Ordinals are really internal details of a particular implementation of enums.



[GitHub] [lucene-solr] dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362742470
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/compress/LZ4.java
 ##########
 @@ -0,0 +1,397 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.compress;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Objects;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.packed.PackedInts;
+
+/**
+ * LZ4 compression and decompression routines.
+ *
+ * http://code.google.com/p/lz4/
+ * http://fastcompression.blogspot.fr/p/lz4.html
 
 Review comment:
   I wonder: can you put an ASL license header on files that are under a different license (even if it's permissive/ compatible)?



[GitHub] [lucene-solr] jpountz merged pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz merged pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126
 
 
   



[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362481693
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
 
 Review comment:
   I suggest then adding a default implementation that toLowerCase's name().  It would also be helpful to add a comment explaining where this is used; I didn't know.
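
   A minimal sketch of that suggestion, assuming a single enum-level override
   instead of one per constant. The enum name is hypothetical, and the use of
   `Locale.ROOT` is an assumption made here to keep the result independent of
   the default locale.

```java
import java.util.Locale;

// Hypothetical enum; only the constant names come from the quoted diff.
enum AlgorithmNameSketch {
  NO_COMPRESSION,
  LOWERCASE_ASCII,
  LZ4;

  /** One shared override: the constant name in lower case, e.g. "lowercase_ascii". */
  @Override
  public String toString() {
    return name().toLowerCase(Locale.ROOT);
  }
}
```

   This produces the same strings as the three per-constant overrides in the
   diff ("no_compression", "lowercase_ascii", "lz4") without repeating them.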



[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362543015
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/Stats.java
 ##########
 @@ -75,6 +75,17 @@
   /** Total number of bytes used to store term suffixes. */
   public long totalBlockSuffixBytes;
 
+  /**
+   * Number of times each compression method has been used.
+   * 0 = uncompressed
+   * 1 = lowercase_ascii
+   * 2 = LZ4
+   */
+  public final long[] compressionAlgorithms = new long[3];
 
 Review comment:
   Cool that you track this in BlockTree stats!  Did you post the stats somewhere?  Edit: ahh, I see the [cool stats here](https://issues.apache.org/jira/browse/LUCENE-4702?focusedCommentId=17003640&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17003640), thanks. 



[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362542300
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   +1 for the explicit codes too.  Relying on enum ordinals is dangerously fragile ...


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362545794
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/compress/LZ4.java
 ##########
 @@ -0,0 +1,397 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.compress;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Objects;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.packed.PackedInts;
+
+/**
+ * LZ4 compression and decompression routines.
+ *
+ * http://code.google.com/p/lz4/
+ * http://fastcompression.blogspot.fr/p/lz4.html
 
 Review comment:
   Are these also Apache 2.0 licensed?


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362489542
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   BTW I can understand not wanting clients of an enum to call ordinal() on it; they should call a `getCode()` method, and the implementation of that could be ordinal().


[GitHub] [lucene-solr] jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362480952
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
 
 Review comment:
   I wanted to keep the string representations lowercased in Stats#toString, as in the previous iteration of this pull request. I don't feel strongly about it, though, and I don't mind removing these `toString` implementations.


[GitHub] [lucene-solr] dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dweiss commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362516139
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   You are entitled to your opinion, David. But the comment Adrien made still holds: when you reorder, add, or remove enum constants, their ordinals change, and this side effect is worth guarding against. The "code" may be verbose, but it is also explicit and makes a mistake harder to make. I think it's valuable here.
   
   Anyway. My original comment about hardcoded values was only to point out that the code used the same constants in different places. I suggested an enum (like Adrien implemented), but it could just as well be a set of constant integers declared in one place. I don't care about this (but I do care about using or overriding ordinal()...).
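
   The point about ordinals can be illustrated with a minimal sketch. It is not the class from the pull request: `SuffixCompression`, its `code` field and `byCode` are hypothetical names, and only the three code values (0x00, 0x01, 0x02) come from the diff above. The constants are deliberately declared in a different order to show that the lookup keeps working when the ordinals move.

```java
// Hypothetical sketch: an enum whose serialized id is an explicit code rather
// than its ordinal. Names are illustrative, not taken from the PR.
enum SuffixCompression {
  // Declared out of order on purpose: ordinals are 0, 1, 2 here,
  // but the codes written to disk stay pinned to the values below.
  LZ4(0x02),
  NO_COMPRESSION(0x00),
  LOWERCASE_ASCII(0x01);

  // Lookup table indexed by code, filled once at class initialization.
  private static final SuffixCompression[] BY_CODE = new SuffixCompression[3];

  static {
    for (SuffixCompression alg : values()) {
      if (BY_CODE[alg.code] != null) {
        throw new AssertionError("Duplicate code: " + alg.code);
      }
      BY_CODE[alg.code] = alg;
    }
  }

  final int code;

  SuffixCompression(int code) {
    this.code = code;
  }

  /** Returns the constant registered for the given code. */
  static SuffixCompression byCode(int code) {
    if (code < 0 || code >= BY_CODE.length) {
      throw new IllegalArgumentException("Illegal code: " + code);
    }
    return BY_CODE[code];
  }
}
```

   With this layout, `SuffixCompression.byCode(2)` still returns `LZ4` even though `LZ4.ordinal()` is now 0, whereas a table built from `values()` would map code 2 to `LOWERCASE_ASCII`.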


[GitHub] [lucene-solr] jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362534343
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   Oops, I had not seen @dweiss's comment when writing mine.


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362476910
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   Why bother with this -- explicit "code" in constructors (when the enum's intrinsic ordinal will do) and explicit BY_CODE construction when you could do `private static final CompressionAlgorithm[] BY_CODE = values();`


[GitHub] [lucene-solr] jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362461295
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnumFrame.java
 ##########
 @@ -163,15 +179,48 @@ void loadBlock() throws IOException {
     // instead of linear scan to find target term; eg
     // we could have simple array of offsets
 
+    final long startSuffixFP = ste.in.getFilePointer();
     // term suffixes:
-    code = ste.in.readVInt();
-    isLeafBlock = (code & 1) != 0;
-    int numBytes = code >>> 1;
-    if (suffixBytes.length < numBytes) {
-      suffixBytes = new byte[ArrayUtil.oversize(numBytes, 1)];
+    if (version >= BlockTreeTermsReader.VERSION_COMPRESSED_SUFFIXES) {
+      final long codeL = ste.in.readVLong();
+      isLeafBlock = (codeL & 0x04) != 0;
+      final int numSuffixBytes = (int) (codeL >>> 3);
+      if (suffixBytes.length < numSuffixBytes) {
+        suffixBytes = new byte[ArrayUtil.oversize(numSuffixBytes, 1)];
+      }
+      compressionAlg = (int) codeL & 0x03;
 
 Review comment:
   I pushed a change that introduces an enum, does it look better?
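
   For readers following the hunk above, here is a small sketch of the bit layout that `loadBlock()` decodes: the two low bits hold the compression code, bit 2 is the leaf-block flag, and the remaining bits hold the number of suffix bytes. `SuffixHeader` and its methods are hypothetical helpers. The decoding mirrors the masks and shift in the diff, while `encode` is only an assumed inverse, since the writer side is not shown here.

```java
// Hypothetical helper illustrating the header layout read in loadBlock().
// Decoding follows the diff; encode() is an assumed inverse, not PR code.
final class SuffixHeader {

  private SuffixHeader() {}

  /** Packs the three fields into one long, suitable for a vLong. */
  static long encode(int numSuffixBytes, boolean isLeafBlock, int compressionCode) {
    if (numSuffixBytes < 0) {
      throw new IllegalArgumentException("numSuffixBytes must be >= 0");
    }
    if (compressionCode < 0 || compressionCode > 0x03) {
      throw new IllegalArgumentException("compressionCode must fit in 2 bits");
    }
    return ((long) numSuffixBytes << 3) | (isLeafBlock ? 0x04 : 0x00) | compressionCode;
  }

  /** Bits 3 and above: length of the suffix bytes. */
  static int numSuffixBytes(long code) {
    return (int) (code >>> 3);
  }

  /** Bit 2: whether the block only contains terms (no sub-blocks). */
  static boolean isLeafBlock(long code) {
    return (code & 0x04) != 0;
  }

  /** Bits 0 and 1: code of the compression algorithm. */
  static int compressionCode(long code) {
    return (int) (code & 0x03);
  }
}
```

   Two bits leave room for four codes, of which the diff uses three (0, 1 and 2).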


[GitHub] [lucene-solr] jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362759447
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/compress/LZ4.java
 ##########
 @@ -0,0 +1,397 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.compress;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Objects;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.packed.PackedInts;
+
+/**
+ * LZ4 compression and decompression routines.
+ *
+ * http://code.google.com/p/lz4/
+ * http://fastcompression.blogspot.fr/p/lz4.html
 
 Review comment:
   I think some would argue that the Apache header is fine here since it's a rewrite, but I agree it's probably safer to retain the BSD license instead. I changed it.


[GitHub] [lucene-solr] jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362480549
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   I'm fine either way but I know some people strongly prefer decoupling ids - which are used for serialization and shouldn't change - from enum ordinals that might change if new constants are inserted or if values are reordered.


[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.

Posted by GitBox <gi...@apache.org>.
dsmiley commented on a change in pull request #1126: LUCENE-4702: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126#discussion_r362481104
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/codecs/blocktree/CompressionAlgorithm.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.blocktree;
+
+import java.io.IOException;
+
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.compress.LowercaseAsciiCompression;
+
+/**
+ * Compression algorithm used for suffixes of a block of terms.
+ */
+enum CompressionAlgorithm {
+
+  NO_COMPRESSION(0x00) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      in.readBytes(out, 0, len);
+    }
+
+    @Override
+    public String toString() {
+      return "no_compression";
+    }
+  },
+
+  LOWERCASE_ASCII(0x01) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      LowercaseAsciiCompression.decompress(in, out, len);
+    }
+
+    @Override
+    public String toString() {
+      return "lowercase_ascii";
+    }
+  },
+
+  LZ4(0x02) {
+
+    @Override
+    void read(DataInput in, byte[] out, int len) throws IOException {
+      org.apache.lucene.util.compress.LZ4.decompress(in, len, out, 0);
+    }
+
+    @Override
+    public String toString() {
+      return "lz4";
+    }
+  };
+
+  private static final CompressionAlgorithm[] BY_CODE = new CompressionAlgorithm[3];
 
 Review comment:
   I understand that but it sandbags the current implementation with verbosity it doesn't need.  We can adjust in the future if we actually need to.
