
Posted to dev@lucene.apache.org by Dmitry Serebrennikov <dm...@earthlink.net> on 2001/10/11 20:44:47 UTC

Added comments to InputStream and OutputStream

I figured that I might as well be adding comments as I am reading and 
figuring out the code.
One thing I was not clear on - characters are stored with 1 to 3 bytes. 
Is that sufficient to represent all Unicode characters? I thought 
Unicode was four bytes.

Index: InputStream.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/store/InputStream.java,v
retrieving revision 1.1.1.1
diff -w -u -r1.1.1.1 InputStream.java
--- InputStream.java    2001/09/18 16:29:59     1.1.1.1
+++ InputStream.java    2001/10/11 18:37:23
@@ -60,8 +60,6 @@
   Abstract class for input from a file in a Directory.
   @author Doug Cutting
 */
-
-/** A random-access input stream */
 abstract public class InputStream implements Cloneable {
   final static int BUFFER_SIZE = OutputStream.BUFFER_SIZE;

@@ -81,6 +79,7 @@
     return buffer[bufferPosition++];
   }

+  /** InputStream-like methods @see java.io.InputStream */
   public final void readBytes(byte[] b, int offset, int len)
        throws IOException {
     if (len < BUFFER_SIZE) {
@@ -97,11 +96,22 @@
     }
   }

+  /** Read an integer from the stream. The integer must have been written
+   *  by a call to OutputStream.writeInt. It is stored as four bytes, from most to least
+   *  significant.
+   */
   public final int readInt() throws IOException {
     return ((readByte() & 0xFF) << 24) | ((readByte() & 0xFF) << 16)
          | ((readByte() & 0xFF) <<  8) |  (readByte() & 0xFF);
   }

+  /** Read a compressed integer from the stream. The integer must have been
+   *  written by a call to OutputStream.writeVInt. It is stored as a series of bytes, from
+   *  least significant to the most significant. Each byte contains 7 bits of data,
+   *  and the 8th (0x80) bit is set on every byte except the last one. With this
+   *  format, smaller integers occupy only one byte, larger ones two bytes, and
+   *  so on up to 5 bytes.
+   */
   public final int readVInt() throws IOException {
     byte b = readByte();
     int i = b & 0x7F;
@@ -112,10 +122,18 @@
     return i;
   }

+  /** Read a long from the stream. The long must have been written by a call to
+   *  OutputStream.writeLong. It is stored as 8 bytes, from most significant to the least
+   *  significant.
+   */
   public final long readLong() throws IOException {
     return (((long)readInt()) << 32) | (readInt() & 0xFFFFFFFFL);
   }

+  /** Read a compressed long from the stream. The long must have been written by
+   *  a call to OutputStream.writeVLong. It is stored similarly to the VInt, but may occupy
+   *  1 to 9 bytes for non-negative values.
+   */
   public final long readVLong() throws IOException {
     byte b = readByte();
     long i = b & 0x7F;
@@ -126,6 +144,10 @@
     return i;
   }

+  /** Read a string from the stream. The string must have been written by a call
+   *  to OutputStream.writeString. It is stored as a VInt (see readVInt)
+   *  indicating the string size, followed by that many chars (see readChars).
+   */
   public final String readString() throws IOException {
     int length = readVInt();
     if (chars == null || length > chars.length)
@@ -134,6 +156,12 @@
     return new String(chars, 0, length);
   }

+  /** Read an array of characters, placing them into the provided buffer.
+   *  The characters read are placed into the array starting at index <i>start</i>
+   *  and continuing for <i>length</i> characters. The characters must have been
+   *  written with a call to OutputStream.writeChars. Each character is stored
+   *  using one, two, or three bytes, depending on the value of the character.
+   */
   public final void readChars(char[] buffer, int start, int length)
        throws IOException {
     final int end = start + length;
@@ -179,6 +207,7 @@
     return bufferStart + bufferPosition;
   }

+  /** RandomAccessFile-like methods @see java.io.RandomAccessFile */
   public final void seek(long pos) throws IOException {
     if (pos >= bufferStart && pos < (bufferStart + bufferLength))
       bufferPosition = (int)(pos - bufferStart);  // seek within buffer
@@ -191,10 +220,16 @@
   }
   abstract protected void seekInternal(long pos) throws IOException;

+  /** RandomAccessFile-like methods @see java.io.RandomAccessFile */
   public final long length() {
     return length;
   }

+  /** Create a clone of this stream. The clone provides access to the same
+   *  underlying descriptor as the original file, however it maintains its own
+   *  buffer and file position so it can be used concurrently with the original
+   *  file and other clones.
+   */
   public Object clone() {
     InputStream clone = null;
     try {
Index: OutputStream.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/store/OutputStream.java,v
retrieving revision 1.1.1.1
diff -w -u -r1.1.1.1 OutputStream.java
--- OutputStream.java   2001/09/18 16:29:59     1.1.1.1
+++ OutputStream.java   2001/10/11 18:37:23
@@ -60,8 +60,6 @@
   Abstract class for output from a file in a Directory.
   @author Doug Cutting
 */
-
-/** A random-access output stream */
 abstract public class OutputStream {
   final static int BUFFER_SIZE = 1024;

@@ -76,11 +74,15 @@
     buffer[bufferPosition++] = b;
   }

+  /** OutputStream-like methods @see java.io.OutputStream */
   public final void writeBytes(byte[] b, int length) throws IOException {
     for (int i = 0; i < length; i++)
       writeByte(b[i]);
   }

+  /** Write an integer into the stream. The integer can be read by calling
+   *  InputStream.readInt. It is stored using four bytes.
+   */
   public final void writeInt(int i) throws IOException {
     writeByte((byte)(i >> 24));
     writeByte((byte)(i >> 16));
@@ -88,6 +90,10 @@
     writeByte((byte) i);
   }

+  /** Write a compressed integer into the stream. The integer can be read by
+   *  calling InputStream.readVInt. It is stored using from one to five bytes,
+   *  depending on the value of the integer.
+   */
   public final void writeVInt(int i) throws IOException {
     while ((i & ~0x7F) != 0) {
       writeByte((byte)((i & 0x7f) | 0x80));
@@ -96,11 +102,18 @@
     writeByte((byte)i);
   }

+  /** Write a long into the stream. The long can be read by calling InputStream.readLong.
+   *  It is stored using 8 bytes.
+   */
   public final void writeLong(long i) throws IOException {
     writeInt((int) (i >> 32));
     writeInt((int) i);
   }

+  /** Write a compressed long into the stream. The long can be read by calling
+   *  InputStream.readVLong. It is stored using one to nine bytes for non-negative values,
+   *  depending on the value of the long.
+   */
   public final void writeVLong(long i) throws IOException {
     while ((i & ~0x7F) != 0) {
       writeByte((byte)((i & 0x7f) | 0x80));
@@ -109,12 +122,20 @@
     writeByte((byte)i);
   }

+  /** Write a string into the stream. The string can be read by calling
+   *  InputStream.readString. It is stored as a VInt representing the number of
+   *  characters, followed by that many characters (see writeChars).
+   */
   public final void writeString(String s) throws IOException {
     int length = s.length();
     writeVInt(length);
     writeChars(s, 0, length);
   }

+  /** Write an array of characters into the stream. The array can be read by
+   *  calling InputStream.readChars. Each character is stored using from one to
+   *  three bytes depending on the value of the character.
+   */
   public final void writeChars(String s, int start, int length)
        throws IOException {
     final int end = start + length;
@@ -141,6 +162,7 @@

   abstract protected void flushBuffer(byte[] b, int len) throws IOException;

+  /** Flush and close the stream. */
   public void close() throws IOException {
     flush();
   }
@@ -150,11 +172,13 @@
     return bufferStart + bufferPosition;
   }

+  /** RandomAccessFile-like methods @see java.io.RandomAccessFile */
   public void seek(long pos) throws IOException {
     flush();
     bufferStart = pos;
   }

+  /** RandomAccessFile-like methods @see java.io.RandomAccessFile */
   abstract public long length() throws IOException;
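
The variable-length integer format described in the readVInt/writeVInt comments can be tried out in isolation. What follows is only a minimal sketch, not Lucene code: VIntSketch, encode, and decode are hypothetical names, the loops are modeled on the fragments visible in the patch, and the unsigned shift in encode is an assumption because that line is not shown in the diff.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; VIntSketch and its methods are hypothetical
// names and are not part of Lucene.
class VIntSketch {

  /** Encode i as 7-bit groups, least significant first.
   *  The 0x80 bit is set on every byte except the last. */
  static byte[] encode(int i) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);  // low 7 bits plus "more bytes follow" flag
      i >>>= 7;                      // assumed: unsigned shift (not visible in the patch)
    }
    out.write(i);                    // final byte, high bit clear
    return out.toByteArray();
  }

  /** Decode a value produced by encode. */
  static int decode(byte[] bytes) {
    int pos = 0;
    byte b = bytes[pos++];
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = bytes[pos++];
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  public static void main(String[] args) {
    int[] samples = {0, 127, 128, 16383, 16384, Integer.MAX_VALUE};
    for (int v : samples) {
      byte[] enc = encode(v);
      System.out.println(v + " -> " + enc.length + " byte(s), decodes to " + decode(enc));
    }
  }
}
```

Under these assumptions, values below 128 take one byte and a full 32-bit value takes five, since 32 bits do not fit into four 7-bit groups.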

Re: Added comments to InputStream and OutputStream

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.

>
>
>I'm a bit confused about this discussion; Java does a great job of
>hiding character encodings from you. Is Lucene turning byte arrays
>into character arrays somewhere?
>
Yes, but I think it does a good job.
See org.apache.lucene.store.{InputStream,OutputStream}.

>
>                                                     nelson@monkey.org
>.       .      .     .    .   .  . . http://www.media.mit.edu/~nelson/
>

Re: Re: Added comments to InputStream and OutputStream

Posted by Nelson Minar <ne...@monkey.org>.

>Unicode is 16 bits.  UTF-8 needs 1 byte for a 7-bit character (ASCII),
>2 bytes for an 11-bit character (including ISO-8859-1), and 3 bytes for
>a 16-bit character.

This is partly true. Unicode itself is coding independent. I believe
Unicode is currently defined as having up to 2^31 positions, although
the current plan is for somewhere between 2^20 and 2^21 characters.
(2^16 characters was the old Unicode standard - dropped when someone
pointed out that Chinese alone has more than 2^16 characters).

Unicode needs to be encoded somehow as a sequence of words. UTF-8
encodes Unicode as sequences of 8 bit words - 1, 2, or 3 for anything
in the 16 bit range, 4 beyond that. UTF-16 encodes it as a sequence of
16 bit words: 1 or 2. UTF-32 encodes it as a sequence of 32 bit words,
always 1 per character.
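
The unit counts for the three encodings can be checked with the JDK itself. This is just a sketch: EncodingSizes and its methods are hypothetical names, and java.nio.charset.StandardCharsets is assumed to be available (it is in any modern JDK, though it did not exist in 2001).

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; EncodingSizes is a hypothetical name.
class EncodingSizes {

  /** Number of 8-bit units UTF-8 needs for s. */
  static int utf8Bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  /** Number of 16-bit units UTF-16 needs for s (what String.length() reports). */
  static int utf16Units(String s) {
    return s.length();
  }

  /** Number of 32-bit units UTF-32 needs for s: one per code point. */
  static int utf32Units(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    String[] samples = {
      "A",             // U+0041, ASCII
      "\u00E9",        // U+00E9, a Latin-1 letter
      "\u4E2D",        // U+4E2D, a CJK ideograph
      "\uD83D\uDE00"   // U+1F600, outside the 16-bit range (surrogate pair)
    };
    for (String s : samples) {
      System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
          + ": UTF-8 " + utf8Bytes(s)
          + ", UTF-16 " + utf16Units(s)
          + ", UTF-32 " + utf32Units(s));
    }
  }
}
```

A Java char is a single 16-bit UTF-16 unit, so a character outside the 16-bit range shows up as two chars in a String.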

UTF-8 is the most common encoding. It handles ASCII easily (fits in
1 word); the rest of ISO-Latin-1 takes 2.

Unicode is cool - if you want to learn more, see
  http://www.unicode.org/
  http://www.unicode.org/unicode/faq/utf_bom.html


I'm a bit confused about this discussion; Java does a great job of
hiding character encodings from you. Is Lucene turning byte arrays
into character arrays somewhere?

                                                     nelson@monkey.org
.       .      .     .    .   .  . . http://www.media.mit.edu/~nelson/

Re: Added comments to InputStream and OutputStream

Posted by jo...@teamware.co.uk.

I asked my colleague your question on Unicode & bytes - this was his reply :

Unicode is 16 bits.  UTF-8 needs 1 byte for a 7-bit character (ASCII),
2 bytes for an 11-bit character (including ISO-8859-1), and 3 bytes for
a 16-bit character.

       DaveS


Joanne
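
The 7-bit / 11-bit / 16-bit rule quoted above maps directly onto the standard UTF-8 bit layout for a single 16-bit char. The following is a minimal sketch of that layout, not the Lucene implementation: CharBytesSketch and encodeChar are hypothetical names, and the real writeChars (whose body is not shown in the patch) may treat some values differently.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch; CharBytesSketch is a hypothetical name and this is
// the generic UTF-8 layout, not necessarily what Lucene's writeChars does.
class CharBytesSketch {

  /** Encode one 16-bit char using one, two, or three bytes. */
  static byte[] encodeChar(char c) {
    int code = c;
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    if (code <= 0x7F) {                        // up to 7 bits:  0xxxxxxx
      out.write(code);
    } else if (code <= 0x7FF) {                // up to 11 bits: 110xxxxx 10xxxxxx
      out.write(0xC0 | (code >> 6));
      out.write(0x80 | (code & 0x3F));
    } else {                                   // up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
      out.write(0xE0 | (code >> 12));
      out.write(0x80 | ((code >> 6) & 0x3F));
      out.write(0x80 | (code & 0x3F));
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    char[] samples = {'A', '\u00E9', '\u4E2D'};
    for (char c : samples) {
      StringBuilder hex = new StringBuilder();
      for (byte b : encodeChar(c)) {
        hex.append(String.format("%02X ", b & 0xFF));
      }
      System.out.println("U+" + String.format("%04X", (int) c) + " -> " + hex.toString().trim());
    }
  }
}
```

Encoding char by char like this never needs a fourth byte, because each half of a surrogate pair is written as its own three-byte sequence. That is not strictly valid UTF-8 for characters beyond the 16-bit range, but it round-trips as long as the reader uses the same scheme.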



Dmitry Serebrennikov  (11/10/2001  18:44):
>I figured that I might as well be adding comments as I am reading and
>figuring out the code.
>One thing I was not clear on - characters are stored with 1 to 3 bytes.
>Is that sufficient to represent all Unicode characters? I thought
>Unicode was four bytes.
>
>Index: InputStream.java
>===================================================================
>RCS file:
>/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/store/InputStream.java,v
>retrieving revision 1.1.1.1