Posted to dev@lucene.apache.org by Bernhard Messer <Be...@intrafind.de> on 2004/08/06 09:52:57 UTC

optimized disk usage when creating a compound index

hi developers,

I made some measurements on Lucene disk usage during index creation. 
It's no surprise that during index creation, especially during index 
optimization, more disk space is needed than the final index will 
occupy. What I didn't expect was such a large difference in disk usage 
depending on whether the compound file option is switched on or off. 
With the compound file option enabled, the disk usage during index 
creation is more than 3 times the final index size. This can be a pain 
in the neck for projects like Nutch, where huge datasets are indexed. 
The growth comes from the fact that SegmentMerger creates the complete 
compound file first, before deleting the original, now unused files.
So I patched the SegmentMerger and CompoundFileWriter classes so that 
each original file is deleted immediately after its data has been 
copied into the compound file. The result is that the required disk 
space drops from roughly a factor of 3 to a factor of 2.
The change also requires some modifications to the TestCompoundFile 
class. Several test methods compared an original file to its 
counterpart in the compound file; with the modified SegmentMerger and 
CompoundFileWriter, the original file is already deleted and can no 
longer be opened.
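
In essence, the patched copy loop in CompoundFileWriter.close() deletes each 
source file as soon as its bytes have been appended to the compound stream. A 
minimal sketch of the idea (the class and helper names below are mine and only 
illustrative; the actual change is in the attached diffs):

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.InputStream;
import org.apache.lucene.store.OutputStream;

final class CompoundCopySketch {
    /** Copy one source file into the open compound stream, then delete it
     *  right away, so the plain file and its copy never coexist for long. */
    static void copyAndDelete(Directory dir, String file, OutputStream os,
                              byte[] buffer) throws IOException {
        InputStream is = dir.openFile(file);
        try {
            long remainder = is.length();
            while (remainder > 0) {
                int len = (int) Math.min(buffer.length, remainder);
                is.readBytes(buffer, 0, len);
                os.writeBytes(buffer, len);
                remainder -= len;
            }
        } finally {
            is.close();
        }
        // the plain file is no longer needed once its data is in the compound stream
        dir.deleteFile(file);
    }
}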

Here are some statistics about disk usage during index creation:

compound option is off:
final index size: 380 KB         max. diskspace used: 408 KB
final index size: 11079 KB       max. diskspace used: 11381 KB
final index size: 204148 KB      max. diskspace used: 20739 KB

using compound index:
final index size: 380 KB         max. diskspace used: 1145 KB
final index size: 11079 KB       max. diskspace used: 33544 KB
final index size: 204148 KB      max. diskspace used: 614977 KB

using compound index with patch:
final index size: 380 KB         max. diskspace used: 777 KB
final index size: 11079 KB       max. diskspace used: 22464 KB
final index size: 204148 KB      max. diskspace used: 410829 KB

The change was tested under Windows and Linux without any negative side 
effects. All JUnit test cases pass. In the attachment you'll find all 
the necessary files:

SegmentMerger.java
CompoundFileWriter.java
TestCompoundFile.java

SegmentMerger.diff
CompoundFileWriter.diff
TestCompoundFile.diff

keep moving
Bernhard



Re: optimized disk usage when creating a compound index

Posted by Christoph Goller <go...@detego-software.de>.
Daniel Naber wrote:
> On Sunday 08 August 2004 15:16, Christoph Goller wrote:
> 
> 
>>If compound files are used, Lucene needs up to 3 times the disk space
>>(during indexing) that is required by the final index.
> 
> 
> Talking about the compound index: there's a variable "open" in 
> CompoundFileReader with the comment "reference count", but it's never 
> used. I assume this variable can just be removed?

Yes, looks like it can be removed.

Christoph





Re: optimized disk usage when creating a compound index

Posted by Daniel Naber <da...@t-online.de>.
On Sunday 08 August 2004 15:16, Christoph Goller wrote:

> If compound files are used, Lucene needs up to 3 times the disk space
> (during indexing) that is required by the final index.

Talking about the compound index: there's a variable "open" in 
CompoundFileReader with the comment "reference count", but it's never 
used. I assume this variable can just be removed?

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: optimized disk usage when creating a compound index

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Christoph Goller wrote:

> Bernhard Messer wrote:
>
>> Hi Christoph,
>>
>> just reviewed the TestCompoundFile.java and you were absolutely right 
>> when saying that the test will fail on Windows. Now the test is 
>> changed in a way that a second file with identical data is created. 
>> This file can be used in the test cases to make the comparisons 
>> against the compound store. Now the modified test runs fine on 
>> Windows and Linux platforms.
>>
>> In the attachment you'll find the new TestCompoundFile source.
>>
>> hope this helps
>> Bernhard
>
>
> Hi Bernhard,
>
> I reconsidered your changes again.
> The problem that is solved is the following:
>
> If compound files are used, Lucene needs up to 3 times the disk space
> (during indexing) that is required by the final index. The reason is
> that during a merge of mergeFactor segments, these segments are
> duplicated by merging them into a new one, and then the new segment is
> duplicated again while generating its compound file.
>
> You solved the problem by deleting individual files from a segment
> earlier, while building the compound file. However, this means that the
> CompoundFileWriter now deletes files in its close operation. This is
> not necessarily what one expects when one uses a CompoundFileWriter. It
> should only generate a compound file, not delete the original files.
> Therefore you had to change the CompoundFileWriter tests accordingly!

I'm sorry for jumping into this late, but my impression was that the 
files being deleted were those of the new segment, not the files of the 
segments being merged. This, I think, would be OK, because if the 
operation fails, the old files are still there and the new segment is 
never entered into the "segments" file, so the index remains 
uncorrupted. However, if we deleted the previous segments first, we'd 
have no way of recovering from a failure during the merge process.
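
To spell out the invariant this argument relies on, here is a schematic of 
the merge ordering (this is not actual IndexWriter code; the class and method 
names are made up purely for illustration):

import java.io.IOException;

class MergeOrderingSketch {
    void merge() throws IOException {
        writeNewSegmentFiles();   // 1. create the merged segment's files
        buildCompoundFile();      // 2. may delete the NEW plain files while copying them
        commitSegmentsFile();     // 3. publish the new segment in the "segments" file
        deleteOldSegments();      // 4. only now remove the merged-away segments
        // A failure before step 3 leaves the old "segments" file and all old
        // segment files untouched, so the index stays consistent.
    }
    void writeNewSegmentFiles() throws IOException {}
    void buildCompoundFile()    throws IOException {}
    void commitSegmentsFile()   throws IOException {}
    void deleteOldSegments()    throws IOException {}
}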

>
> My idea now is to change IndexWriter so that during a merge all old
> segments are deleted before the compound file is generated. I think
> this also avoids the factor of 3 and gives a maximum factor of 2 in
> disk space. I committed my changes. Could you do a test, as you did
> with your patch, to verify whether my changes have the desired outcome
> too? That would be great,

    I'm sorry, Christoph, but I don't think these changes will work 
right... I just looked through the current CVS and it seems to me that 
there is a problem: the segmentInfos.write() calls in the IndexWriter 
end up replacing the "segments" file with a new one that puts the newly 
created segment on-line. Now, if writing of the compound file fails, we 
end up with a corrupt index.
    Another problem is that the writing of the compound file now happens 
under the commit.lock, whereas before it happened outside of it. This is 
potentially a very lengthy operation and will prevent any new 
IndexReaders from being created for a long time, possibly minutes!
    And taking the new call to createCompoundFile() out of the lock 
won't do either, because that would mean that IndexReaders could be 
created during this time, but they would be confused, since they would 
go after the new segment and try to open a half-constructed ".cfs" file.

    Again, I'm sorry, but I think I have to -1 these changes.

    -1.



Dmitry.

>
> Christoph
>
>
>
>




Re: optimized disk usage when creating a compound index

Posted by Bernhard Messer <Be...@intrafind.de>.
Christoph,

very clever implementation, and bad news for all disk manufacturers ;-). 
The patch works as expected and reduces the max. disk usage in the same 
way as announced in the first message introducing this patch.

thanks
Bernhard

Christoph Goller wrote:

> Bernhard Messer wrote:
>
>> Hi Christoph,
>>
>> just reviewed the TestCompoundFile.java and you were absolutely right 
>> when saying that the test will fail on Windows. Now the test is 
>> changed in a way that a second file with identical data is created. 
>> This file can be used in the test cases to make the comparisons 
>> against the compound store. Now the modified test runs fine on 
>> Windows and Linux platforms.
>>
>> In the attachment you'll find the new TestCompoundFile source.
>>
>> hope this helps
>> Bernhard
>
>
> Hi Bernhard,
>
> I reconsidered your changes again.
> The problem that is solved is the following:
>
> If compound files are used, Lucene needs up to 3 times the disk space
> (during indexing) that is required by the final index. The reason is
> that during a merge of mergeFactor segments, these segments are
> duplicated by merging them into a new one, and then the new segment is
> duplicated again while generating its compound file.
>
> You solved the problem by deleting individual files from a segment
> earlier, while building the compound file. However, this means that the
> CompoundFileWriter now deletes files in its close operation. This is
> not necessarily what one expects when one uses a CompoundFileWriter. It
> should only generate a compound file, not delete the original files.
> Therefore you had to change the CompoundFileWriter tests accordingly!
>
> My idea now is to change IndexWriter so that during a merge all old
> segments are deleted before the compound file is generated. I think
> this also avoids the factor of 3 and gives a maximum factor of 2 in
> disk space. I committed my changes. Could you do a test, as you did
> with your patch, to verify whether my changes have the desired outcome
> too? That would be great,
>
> Christoph
>
>
>




Re: optimized disk usage when creating a compound index

Posted by Christoph Goller <go...@detego-software.de>.
Bernhard Messer wrote:
> Hi Christoph,
> 
> just reviewed the TestCompoundFile.java and you were absolutely right 
> when saying that the test will fail on Windows. Now the test is changed 
> in a way that a second file with identical data is created. This file 
> can be used in the test cases to make the comparisons against the 
> compound store. Now the modified test runs fine on Windows and Linux 
> platforms.
> 
> In the attachment you'll find the new TestCompoundFile source.
> 
> hope this helps
> Bernhard

Hi Bernhard,

I reconsidered your changes again.
The problem that is solved is the following:

If compound files are used, Lucene needs up to 3 times the disk space (during
indexing) that is required by the final index. The reason is that during a
merge of mergeFactor segments, these segments are duplicated by merging them
into a new one, and then the new segment is duplicated again while generating
its compound file.

You solved the problem by deleting individual files from a segment earlier,
while building the compound file. However, this means that the
CompoundFileWriter now deletes files in its close operation. This is not
necessarily what one expects when one uses a CompoundFileWriter. It should
only generate a compound file, not delete the original files. Therefore you
had to change the CompoundFileWriter tests accordingly!

My idea now is to change IndexWriter so that during a merge all old segments
are deleted before the compound file is generated. I think this also avoids
the factor of 3 and gives a maximum factor of 2 in disk space. I committed my
changes. Could you do a test, as you did with your patch, to verify whether my
changes have the desired outcome too? That would be great,

Christoph
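
For a rough feel for the numbers, the back-of-the-envelope sketch below 
(purely illustrative; it reuses the 204148 KB index size from the measurements 
earlier in the thread) shows why deleting the old segments before building the 
compound file lowers the peak from about 3 times to about 2 times the final 
index size:

public class CompoundDiskPeakSketch {
    public static void main(String[] args) {
        long merged = 204148;          // KB, size of the newly merged segment
        long oldSegments = merged;     // the source segments hold roughly the same data
        // old ordering: old segments + new segment + new .cfs all coexist
        long peakBefore = oldSegments + merged + merged;
        // new ordering: the old segments are gone before the .cfs is written
        long peakAfter = Math.max(oldSegments + merged, merged + merged);
        System.out.println("peak with old ordering: ~" + peakBefore + " KB (~3x)");
        System.out.println("peak with new ordering: ~" + peakAfter + " KB (~2x)");
    }
}

The two printed peaks roughly match the measured 614977 KB and 410829 KB above.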




Re: optimized disk usage when creating a compound index

Posted by Bernhard Messer <Be...@intrafind.de>.
Hi Christoph,

just reviewed the TestCompoundFile.java and you were absolutely right 
when saying that the test will fail on Windows. Now the test is changed 
in a way that a second file with identical data is created. This file 
can be used in the test cases to make the comparisons against the 
compound store. Now the modified test runs fine on Windows and Linux 
platforms.

In the attachment you'll find the new TestCompoundFile source.

hope this helps
Bernhard

Christoph Goller wrote:

> It will not be lost. I have already reviewed it.
> There are some open issues concerning the changes in
> TestCompoundFile that I want to discuss with Bernhard,
> and then (hopefully next week) I will commit it.
>
> Christoph
>
> Erik Hatcher wrote:
>
>> Bernhard,
>>
>> Impressive work. In order to prevent this from being lost in e-mail,
>> could you please create a new Bugzilla issue for each of your great
>> patches and attach the differences as CVS patches (cvs diff -Nu)?
>>
>> Many thanks for these contributions.
>>
>>     Erik
>>
>> On Aug 6, 2004, at 3:52 AM, Bernhard Messer wrote:
>>
>>> hi developers,
>>>
>>> I made some measurements on Lucene disk usage during index creation.
>>> It's no surprise that during index creation, especially during index
>>> optimization, more disk space is needed than the final index will
>>> occupy. What I didn't expect was such a large difference in disk usage
>>> depending on whether the compound file option is switched on or off.
>>> With the compound file option enabled, the disk usage during index
>>> creation is more than 3 times the final index size. This can be a pain
>>> in the neck for projects like Nutch, where huge datasets are indexed.
>>> The growth comes from the fact that SegmentMerger creates the complete
>>> compound file first, before deleting the original, now unused files.
>>> So I patched the SegmentMerger and CompoundFileWriter classes so that
>>> each original file is deleted immediately after its data has been
>>> copied into the compound file. The result is that the required disk
>>> space drops from roughly a factor of 3 to a factor of 2.
>>> The change also requires some modifications to the TestCompoundFile
>>> class. Several test methods compared an original file to its
>>> counterpart in the compound file; with the modified SegmentMerger and
>>> CompoundFileWriter, the original file is already deleted and can no
>>> longer be opened.
>>>
>>> Here are some statistics about disk usage during index creation:
>>>
>>> compound option is off:
>>> final index size: 380 KB         max. diskspace used: 408 KB
>>> final index size: 11079 KB       max. diskspace used: 11381 KB
>>> final index size: 204148 KB      max. diskspace used: 20739 KB
>>>
>>> using compound index:
>>> final index size: 380 KB         max. diskspace used: 1145 KB
>>> final index size: 11079 KB       max. diskspace used: 33544 KB
>>> final index size: 204148 KB      max. diskspace used: 614977 KB
>>>
>>> using compound index with patch:
>>> final index size: 380 KB         max. diskspace used: 777 KB
>>> final index size: 11079 KB       max. diskspace used: 22464 KB
>>> final index size: 204148 KB      max. diskspace used: 410829 KB
>>>
>>> The change was tested under Windows and Linux without any negative
>>> side effects. All JUnit test cases pass. In the attachment you'll
>>> find all the necessary files:
>>>
>>> SegmentMerger.java
>>> CompoundFileWriter.java
>>> TestCompoundFile.java
>>>
>>> SegmentMerger.diff
>>> CompoundFileWriter.diff
>>> TestCompoundFile.diff
>>>
>>> keep moving
>>> Bernhard
>>>
>>>
>>> Index: src/java/org/apache/lucene/index/CompoundFileWriter.java
>>> ===================================================================
>>> RCS file:  
>>> /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/ 
>>> CompoundFileWriter.java,v
>>> retrieving revision 1.3
>>> diff -r1.3 CompoundFileWriter.java
>>> 163a164,166
>>>
>>>>
>>>>                 // immediately delete the copied file to save disk space
>>>>                 directory.deleteFile((String) fe.file);
>>>
>>>
>>> package org.apache.lucene.index;
>>>
>>> /**
>>>  * Copyright 2004 The Apache Software Foundation
>>>  *
>>>  * Licensed under the Apache License, Version 2.0 (the "License");
>>>  * you may not use this file except in compliance with the License.
>>>  * You may obtain a copy of the License at
>>>  *
>>>  *     http://www.apache.org/licenses/LICENSE-2.0
>>>  *
>>>  * Unless required by applicable law or agreed to in writing, software
>>>  * distributed under the License is distributed on an "AS IS" BASIS,
>>>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
>>> implied.
>>>  * See the License for the specific language governing permissions and
>>>  * limitations under the License.
>>>  */
>>>
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.store.OutputStream;
>>> import org.apache.lucene.store.InputStream;
>>> import java.util.LinkedList;
>>> import java.util.HashSet;
>>> import java.util.Iterator;
>>> import java.io.IOException;
>>>
>>>
>>> /**
>>>  * Combines multiple files into a single compound file.
>>>  * The file format:<br>
>>>  * <ul>
>>>  *     <li>VInt fileCount</li>
>>>  *     <li>{Directory}
>>>  *         fileCount entries with the following structure:</li>
>>>  *         <ul>
>>>  *             <li>long dataOffset</li>
>>>  *             <li>UTFString extension</li>
>>>  *         </ul>
>>>  *     <li>{File Data}
>>>  *         fileCount entries with the raw data of the corresponding  
>>> file</li>
>>>  * </ul>
>>>  *
>>>  * The fileCount integer indicates how many files are contained in  
>>> this compound
>>>  * file. The {directory} that follows has that many entries. Each  
>>> directory entry
>>>  * contains an encoding identifier, a long pointer to the start of 
>>> this file's
>>>  * data section, and a UTF String with that file's extension.
>>>  *
>>>  * @author Dmitry Serebrennikov
>>>  * @version $Id: CompoundFileWriter.java,v 1.3 2004/03/29 22:48:02  
>>> cutting Exp $
>>>  */
>>> final class CompoundFileWriter {
>>>
>>>     private static final class FileEntry {
>>>         /** source file */
>>>         String file;
>>>
>>>         /** temporary holder for the start of directory entry for 
>>> this  file */
>>>         long directoryOffset;
>>>
>>>         /** temporary holder for the start of this file's data 
>>> section  */
>>>         long dataOffset;
>>>     }
>>>
>>>
>>>     private Directory directory;
>>>     private String fileName;
>>>     private HashSet ids;
>>>     private LinkedList entries;
>>>     private boolean merged = false;
>>>
>>>
>>>     /** Create the compound stream in the specified file. The file  
>>> name is the
>>>      *  entire name (no extensions are added).
>>>      */
>>>     public CompoundFileWriter(Directory dir, String name) {
>>>         if (dir == null)
>>>             throw new IllegalArgumentException("Missing directory");
>>>         if (name == null)
>>>             throw new IllegalArgumentException("Missing name");
>>>
>>>         directory = dir;
>>>         fileName = name;
>>>         ids = new HashSet();
>>>         entries = new LinkedList();
>>>     }
>>>
>>>     /** Returns the directory of the compound file. */
>>>     public Directory getDirectory() {
>>>         return directory;
>>>     }
>>>
>>>     /** Returns the name of the compound file. */
>>>     public String getName() {
>>>         return fileName;
>>>     }
>>>
>>>     /** Add a source stream. If sourceDir is null, it is set to the
>>>      *  same value as the directory where this compound stream exists.
>>>      *  The id is the string by which the sub-stream will be known in the
>>>      *  compound stream. The caller must ensure that the ID is 
>>> unique.  If the
>>>      *  id is null, it is set to the name of the source file.
>>>      */
>>>     public void addFile(String file) {
>>>         if (merged)
>>>             throw new IllegalStateException(
>>>                 "Can't add extensions after merge has been called");
>>>
>>>         if (file == null)
>>>             throw new IllegalArgumentException(
>>>                 "Missing source file");
>>>
>>>         if (! ids.add(file))
>>>             throw new IllegalArgumentException(
>>>                 "File " + file + " already added");
>>>
>>>         FileEntry entry = new FileEntry();
>>>         entry.file = file;
>>>         entries.add(entry);
>>>     }
>>>
>>>     /** Merge files with the extensions added up to now.
>>>      *  All files with these extensions are combined sequentially 
>>> into  the
>>>      *  compound stream. After successful merge, the source files
>>>      *  are deleted.
>>>      */
>>>     public void close() throws IOException {
>>>         if (merged)
>>>             throw new IllegalStateException(
>>>                 "Merge already performed");
>>>
>>>         if (entries.isEmpty())
>>>             throw new IllegalStateException(
>>>                 "No entries to merge have been defined");
>>>
>>>         merged = true;
>>>
>>>         // open the compound stream
>>>         OutputStream os = null;
>>>         try {
>>>             os = directory.createFile(fileName);
>>>
>>>             // Write the number of entries
>>>             os.writeVInt(entries.size());
>>>
>>>             // Write the directory with all offsets at 0.
>>>             // Remember the positions of directory entries so that 
>>> we  can
>>>             // adjust the offsets later
>>>             Iterator it = entries.iterator();
>>>             while(it.hasNext()) {
>>>                 FileEntry fe = (FileEntry) it.next();
>>>                 fe.directoryOffset = os.getFilePointer();
>>>                 os.writeLong(0);    // for now
>>>                 os.writeString(fe.file);
>>>             }
>>>
>>>             // Open the files and copy their data into the stream.
>>>             // Remember the locations of each file's data section.
>>>             byte buffer[] = new byte[1024];
>>>             it = entries.iterator();
>>>             while(it.hasNext()) {
>>>                 FileEntry fe = (FileEntry) it.next();
>>>                 fe.dataOffset = os.getFilePointer();
>>>                 copyFile(fe, os, buffer);
>>>
>>>                 // immediately delete the copied file to save disk space
>>>                 directory.deleteFile((String) fe.file);
>>>             }
>>>
>>>             // Write the data offsets into the directory of the  
>>> compound stream
>>>             it = entries.iterator();
>>>             while(it.hasNext()) {
>>>                 FileEntry fe = (FileEntry) it.next();
>>>                 os.seek(fe.directoryOffset);
>>>                 os.writeLong(fe.dataOffset);
>>>             }
>>>
>>>             // Close the output stream. Set the os to null before  
>>> trying to
>>>             // close so that if an exception occurs during the 
>>> close,  the
>>>             // finally clause below will not attempt to close the  
>>> stream
>>>             // the second time.
>>>             OutputStream tmp = os;
>>>             os = null;
>>>             tmp.close();
>>>
>>>         } finally {
>>>             if (os != null) try { os.close(); } catch (IOException 
>>> e)  { }
>>>         }
>>>     }
>>>
>>>     /** Copy the contents of the file with specified extension into the
>>>      *  provided output stream. Use the provided buffer for moving data
>>>      *  to reduce memory allocation.
>>>      */
>>>     private void copyFile(FileEntry source, OutputStream os, byte  
>>> buffer[])
>>>     throws IOException
>>>     {
>>>         InputStream is = null;
>>>         try {
>>>             long startPtr = os.getFilePointer();
>>>
>>>             is = directory.openFile(source.file);
>>>             long length = is.length();
>>>             long remainder = length;
>>>             int chunk = buffer.length;
>>>
>>>             while(remainder > 0) {
>>>                 int len = (int) Math.min(chunk, remainder);
>>>                 is.readBytes(buffer, 0, len);
>>>                 os.writeBytes(buffer, len);
>>>                 remainder -= len;
>>>             }
>>>
>>>             // Verify that remainder is 0
>>>             if (remainder != 0)
>>>                 throw new IOException(
>>>                     "Non-zero remainder length after copying: " +  
>>> remainder
>>>                     + " (id: " + source.file + ", length: " + length
>>>                     + ", buffer size: " + chunk + ")");
>>>
>>>             // Verify that the output length diff is equal to 
>>> original  file
>>>             long endPtr = os.getFilePointer();
>>>             long diff = endPtr - startPtr;
>>>             if (diff != length)
>>>                 throw new IOException(
>>>                     "Difference in the output file offsets " + diff
>>>                     + " does not match the original file length " +  
>>> length);
>>>
>>>         } finally {
>>>             if (is != null) is.close();
>>>         }
>>>     }
>>> }
>>> Index: src/java/org/apache/lucene/index/SegmentMerger.java
>>> ===================================================================
>>> RCS file:  
>>> /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/ 
>>> SegmentMerger.java,v
>>> retrieving revision 1.11
>>> diff -r1.11 SegmentMerger.java
>>> 151c151
>>> <     // Perform the merge
>>> ---
>>>
>>>>     // Perform the merge. Files will be deleted within  
>>>> CompoundFileWriter.close()
>>>
>>>
>>> 153,158c153
>>> <
>>> <     // Now delete the source files
>>> <     it = files.iterator();
>>> <     while (it.hasNext()) {
>>> <       directory.deleteFile((String) it.next());
>>> <     }
>>> ---
>>>
>>>>
>>> package org.apache.lucene.index;
>>>
>>> /**
>>>  * Copyright 2004 The Apache Software Foundation
>>>  *
>>>  * Licensed under the Apache License, Version 2.0 (the "License");
>>>  * you may not use this file except in compliance with the License.
>>>  * You may obtain a copy of the License at
>>>  *
>>>  *     http://www.apache.org/licenses/LICENSE-2.0
>>>  *
>>>  * Unless required by applicable law or agreed to in writing, software
>>>  * distributed under the License is distributed on an "AS IS" BASIS,
>>>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
>>> implied.
>>>  * See the License for the specific language governing permissions and
>>>  * limitations under the License.
>>>  */
>>>
>>> import java.util.Vector;
>>> import java.util.ArrayList;
>>> import java.util.Iterator;
>>> import java.io.IOException;
>>>
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.store.OutputStream;
>>> import org.apache.lucene.store.RAMOutputStream;
>>>
>>> /**
>>>  * The SegmentMerger class combines two or more Segments, 
>>> represented by an IndexReader ({@link #add}),
>>>  * into a single Segment.  After adding the appropriate readers, 
>>> call  the merge method to combine the
>>>  * segments.
>>>  *<P>
>>>  * If the compoundFile flag is set, then the segments will be 
>>> merged  into a compound file.
>>>  *
>>>  *
>>>  * @see #merge
>>>  * @see #add
>>>  */
>>> final class SegmentMerger {
>>>   private boolean useCompoundFile;
>>>   private Directory directory;
>>>   private String segment;
>>>
>>>   private Vector readers = new Vector();
>>>   private FieldInfos fieldInfos;
>>>
>>>   // File extensions of old-style index files
>>>   private static final String COMPOUND_EXTENSIONS[] = new String[] {
>>>     "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis"
>>>   };
>>>   private static final String VECTOR_EXTENSIONS[] = new String[] {
>>>     "tvx", "tvd", "tvf"
>>>   };
>>>
>>>   /**
>>>    *
>>>    * @param dir The Directory to merge the other segments into
>>>    * @param name The name of the new segment
>>>    * @param compoundFile true if the new segment should use a  
>>> compoundFile
>>>    */
>>>   SegmentMerger(Directory dir, String name, boolean compoundFile) {
>>>     directory = dir;
>>>     segment = name;
>>>     useCompoundFile = compoundFile;
>>>   }
>>>
>>>   /**
>>>    * Add an IndexReader to the collection of readers that are to be  
>>> merged
>>>    * @param reader
>>>    */
>>>   final void add(IndexReader reader) {
>>>     readers.addElement(reader);
>>>   }
>>>
>>>   /**
>>>    *
>>>    * @param i The index of the reader to return
>>>    * @return The ith reader to be merged
>>>    */
>>>   final IndexReader segmentReader(int i) {
>>>     return (IndexReader) readers.elementAt(i);
>>>   }
>>>
>>>   /**
>>>    * Merges the readers specified by the {@link #add} method into 
>>> the  directory passed to the constructor
>>>    * @return The number of documents that were merged
>>>    * @throws IOException
>>>    */
>>>   final int merge() throws IOException {
>>>     int value;
>>>
>>>     value = mergeFields();
>>>     mergeTerms();
>>>     mergeNorms();
>>>
>>>     if (fieldInfos.hasVectors())
>>>       mergeVectors();
>>>
>>>     if (useCompoundFile)
>>>       createCompoundFile();
>>>
>>>     return value;
>>>   }
>>>
>>>   /**
>>>    * close all IndexReaders that have been added.
>>>    * Should not be called before merge().
>>>    * @throws IOException
>>>    */
>>>   final void closeReaders() throws IOException {
>>>     for (int i = 0; i < readers.size(); i++) {  // close readers
>>>       IndexReader reader = (IndexReader) readers.elementAt(i);
>>>       reader.close();
>>>     }
>>>   }
>>>
>>>   private final void createCompoundFile()
>>>           throws IOException {
>>>     CompoundFileWriter cfsWriter =
>>>             new CompoundFileWriter(directory, segment + ".cfs");
>>>
>>>     ArrayList files =
>>>       new ArrayList(COMPOUND_EXTENSIONS.length + fieldInfos.size());
>>>
>>>     // Basic files
>>>     for (int i = 0; i < COMPOUND_EXTENSIONS.length; i++) {
>>>       files.add(segment + "." + COMPOUND_EXTENSIONS[i]);
>>>     }
>>>
>>>     // Field norm files
>>>     for (int i = 0; i < fieldInfos.size(); i++) {
>>>       FieldInfo fi = fieldInfos.fieldInfo(i);
>>>       if (fi.isIndexed) {
>>>         files.add(segment + ".f" + i);
>>>       }
>>>     }
>>>
>>>     // Vector files
>>>     if (fieldInfos.hasVectors()) {
>>>       for (int i = 0; i < VECTOR_EXTENSIONS.length; i++) {
>>>         files.add(segment + "." + VECTOR_EXTENSIONS[i]);
>>>       }
>>>     }
>>>
>>>     // Now merge all added files
>>>     Iterator it = files.iterator();
>>>     while (it.hasNext()) {
>>>       cfsWriter.addFile((String) it.next());
>>>     }
>>>
>>>     // Perform the merge. Files will be deleted within  
>>> CompoundFileWriter.close()
>>>     cfsWriter.close();
>>>
>>>   }
>>>
>>>   /**
>>>    *
>>>    * @return The number of documents in all of the readers
>>>    * @throws IOException
>>>    */
>>>   private final int mergeFields() throws IOException {
>>>     fieldInfos = new FieldInfos();          // merge field names
>>>     int docCount = 0;
>>>     for (int i = 0; i < readers.size(); i++) {
>>>       IndexReader reader = (IndexReader) readers.elementAt(i);
>>>       fieldInfos.addIndexed(reader.getIndexedFieldNames(true), true);
>>>       fieldInfos.addIndexed(reader.getIndexedFieldNames(false), false);
>>>       fieldInfos.add(reader.getFieldNames(false), false);
>>>     }
>>>     fieldInfos.write(directory, segment + ".fnm");
>>>
>>>     FieldsWriter fieldsWriter = // merge field values
>>>             new FieldsWriter(directory, segment, fieldInfos);
>>>     try {
>>>       for (int i = 0; i < readers.size(); i++) {
>>>         IndexReader reader = (IndexReader) readers.elementAt(i);
>>>         int maxDoc = reader.maxDoc();
>>>         for (int j = 0; j < maxDoc; j++)
>>>           if (!reader.isDeleted(j)) {               // skip deleted  
>>> docs
>>>             fieldsWriter.addDocument(reader.document(j));
>>>             docCount++;
>>>           }
>>>       }
>>>     } finally {
>>>       fieldsWriter.close();
>>>     }
>>>     return docCount;
>>>   }
>>>
>>>   /**
>>>    * Merge the TermVectors from each of the segments into the new one.
>>>    * @throws IOException
>>>    */
>>>   private final void mergeVectors() throws IOException {
>>>     TermVectorsWriter termVectorsWriter =
>>>       new TermVectorsWriter(directory, segment, fieldInfos);
>>>
>>>     try {
>>>       for (int r = 0; r < readers.size(); r++) {
>>>         IndexReader reader = (IndexReader) readers.elementAt(r);
>>>         int maxDoc = reader.maxDoc();
>>>         for (int docNum = 0; docNum < maxDoc; docNum++) {
>>>           // skip deleted docs
>>>           if (reader.isDeleted(docNum)) {
>>>             continue;
>>>           }
>>>           termVectorsWriter.openDocument();
>>>
>>>           // get all term vectors
>>>           TermFreqVector[] sourceTermVector =
>>>             reader.getTermFreqVectors(docNum);
>>>
>>>           if (sourceTermVector != null) {
>>>             for (int f = 0; f < sourceTermVector.length; f++) {
>>>               // translate field numbers
>>>               TermFreqVector termVector = sourceTermVector[f];
>>>               termVectorsWriter.openField(termVector.getField());
>>>               String [] terms = termVector.getTerms();
>>>               int [] freqs = termVector.getTermFrequencies();
>>>
>>>               for (int t = 0; t < terms.length; t++) {
>>>                 termVectorsWriter.addTerm(terms[t], freqs[t]);
>>>               }
>>>             }
>>>             termVectorsWriter.closeDocument();
>>>           }
>>>         }
>>>       }
>>>     } finally {
>>>       termVectorsWriter.close();
>>>     }
>>>   }
>>>
>>>   private OutputStream freqOutput = null;
>>>   private OutputStream proxOutput = null;
>>>   private TermInfosWriter termInfosWriter = null;
>>>   private int skipInterval;
>>>   private SegmentMergeQueue queue = null;
>>>
>>>   private final void mergeTerms() throws IOException {
>>>     try {
>>>       freqOutput = directory.createFile(segment + ".frq");
>>>       proxOutput = directory.createFile(segment + ".prx");
>>>       termInfosWriter =
>>>               new TermInfosWriter(directory, segment, fieldInfos);
>>>       skipInterval = termInfosWriter.skipInterval;
>>>       queue = new SegmentMergeQueue(readers.size());
>>>
>>>       mergeTermInfos();
>>>
>>>     } finally {
>>>       if (freqOutput != null) freqOutput.close();
>>>       if (proxOutput != null) proxOutput.close();
>>>       if (termInfosWriter != null) termInfosWriter.close();
>>>       if (queue != null) queue.close();
>>>     }
>>>   }
>>>
>>>   private final void mergeTermInfos() throws IOException {
>>>     int base = 0;
>>>     for (int i = 0; i < readers.size(); i++) {
>>>       IndexReader reader = (IndexReader) readers.elementAt(i);
>>>       TermEnum termEnum = reader.terms();
>>>       SegmentMergeInfo smi = new SegmentMergeInfo(base, termEnum,  
>>> reader);
>>>       base += reader.numDocs();
>>>       if (smi.next())
>>>         queue.put(smi);                  // initialize queue
>>>       else
>>>         smi.close();
>>>     }
>>>
>>>     SegmentMergeInfo[] match = new SegmentMergeInfo[readers.size()];
>>>
>>>     while (queue.size() > 0) {
>>>       int matchSize = 0;              // pop matching terms
>>>       match[matchSize++] = (SegmentMergeInfo) queue.pop();
>>>       Term term = match[0].term;
>>>       SegmentMergeInfo top = (SegmentMergeInfo) queue.top();
>>>
>>>       while (top != null && term.compareTo(top.term) == 0) {
>>>         match[matchSize++] = (SegmentMergeInfo) queue.pop();
>>>         top = (SegmentMergeInfo) queue.top();
>>>       }
>>>
>>>       mergeTermInfo(match, matchSize);          // add new TermInfo
>>>
>>>       while (matchSize > 0) {
>>>         SegmentMergeInfo smi = match[--matchSize];
>>>         if (smi.next())
>>>           queue.put(smi);              // restore queue
>>>         else
>>>           smi.close();                  // done with a segment
>>>       }
>>>     }
>>>   }
>>>
>>>   private final TermInfo termInfo = new TermInfo(); // minimize consing
>>>
>>>   /** Merge one term found in one or more segments. The array  
>>> <code>smis</code>
>>>    *  contains segments that are positioned at the same term.  
>>> <code>N</code>
>>>    *  is the number of cells in the array actually occupied.
>>>    *
>>>    * @param smis array of segments
>>>    * @param n number of cells in the array actually occupied
>>>    */
>>>   private final void mergeTermInfo(SegmentMergeInfo[] smis, int n)
>>>           throws IOException {
>>>     long freqPointer = freqOutput.getFilePointer();
>>>     long proxPointer = proxOutput.getFilePointer();
>>>
>>>     int df = appendPostings(smis, n);          // append posting data
>>>
>>>     long skipPointer = writeSkip();
>>>
>>>     if (df > 0) {
>>>       // add an entry to the dictionary with pointers to prox and 
>>> freq  files
>>>       termInfo.set(df, freqPointer, proxPointer, (int) (skipPointer 
>>> -  freqPointer));
>>>       termInfosWriter.add(smis[0].term, termInfo);
>>>     }
>>>   }
>>>
>>>   /** Process postings from multiple segments all positioned on the
>>>    *  same term. Writes out merged entries into freqOutput and
>>>    *  the proxOutput streams.
>>>    *
>>>    * @param smis array of segments
>>>    * @param n number of cells in the array actually occupied
>>>    * @return number of documents across all segments where this 
>>> term  was found
>>>    */
>>>   private final int appendPostings(SegmentMergeInfo[] smis, int n)
>>>           throws IOException {
>>>     int lastDoc = 0;
>>>     int df = 0;                      // number of docs w/ term
>>>     resetSkip();
>>>     for (int i = 0; i < n; i++) {
>>>       SegmentMergeInfo smi = smis[i];
>>>       TermPositions postings = smi.postings;
>>>       int base = smi.base;
>>>       int[] docMap = smi.docMap;
>>>       postings.seek(smi.termEnum);
>>>       while (postings.next()) {
>>>         int doc = postings.doc();
>>>         if (docMap != null)
>>>           doc = docMap[doc];                      // map around  
>>> deletions
>>>         doc += base;                              // convert to 
>>> merged  space
>>>
>>>         if (doc < lastDoc)
>>>           throw new IllegalStateException("docs out of order");
>>>
>>>         df++;
>>>
>>>         if ((df % skipInterval) == 0) {
>>>           bufferSkip(lastDoc);
>>>         }
>>>
>>>         int docCode = (doc - lastDoc) << 1;      // use low bit to 
>>> flag  freq=1
>>>         lastDoc = doc;
>>>
>>>         int freq = postings.freq();
>>>         if (freq == 1) {
>>>           freqOutput.writeVInt(docCode | 1);      // write doc & freq=1
>>>         } else {
>>>           freqOutput.writeVInt(docCode);      // write doc
>>>           freqOutput.writeVInt(freq);          // write frequency in 
>>> doc
>>>         }
>>>
>>>         int lastPosition = 0;              // write position deltas
>>>         for (int j = 0; j < freq; j++) {
>>>           int position = postings.nextPosition();
>>>           proxOutput.writeVInt(position - lastPosition);
>>>           lastPosition = position;
>>>         }
>>>       }
>>>     }
>>>     return df;
>>>   }
>>>
>>>   private RAMOutputStream skipBuffer = new RAMOutputStream();
>>>   private int lastSkipDoc;
>>>   private long lastSkipFreqPointer;
>>>   private long lastSkipProxPointer;
>>>
>>>   private void resetSkip() throws IOException {
>>>     skipBuffer.reset();
>>>     lastSkipDoc = 0;
>>>     lastSkipFreqPointer = freqOutput.getFilePointer();
>>>     lastSkipProxPointer = proxOutput.getFilePointer();
>>>   }
>>>
>>>   private void bufferSkip(int doc) throws IOException {
>>>     long freqPointer = freqOutput.getFilePointer();
>>>     long proxPointer = proxOutput.getFilePointer();
>>>
>>>     skipBuffer.writeVInt(doc - lastSkipDoc);
>>>     skipBuffer.writeVInt((int) (freqPointer - lastSkipFreqPointer));
>>>     skipBuffer.writeVInt((int) (proxPointer - lastSkipProxPointer));
>>>
>>>     lastSkipDoc = doc;
>>>     lastSkipFreqPointer = freqPointer;
>>>     lastSkipProxPointer = proxPointer;
>>>   }
>>>
>>>   private long writeSkip() throws IOException {
>>>     long skipPointer = freqOutput.getFilePointer();
>>>     skipBuffer.writeTo(freqOutput);
>>>     return skipPointer;
>>>   }
>>>
>>>   private void mergeNorms() throws IOException {
>>>     for (int i = 0; i < fieldInfos.size(); i++) {
>>>       FieldInfo fi = fieldInfos.fieldInfo(i);
>>>       if (fi.isIndexed) {
>>>         OutputStream output = directory.createFile(segment + ".f" + i);
>>>         try {
>>>           for (int j = 0; j < readers.size(); j++) {
>>>             IndexReader reader = (IndexReader) readers.elementAt(j);
>>>             byte[] input = reader.norms(fi.name);
>>>             int maxDoc = reader.maxDoc();
>>>             for (int k = 0; k < maxDoc; k++) {
>>>               byte norm = input != null ? input[k] : (byte) 0;
>>>               if (!reader.isDeleted(k)) {
>>>                 output.writeByte(norm);
>>>               }
>>>             }
>>>           }
>>>         } finally {
>>>           output.close();
>>>         }
>>>       }
>>>     }
>>>   }
>>>
>>> }
>>> Index: src/test/org/apache/lucene/index/TestCompoundFile.java
>>> ===================================================================
>>> RCS file:  
>>> /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/index/ 
>>> TestCompoundFile.java,v
>>> retrieving revision 1.5
>>> diff -r1.5 TestCompoundFile.java
>>> 20a21,24
>>>
>>>> import java.util.Collection;
>>>> import java.util.HashMap;
>>>> import java.util.Iterator;
>>>> import java.util.Map;
>>>
>>>
>>> 197a202,204
>>>
>>>>
>>>>             InputStream expected = dir.openFile(name);
>>>>
>>> 203c210
>>> <             InputStream expected = dir.openFile(name);
>>> ---
>>>
>>>>
>>> 206a214
>>>
>>>>
>>> 220a229,231
>>>
>>>>         InputStream expected1 = dir.openFile("d1");
>>>>         InputStream expected2 = dir.openFile("d2");
>>>>
>>> 227c238
>>> <         InputStream expected = dir.openFile("d1");
>>> ---
>>>
>>>>
>>> 229,231c240,242
>>> <         assertSameStreams("d1", expected, actual);
>>> <         assertSameSeekBehavior("d1", expected, actual);
>>> <         expected.close();
>>> ---
>>>
>>>>         assertSameStreams("d1", expected1, actual);
>>>>         assertSameSeekBehavior("d1", expected1, actual);
>>>>         expected1.close();
>>>
>>>
>>> 234c245
>>> <         expected = dir.openFile("d2");
>>> ---
>>>
>>>>
>>> 236,238c247,249
>>> <         assertSameStreams("d2", expected, actual);
>>> <         assertSameSeekBehavior("d2", expected, actual);
>>> <         expected.close();
>>> ---
>>>
>>>>         assertSameStreams("d2", expected2, actual);
>>>>         assertSameSeekBehavior("d2", expected2, actual);
>>>>         expected2.close();
>>>
>>>
>>> 270,271d280
>>> <         // Now test
>>> <         CompoundFileWriter csw = new CompoundFileWriter(dir,  
>>> "test.cfs");
>>> 275a285,292
>>>
>>>>
>>>>         InputStream[] check = new InputStream[data.length];
>>>>         for (int i=0; i<data.length; i++) {
>>>>            check[i] = dir.openFile(segment + data[i]);
>>>>         }
>>>>
>>>>         // Now test
>>>>         CompoundFileWriter csw = new CompoundFileWriter(dir,  
>>>> "test.cfs");
>>>
>>>
>>> 283d299
>>> <             InputStream check = dir.openFile(segment + data[i]);
>>> 285,286c301,302
>>> <             assertSameStreams(data[i], check, test);
>>> <             assertSameSeekBehavior(data[i], check, test);
>>> ---
>>>
>>>>             assertSameStreams(data[i], check[i], test);
>>>>             assertSameSeekBehavior(data[i], check[i], test);
>>>
>>>
>>> 288c304
>>> <             check.close();
>>> ---
>>>
>>>>             check[i].close();
>>>
>>>
>>> 299c315,316
>>> <     private void setUp_2() throws IOException {
>>> ---
>>>
>>>>     private Map setUp_2() throws IOException {
>>>>             Map streams = new HashMap(20);
>>>
>>>
>>> 303a321,322
>>>
>>>>
>>>>             streams.put("f" + i, dir.openFile("f" + i));
>>>
>>>
>>> 305a325,326
>>>
>>>>
>>>>         return streams;
>>>
>>>
>>> 308c329,336
>>> <
>>> ---
>>>
>>>>     private void closeUp(Map streams) throws IOException {
>>>>         Iterator it = streams.values().iterator();
>>>>         while (it.hasNext()) {
>>>>             InputStream stream = (InputStream)it.next();
>>>>             stream.close();
>>>>         }
>>>>     }
>>>>
>>> 364c392
>>> <         setUp_2();
>>> ---
>>>
>>>>         Map streams = setUp_2();
>>>
>>>
>>> 368c396
>>> <         InputStream expected = dir.openFile("f11");
>>> ---
>>>
>>>>         InputStream expected = (InputStream)streams.get("f11");
>>>
>>>
>>> 410c438,439
>>> <         expected.close();
>>> ---
>>>
>>>>         closeUp(streams);
>>>>
>>> 418c447
>>> <         setUp_2();
>>> ---
>>>
>>>>         Map streams = setUp_2();
>>>
>>>
>>> 422,423c451,452
>>> <         InputStream e1 = dir.openFile("f11");
>>> <         InputStream e2 = dir.openFile("f3");
>>> ---
>>>
>>>>         InputStream e1 = (InputStream)streams.get("f11");
>>>>         InputStream e2 = (InputStream)streams.get("f3");
>>>
>>>
>>> 426c455
>>> <         InputStream a2 = dir.openFile("f3");
>>> ---
>>>
>>>>         InputStream a2 = cr.openFile("f3");
>>>
>>>
>>> 486,487d514
>>> <         e1.close();
>>> <         e2.close();
>>> 490a518,519
>>>
>>>>
>>>>         closeUp(streams);
>>>
>>>
>>> 497c526
>>> <         setUp_2();
>>> ---
>>>
>>>>         Map streams = setUp_2();
>>>
>>>
>>> 569a599,600
>>>
>>>>
>>>>         closeUp(streams);
>>>
>>>
>>> 574c605
>>> <         setUp_2();
>>> ---
>>>
>>>>         Map streams = setUp_2();
>>>
>>>
>>> 587a619,620
>>>
>>>>
>>>>         closeUp(streams);
>>>
>>>
>>> 592c625
>>> <         setUp_2();
>>> ---
>>>
>>>>         Map streams = setUp_2();
>>>
>>>
>>> 617a651,652
>>>
>>>>
>>>>         closeUp(streams);
>>>
>>>
>>> package org.apache.lucene.index;
>>>
>>> /**
>>>  * Copyright 2004 The Apache Software Foundation
>>>  *
>>>  * Licensed under the Apache License, Version 2.0 (the "License");
>>>  * you may not use this file except in compliance with the License.
>>>  * You may obtain a copy of the License at
>>>  *
>>>  *     http://www.apache.org/licenses/LICENSE-2.0
>>>  *
>>>  * Unless required by applicable law or agreed to in writing, software
>>>  * distributed under the License is distributed on an "AS IS" BASIS,
>>>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
>>> implied.
>>>  * See the License for the specific language governing permissions and
>>>  * limitations under the License.
>>>  */
>>>
>>> import java.io.IOException;
>>> import java.io.File;
>>> import java.util.Collection;
>>> import java.util.HashMap;
>>> import java.util.Iterator;
>>> import java.util.Map;
>>>
>>> import junit.framework.TestCase;
>>> import junit.framework.TestSuite;
>>> import junit.textui.TestRunner;
>>> import org.apache.lucene.store.OutputStream;
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.store.InputStream;
>>> import org.apache.lucene.store.FSDirectory;
>>> import org.apache.lucene.store.RAMDirectory;
>>> import org.apache.lucene.store._TestHelper;
>>>
>>>
>>> /**
>>>  * @author dmitrys@earthlink.net
>>>  * @version $Id: TestCompoundFile.java,v 1.5 2004/03/29 22:48:06  
>>> cutting Exp $
>>>  */
>>> public class TestCompoundFile extends TestCase
>>> {
>>>     /** Main for running test case by itself. */
>>>     public static void main(String args[]) {
>>>         TestRunner.run (new TestSuite(TestCompoundFile.class));
>>> //        TestRunner.run (new TestCompoundFile("testSingleFile"));
>>> //        TestRunner.run (new TestCompoundFile("testTwoFiles"));
>>> //        TestRunner.run (new TestCompoundFile("testRandomFiles"));
>>> //        TestRunner.run (new  
>>> TestCompoundFile("testClonedStreamsClosing"));
>>> //        TestRunner.run (new TestCompoundFile("testReadAfterClose"));
>>> //        TestRunner.run (new TestCompoundFile("testRandomAccess"));
>>> //        TestRunner.run (new  
>>> TestCompoundFile("testRandomAccessClones"));
>>> //        TestRunner.run (new TestCompoundFile("testFileNotFound"));
>>> //        TestRunner.run (new TestCompoundFile("testReadPastEOF"));
>>>
>>> //        TestRunner.run (new TestCompoundFile("testIWCreate"));
>>>
>>>     }
>>>
>>>
>>>     private Directory dir;
>>>
>>>
>>>     public void setUp() throws IOException {
>>>         //dir = new RAMDirectory();
>>>         dir = FSDirectory.getDirectory(new  
>>> File(System.getProperty("tempDir"), "testIndex"), true);
>>>     }
>>>
>>>
>>>     /** Creates a file of the specified size with random data. */
>>>     private void createRandomFile(Directory dir, String name, int size)
>>>     throws IOException
>>>     {
>>>         OutputStream os = dir.createFile(name);
>>>         for (int i=0; i<size; i++) {
>>>             byte b = (byte) (Math.random() * 256);
>>>             os.writeByte(b);
>>>         }
>>>         os.close();
>>>     }
>>>
>>>     /** Creates a file of the specified size with sequential data. 
>>> The  first
>>>      *  byte is written as the start byte provided. All subsequent  
>>> bytes are
>>>      *  computed as start + offset where offset is the number of 
>>> the  byte.
>>>      */
>>>     private void createSequenceFile(Directory dir,
>>>                                     String name,
>>>                                     byte start,
>>>                                     int size)
>>>     throws IOException
>>>     {
>>>         OutputStream os = dir.createFile(name);
>>>         for (int i=0; i < size; i++) {
>>>             os.writeByte(start);
>>>             start ++;
>>>         }
>>>         os.close();
>>>     }
>>>
>>>
>>>     private void assertSameStreams(String msg,
>>>                                    InputStream expected,
>>>                                    InputStream test)
>>>     throws IOException
>>>     {
>>>         assertNotNull(msg + " null expected", expected);
>>>         assertNotNull(msg + " null test", test);
>>>         assertEquals(msg + " length", expected.length(),  
>>> test.length());
>>>         assertEquals(msg + " position", expected.getFilePointer(),
>>>                                         test.getFilePointer());
>>>
>>>         byte expectedBuffer[] = new byte[512];
>>>         byte testBuffer[] = new byte[expectedBuffer.length];
>>>
>>>         long remainder = expected.length() - expected.getFilePointer();
>>>         while(remainder > 0) {
>>>             int readLen = (int) Math.min(remainder,  
>>> expectedBuffer.length);
>>>             expected.readBytes(expectedBuffer, 0, readLen);
>>>             test.readBytes(testBuffer, 0, readLen);
>>>             assertEqualArrays(msg + ", remainder " + remainder,  
>>> expectedBuffer,
>>>                 testBuffer, 0, readLen);
>>>             remainder -= readLen;
>>>         }
>>>     }
>>>
>>>
>>>     private void assertSameStreams(String msg,
>>>                                    InputStream expected,
>>>                                    InputStream actual,
>>>                                    long seekTo)
>>>     throws IOException
>>>     {
>>>         if(seekTo >= 0 && seekTo < expected.length())
>>>         {
>>>             expected.seek(seekTo);
>>>             actual.seek(seekTo);
>>>             assertSameStreams(msg + ", seek(mid)", expected, actual);
>>>         }
>>>     }
>>>
>>>
>>>
>>>     private void assertSameSeekBehavior(String msg,
>>>                                         InputStream expected,
>>>                                         InputStream actual)
>>>     throws IOException
>>>     {
>>>         // seek to 0
>>>         long point = 0;
>>>         assertSameStreams(msg + ", seek(0)", expected, actual, point);
>>>
>>>         // seek to middle
>>>         point = expected.length() / 2l;
>>>         assertSameStreams(msg + ", seek(mid)", expected, actual,  
>>> point);
>>>
>>>         // seek to end - 2
>>>         point = expected.length() - 2;
>>>         assertSameStreams(msg + ", seek(end-2)", expected, actual,  
>>> point);
>>>
>>>         // seek to end - 1
>>>         point = expected.length() - 1;
>>>         assertSameStreams(msg + ", seek(end-1)", expected, actual,  
>>> point);
>>>
>>>         // seek to the end
>>>         point = expected.length();
>>>         assertSameStreams(msg + ", seek(end)", expected, actual,  
>>> point);
>>>
>>>         // seek past end
>>>         point = expected.length() + 1;
>>>         assertSameStreams(msg + ", seek(end+1)", expected, actual,  
>>> point);
>>>     }
>>>
>>>
>>>     private void assertEqualArrays(String msg,
>>>                                    byte[] expected,
>>>                                    byte[] test,
>>>                                    int start,
>>>                                    int len)
>>>     {
>>>         assertNotNull(msg + " null expected", expected);
>>>         assertNotNull(msg + " null test", test);
>>>
>>>         for (int i=start; i<len; i++) {
>>>             assertEquals(msg + " " + i, expected[i], test[i]);
>>>         }
>>>     }
>>>
>>>
>>>     // ===========================================================
>>>     //  Tests of the basic CompoundFile functionality
>>>     // ===========================================================
>>>
>>>
>>>     /** This test creates compound file based on a single file.
>>>      *  Files of different sizes are tested: 0, 1, 10, 100 bytes.
>>>      */
>>>     public void testSingleFile() throws IOException {
>>>         int data[] = new int[] { 0, 1, 10, 100 };
>>>         for (int i=0; i<data.length; i++) {
>>>             String name = "t" + data[i];
>>>             createSequenceFile(dir, name, (byte) 0, data[i]);
>>>
>>>             InputStream expected = dir.openFile(name);
>>>
>>>             CompoundFileWriter csw = new CompoundFileWriter(dir, 
>>> name  + ".cfs");
>>>             csw.addFile(name);
>>>             csw.close();
>>>
>>>             CompoundFileReader csr = new CompoundFileReader(dir, 
>>> name  + ".cfs");
>>>
>>>             InputStream actual = csr.openFile(name);
>>>             assertSameStreams(name, expected, actual);
>>>             assertSameSeekBehavior(name, expected, actual);
>>>
>>>             expected.close();
>>>             actual.close();
>>>             csr.close();
>>>         }
>>>     }
>>>
>>>
>>>     /** This test creates compound file based on two files.
>>>      *
>>>      */
>>>     public void testTwoFiles() throws IOException {
>>>         createSequenceFile(dir, "d1", (byte) 0, 15);
>>>         createSequenceFile(dir, "d2", (byte) 0, 114);
>>>
>>>         InputStream expected1 = dir.openFile("d1");
>>>         InputStream expected2 = dir.openFile("d2");
>>>
>>>         CompoundFileWriter csw = new CompoundFileWriter(dir, "d.csf");
>>>         csw.addFile("d1");
>>>         csw.addFile("d2");
>>>         csw.close();
>>>
>>>         CompoundFileReader csr = new CompoundFileReader(dir, "d.csf");
>>>
>>>         InputStream actual = csr.openFile("d1");
>>>         assertSameStreams("d1", expected1, actual);
>>>         assertSameSeekBehavior("d1", expected1, actual);
>>>         expected1.close();
>>>         actual.close();
>>>
>>>
>>>         actual = csr.openFile("d2");
>>>         assertSameStreams("d2", expected2, actual);
>>>         assertSameSeekBehavior("d2", expected2, actual);
>>>         expected2.close();
>>>         actual.close();
>>>         csr.close();
>>>     }
>>>
>>>     /** This test creates a compound file based on a large number 
>>> of  files of
>>>      *  various length. The file content is generated randomly. The  
>>> sizes range
>>>      *  from 0 to 1Mb. Some of the sizes are selected to test the  
>>> buffering
>>>      *  logic in the file reading code. For this the chunk variable 
>>> is  set to
>>>      *  the length of the buffer used internally by the compound 
>>> file  logic.
>>>      */
>>>     public void testRandomFiles() throws IOException {
>>>         // Setup the test segment
>>>         String segment = "test";
>>>         int chunk = 1024; // internal buffer size used by the stream
>>>         createRandomFile(dir, segment + ".zero", 0);
>>>         createRandomFile(dir, segment + ".one", 1);
>>>         createRandomFile(dir, segment + ".ten", 10);
>>>         createRandomFile(dir, segment + ".hundred", 100);
>>>         createRandomFile(dir, segment + ".big1", chunk);
>>>         createRandomFile(dir, segment + ".big2", chunk - 1);
>>>         createRandomFile(dir, segment + ".big3", chunk + 1);
>>>         createRandomFile(dir, segment + ".big4", 3 * chunk);
>>>         createRandomFile(dir, segment + ".big5", 3 * chunk - 1);
>>>         createRandomFile(dir, segment + ".big6", 3 * chunk + 1);
>>>         createRandomFile(dir, segment + ".big7", 1000 * chunk);
>>>
>>>         // Setup extraneous files
>>>         createRandomFile(dir, "onetwothree", 100);
>>>         createRandomFile(dir, segment + ".notIn", 50);
>>>         createRandomFile(dir, segment + ".notIn2", 51);
>>>
>>>         final String data[] = new String[] {
>>>             ".zero", ".one", ".ten", ".hundred", ".big1", ".big2",  
>>> ".big3",
>>>             ".big4", ".big5", ".big6", ".big7"
>>>         };
>>>
>>>         InputStream[] check = new InputStream[data.length];
>>>         for (int i=0; i<data.length; i++) {
>>>            check[i] = dir.openFile(segment + data[i]);
>>>         }
>>>
>>>         // Now test
>>>         CompoundFileWriter csw = new CompoundFileWriter(dir,  
>>> "test.cfs");
>>>         for (int i=0; i<data.length; i++) {
>>>             csw.addFile(segment + data[i]);
>>>         }
>>>         csw.close();
>>>
>>>         CompoundFileReader csr = new CompoundFileReader(dir,  
>>> "test.cfs");
>>>         for (int i=0; i<data.length; i++) {
>>>             InputStream test = csr.openFile(segment + data[i]);
>>>             assertSameStreams(data[i], check[i], test);
>>>             assertSameSeekBehavior(data[i], check[i], test);
>>>             test.close();
>>>             check[i].close();
>>>         }
>>>         csr.close();
>>>     }
>>>
>>>
>>>     /** Setup a larger compound file with a number of components, 
>>> each  of
>>>      *  which is a sequential file (so that we can easily tell that 
>>> we  are
>>>      *  reading in the right byte). The method sets up 20 files - f0 to f19,
>>>      *  the size of each file is 2000 bytes.
>>>      */
>>>     private Map setUp_2() throws IOException {
>>>             Map streams = new HashMap(20);
>>>         CompoundFileWriter cw = new CompoundFileWriter(dir, "f.comp");
>>>         for (int i=0; i<20; i++) {
>>>             createSequenceFile(dir, "f" + i, (byte) 0, 2000);
>>>             cw.addFile("f" + i);
>>>
>>>             streams.put("f" + i, dir.openFile("f" + i));
>>>         }
>>>         cw.close();
>>>
>>>         return streams;
>>>     }
>>>
>>>     private void closeUp(Map streams) throws IOException {
>>>         Iterator it = streams.values().iterator();
>>>         while (it.hasNext()) {
>>>             InputStream stream = (InputStream)it.next();
>>>             stream.close();
>>>         }
>>>     }
>>>
>>>     public void testReadAfterClose() throws IOException {
>>>         demo_FSInputStreamBug((FSDirectory) dir, "test");
>>>     }
>>>
>>>     private void demo_FSInputStreamBug(FSDirectory fsdir, String file)
>>>     throws IOException
>>>     {
>>>         // Setup the test file - we need more than 1024 bytes
>>>         OutputStream os = fsdir.createFile(file);
>>>         for(int i=0; i<2000; i++) {
>>>             os.writeByte((byte) i);
>>>         }
>>>         os.close();
>>>
>>>         InputStream in = fsdir.openFile(file);
>>>
>>>         // This read primes the buffer in InputStream
>>>         byte b = in.readByte();
>>>
>>>         // Close the file
>>>         in.close();
>>>
>>>         // ERROR: this call should fail, but succeeds because the  
>>> buffer
>>>         // is still filled
>>>         b = in.readByte();
>>>
>>>         // ERROR: this call should fail, but succeeds for some 
>>> reason  as well
>>>         in.seek(1099);
>>>
>>>         try {
>>>             // OK: this call correctly fails. We are now past the 
>>> 1024  internal
>>>             // buffer, so an actual IO is attempted, which fails
>>>             b = in.readByte();
>>>         } catch (IOException e) {
>>>         }
>>>     }
>>>
>>>
>>>     static boolean isCSInputStream(InputStream is) {
>>>         return is instanceof CompoundFileReader.CSInputStream;
>>>     }
>>>
>>>     static boolean isCSInputStreamOpen(InputStream is) throws  
>>> IOException {
>>>         if (isCSInputStream(is)) {
>>>             CompoundFileReader.CSInputStream cis =
>>>             (CompoundFileReader.CSInputStream) is;
>>>
>>>             return _TestHelper.isFSInputStreamOpen(cis.base);
>>>         } else {
>>>             return false;
>>>         }
>>>     }
>>>
>>>
>>>     public void testClonedStreamsClosing() throws IOException {
>>>         Map streams = setUp_2();
>>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>>
>>>         // basic clone
>>>         InputStream expected = (InputStream)streams.get("f11");
>>>         assertTrue(_TestHelper.isFSInputStreamOpen(expected));
>>>
>>>         InputStream one = cr.openFile("f11");
>>>         assertTrue(isCSInputStreamOpen(one));
>>>
>>>         InputStream two = (InputStream) one.clone();
>>>         assertTrue(isCSInputStreamOpen(two));
>>>
>>>         assertSameStreams("basic clone one", expected, one);
>>>         expected.seek(0);
>>>         assertSameStreams("basic clone two", expected, two);
>>>
>>>         // Now close the first stream
>>>         one.close();
>>>         assertTrue("Only close when cr is closed",  
>>> isCSInputStreamOpen(one));
>>>
>>>         // The following should really fail since we couldn't expect to
>>>         // access a file once close has been called on it 
>>> (regardless  of
>>>         // buffering and/or clone magic)
>>>         expected.seek(0);
>>>         two.seek(0);
>>>         assertSameStreams("basic clone two/2", expected, two);
>>>
>>>
>>>         // Now close the compound reader
>>>         cr.close();
>>>         assertFalse("Now closed one", isCSInputStreamOpen(one));
>>>         assertFalse("Now closed two", isCSInputStreamOpen(two));
>>>
>>>         // The following may also fail since the compound stream is  
>>> closed
>>>         expected.seek(0);
>>>         two.seek(0);
>>>         //assertSameStreams("basic clone two/3", expected, two);
>>>
>>>
>>>         // Now close the second clone
>>>         two.close();
>>>         expected.seek(0);
>>>         two.seek(0);
>>>         //assertSameStreams("basic clone two/4", expected, two);
>>>
>>>         closeUp(streams);
>>>
>>>     }
>>>
>>>
>>>     /** This test opens two files from a compound stream and 
>>> verifies  that
>>>      *  their file positions are independent of each other.
>>>      */
>>>     public void testRandomAccess() throws IOException {
>>>         Map streams = setUp_2();
>>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>>
>>>         // Open two files
>>>         InputStream e1 = (InputStream)streams.get("f11");
>>>         InputStream e2 = (InputStream)streams.get("f3");
>>>
>>>         InputStream a1 = cr.openFile("f11");
>>>         InputStream a2 = cr.openFile("f3");
>>>
>>>         // Seek the first pair
>>>         e1.seek(100);
>>>         a1.seek(100);
>>>         assertEquals(100, e1.getFilePointer());
>>>         assertEquals(100, a1.getFilePointer());
>>>         byte be1 = e1.readByte();
>>>         byte ba1 = a1.readByte();
>>>         assertEquals(be1, ba1);
>>>
>>>         // Now seek the second pair
>>>         e2.seek(1027);
>>>         a2.seek(1027);
>>>         assertEquals(1027, e2.getFilePointer());
>>>         assertEquals(1027, a2.getFilePointer());
>>>         byte be2 = e2.readByte();
>>>         byte ba2 = a2.readByte();
>>>         assertEquals(be2, ba2);
>>>
>>>         // Now make sure the first one didn't move
>>>         assertEquals(101, e1.getFilePointer());
>>>         assertEquals(101, a1.getFilePointer());
>>>         be1 = e1.readByte();
>>>         ba1 = a1.readByte();
>>>         assertEquals(be1, ba1);
>>>
>>>         // Now move the first one again, past the buffer length
>>>         e1.seek(1910);
>>>         a1.seek(1910);
>>>         assertEquals(1910, e1.getFilePointer());
>>>         assertEquals(1910, a1.getFilePointer());
>>>         be1 = e1.readByte();
>>>         ba1 = a1.readByte();
>>>         assertEquals(be1, ba1);
>>>
>>>         // Now make sure the second set didn't move
>>>         assertEquals(1028, e2.getFilePointer());
>>>         assertEquals(1028, a2.getFilePointer());
>>>         be2 = e2.readByte();
>>>         ba2 = a2.readByte();
>>>         assertEquals(be2, ba2);
>>>
>>>         // Move the second set back, again cross the buffer size
>>>         e2.seek(17);
>>>         a2.seek(17);
>>>         assertEquals(17, e2.getFilePointer());
>>>         assertEquals(17, a2.getFilePointer());
>>>         be2 = e2.readByte();
>>>         ba2 = a2.readByte();
>>>         assertEquals(be2, ba2);
>>>
>>>         // Finally, make sure the first set didn't move
>>>         // Now make sure the first one didn't move
>>>         assertEquals(1911, e1.getFilePointer());
>>>         assertEquals(1911, a1.getFilePointer());
>>>         be1 = e1.readByte();
>>>         ba1 = a1.readByte();
>>>         assertEquals(be1, ba1);
>>>
>>>         a1.close();
>>>         a2.close();
>>>         cr.close();
>>>
>>>         closeUp(streams);
>>>     }
>>>
>>>     /** This test opens two files from a compound stream and 
>>> verifies  that
>>>      *  their file positions are independent of each other.
>>>      */
>>>     public void testRandomAccessClones() throws IOException {
>>>         Map streams = setUp_2();
>>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>>
>>>         // Open two files
>>>         InputStream e1 = cr.openFile("f11");
>>>         InputStream e2 = cr.openFile("f3");
>>>
>>>         InputStream a1 = (InputStream) e1.clone();
>>>         InputStream a2 = (InputStream) e2.clone();
>>>
>>>         // Seek the first pair
>>>         e1.seek(100);
>>>         a1.seek(100);
>>>         assertEquals(100, e1.getFilePointer());
>>>         assertEquals(100, a1.getFilePointer());
>>>         byte be1 = e1.readByte();
>>>         byte ba1 = a1.readByte();
>>>         assertEquals(be1, ba1);
>>>
>>>         // Now seek the second pair
>>>         e2.seek(1027);
>>>         a2.seek(1027);
>>>         assertEquals(1027, e2.getFilePointer());
>>>         assertEquals(1027, a2.getFilePointer());
>>>         byte be2 = e2.readByte();
>>>         byte ba2 = a2.readByte();
>>>         assertEquals(be2, ba2);
>>>
>>>         // Now make sure the first one didn't move
>>>         assertEquals(101, e1.getFilePointer());
>>>         assertEquals(101, a1.getFilePointer());
>>>         be1 = e1.readByte();
>>>         ba1 = a1.readByte();
>>>         assertEquals(be1, ba1);
>>>
>>>         // Now move the first one again, past the buffer length
>>>         e1.seek(1910);
>>>         a1.seek(1910);
>>>         assertEquals(1910, e1.getFilePointer());
>>>         assertEquals(1910, a1.getFilePointer());
>>>         be1 = e1.readByte();
>>>         ba1 = a1.readByte();
>>>         assertEquals(be1, ba1);
>>>
>>>         // Now make sure the second set didn't move
>>>         assertEquals(1028, e2.getFilePointer());
>>>         assertEquals(1028, a2.getFilePointer());
>>>         be2 = e2.readByte();
>>>         ba2 = a2.readByte();
>>>         assertEquals(be2, ba2);
>>>
>>>         // Move the second set back, again cross the buffer size
>>>         e2.seek(17);
>>>         a2.seek(17);
>>>         assertEquals(17, e2.getFilePointer());
>>>         assertEquals(17, a2.getFilePointer());
>>>         be2 = e2.readByte();
>>>         ba2 = a2.readByte();
>>>         assertEquals(be2, ba2);
>>>
>>>         // Finally, make sure the first set didn't move
>>>         // Now make sure the first one didn't move
>>>         assertEquals(1911, e1.getFilePointer());
>>>         assertEquals(1911, a1.getFilePointer());
>>>         be1 = e1.readByte();
>>>         ba1 = a1.readByte();
>>>         assertEquals(be1, ba1);
>>>
>>>         e1.close();
>>>         e2.close();
>>>         a1.close();
>>>         a2.close();
>>>         cr.close();
>>>
>>>         closeUp(streams);
>>>     }
>>>
>>>
>>>     public void testFileNotFound() throws IOException {
>>>         Map streams = setUp_2();
>>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>>
>>>         // Open two files
>>>         try {
>>>             InputStream e1 = cr.openFile("bogus");
>>>             fail("File not found");
>>>
>>>         } catch (IOException e) {
>>>             /* success */
>>>             //System.out.println("SUCCESS: File Not Found: " + e);
>>>         }
>>>
>>>         cr.close();
>>>
>>>         closeUp(streams);
>>>     }
>>>
>>>
>>>     public void testReadPastEOF() throws IOException {
>>>         Map streams = setUp_2();
>>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>>         InputStream is = cr.openFile("f2");
>>>         is.seek(is.length() - 10);
>>>         byte b[] = new byte[100];
>>>         is.readBytes(b, 0, 10);
>>>
>>>         try {
>>>             byte test = is.readByte();
>>>             fail("Single byte read past end of file");
>>>         } catch (IOException e) {
>>>             /* success */
>>>             //System.out.println("SUCCESS: single byte read past 
>>> end  of file: " + e);
>>>         }
>>>
>>>         is.seek(is.length() - 10);
>>>         try {
>>>             is.readBytes(b, 0, 50);
>>>             fail("Block read past end of file");
>>>         } catch (IOException e) {
>>>             /* success */
>>>             //System.out.println("SUCCESS: block read past end of  
>>> file: " + e);
>>>         }
>>>
>>>         is.close();
>>>         cr.close();
>>>
>>>         closeUp(streams);
>>>     }
>>> }
>>>
>>
>>
>


Re: optimized disk usage when creating a compound index

Posted by Christoph Goller <go...@detego-software.de>.
It will not be lost. I have already reviewed it.
There are some open issues concerning the changes in
TestCompoundFile that I want to discuss with Bernhard,
and then (hopefully next week) I will commit it.

Christoph

Erik Hatcher wrote:
> Bernhard,
> 
> Impressive work.  In order to prevent this from being lost in e-mail,  
> could you please create a new Bugzilla issue for each of your great  
> patches and attach the differences as CVS patches (cvs diff -Nu)?
> 
> Many thanks for these contributions.
> 
>     Erik
> 
> On Aug 6, 2004, at 3:52 AM, Bernhard Messer wrote:
> 
>> hi developers,
>>
>> i made some measurements on lucene disk usage during index creation.  
>> It's no surprise that during index creation,  within the index  
>> optimization, more disk space is necessary than the final index size  
>> will reach. What i didn't expect is such a high difference in disk  
>> size usage, switching the compound file option on or off. Using the  
>> compound file option, the disk usage during index creation is more  
>> than 3 times higher than the final index size. This could be a pain 
>> in  the neck, running projects like nutch, where huge datasets will 
>> be  indexed. The grow rate relies on the fact that SegmentMerger 
>> creates  the fully compound file first, before deleting the original, 
>> unused  files.
>> So i patched SegmentMerger and CompoundFileWriter classes in a way,  
>> that they will delete the file immediatly after copying the data  
>> within the compound. The result was, that we could reduce the  
>> necessary disk space from factor 3 to 2.
>> The change forces to make some modifications within the  
>> TestCompoundFile class also. In several test methods the original 
>> file  was compared to it's compound part. Using the modified 
>> SegmentMerger  and CompoundFileWriter, the file was already deleted 
>> and couldn't be  opened.
>>
>> Here are some statistics about disk usage during index creation:
>>
>> compound option is off:
>> final index size: 380 KB           max. diskspace used: 408 KB
>> final index size: 11079 KB       max. diskspace used: 11381 KB
>> final index size: 204148 KB      max. diskspace used: 20739 KB
>>
>> using compound index:
>> final index size: 380 KB           max. diskspace used: 1145 KB
>> final index size: 11079 KB       max. diskspace used: 33544 KB
>> final index size: 204148 KB      max. diskspace used: 614977 KB
>>
>> using compound index with patch:
>> final index size: 380 KB           max. diskspace used: 777 KB
>> final index size: 11079 KB       max. diskspace used: 22464 KB
>> final index size: 204148 KB      max. diskspace used: 410829
>>
>> The change was tested under windows and linux without any negativ 
>> side  effects. All JUnit test cases work fine. In the attachment 
>> you'll find  all the necessary files:
>>
>> SegmentMerger.java
>> CompoundFileWriter.java
>> TestCompoundFile.java
>>
>> SegmentMerger.diff
>> CompoundFileWriter.diff
>> TestCompoundFile.diff
>>
>> keep moving
>> Bernhard
>>
>>
>> Index: src/java/org/apache/lucene/index/CompoundFileWriter.java
>> ===================================================================
>> RCS file:  
>> /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/ 
>> CompoundFileWriter.java,v
>> retrieving revision 1.3
>> diff -r1.3 CompoundFileWriter.java
>> 163a164,166
>>
>>>
>>>                 // immediately delete the copied file to save disk-space
>>>                 directory.deleteFile((String) fe.file);
>>
>> package org.apache.lucene.index;
>>
>> /**
>>  * Copyright 2004 The Apache Software Foundation
>>  *
>>  * Licensed under the Apache License, Version 2.0 (the "License");
>>  * you may not use this file except in compliance with the License.
>>  * You may obtain a copy of the License at
>>  *
>>  *     http://www.apache.org/licenses/LICENSE-2.0
>>  *
>>  * Unless required by applicable law or agreed to in writing, software
>>  * distributed under the License is distributed on an "AS IS" BASIS,
>>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
>> implied.
>>  * See the License for the specific language governing permissions and
>>  * limitations under the License.
>>  */
>>
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.OutputStream;
>> import org.apache.lucene.store.InputStream;
>> import java.util.LinkedList;
>> import java.util.HashSet;
>> import java.util.Iterator;
>> import java.io.IOException;
>>
>>
>> /**
>>  * Combines multiple files into a single compound file.
>>  * The file format:<br>
>>  * <ul>
>>  *     <li>VInt fileCount</li>
>>  *     <li>{Directory}
>>  *         fileCount entries with the following structure:</li>
>>  *         <ul>
>>  *             <li>long dataOffset</li>
>>  *             <li>UTFString extension</li>
>>  *         </ul>
>>  *     <li>{File Data}
>>  *         fileCount entries with the raw data of the corresponding  
>> file</li>
>>  * </ul>
>>  *
>>  * The fileCount integer indicates how many files are contained in  
>> this compound
>>  * file. The {directory} that follows has that many entries. Each  
>> directory entry
>>  * contains an encoding identifier, a long pointer to the start of this file's
>>  * data section, and a UTF String with that file's extension.
>>  *
>>  * @author Dmitry Serebrennikov
>>  * @version $Id: CompoundFileWriter.java,v 1.3 2004/03/29 22:48:02  
>> cutting Exp $
>>  */
>> final class CompoundFileWriter {
>>
>>     private static final class FileEntry {
>>         /** source file */
>>         String file;
>>
>>         /** temporary holder for the start of directory entry for 
>> this  file */
>>         long directoryOffset;
>>
>>         /** temporary holder for the start of this file's data 
>> section  */
>>         long dataOffset;
>>     }
>>
>>
>>     private Directory directory;
>>     private String fileName;
>>     private HashSet ids;
>>     private LinkedList entries;
>>     private boolean merged = false;
>>
>>
>>     /** Create the compound stream in the specified file. The file  
>> name is the
>>      *  entire name (no extensions are added).
>>      */
>>     public CompoundFileWriter(Directory dir, String name) {
>>         if (dir == null)
>>             throw new IllegalArgumentException("Missing directory");
>>         if (name == null)
>>             throw new IllegalArgumentException("Missing name");
>>
>>         directory = dir;
>>         fileName = name;
>>         ids = new HashSet();
>>         entries = new LinkedList();
>>     }
>>
>>     /** Returns the directory of the compound file. */
>>     public Directory getDirectory() {
>>         return directory;
>>     }
>>
>>     /** Returns the name of the compound file. */
>>     public String getName() {
>>         return fileName;
>>     }
>>
>>     /** Add a source stream. If sourceDir is null, it is set to the
>>      *  same value as the directory where this compound stream exists.
>>      *  The id is the string by which the sub-stream will be known in the
>>      *  compound stream. The caller must ensure that the ID is 
>> unique.  If the
>>      *  id is null, it is set to the name of the source file.
>>      */
>>     public void addFile(String file) {
>>         if (merged)
>>             throw new IllegalStateException(
>>                 "Can't add extensions after merge has been called");
>>
>>         if (file == null)
>>             throw new IllegalArgumentException(
>>                 "Missing source file");
>>
>>         if (! ids.add(file))
>>             throw new IllegalArgumentException(
>>                 "File " + file + " already added");
>>
>>         FileEntry entry = new FileEntry();
>>         entry.file = file;
>>         entries.add(entry);
>>     }
>>
>>     /** Merge files with the extensions added up to now.
>>      *  All files with these extensions are combined sequentially 
>> into  the
>>      *  compound stream. After successful merge, the source files
>>      *  are deleted.
>>      */
>>     public void close() throws IOException {
>>         if (merged)
>>             throw new IllegalStateException(
>>                 "Merge already performed");
>>
>>         if (entries.isEmpty())
>>             throw new IllegalStateException(
>>                 "No entries to merge have been defined");
>>
>>         merged = true;
>>
>>         // open the compound stream
>>         OutputStream os = null;
>>         try {
>>             os = directory.createFile(fileName);
>>
>>             // Write the number of entries
>>             os.writeVInt(entries.size());
>>
>>             // Write the directory with all offsets at 0.
>>             // Remember the positions of directory entries so that we  
>> can
>>             // adjust the offsets later
>>             Iterator it = entries.iterator();
>>             while(it.hasNext()) {
>>                 FileEntry fe = (FileEntry) it.next();
>>                 fe.directoryOffset = os.getFilePointer();
>>                 os.writeLong(0);    // for now
>>                 os.writeString(fe.file);
>>             }
>>
>>             // Open the files and copy their data into the stream.
>>             // Remember the locations of each file's data section.
>>             byte buffer[] = new byte[1024];
>>             it = entries.iterator();
>>             while(it.hasNext()) {
>>                 FileEntry fe = (FileEntry) it.next();
>>                 fe.dataOffset = os.getFilePointer();
>>                 copyFile(fe, os, buffer);
>>
>>                 // immediately delete the copied file to save disk-space
>>                 directory.deleteFile((String) fe.file);
>>             }
>>
>>             // Write the data offsets into the directory of the  
>> compound stream
>>             it = entries.iterator();
>>             while(it.hasNext()) {
>>                 FileEntry fe = (FileEntry) it.next();
>>                 os.seek(fe.directoryOffset);
>>                 os.writeLong(fe.dataOffset);
>>             }
>>
>>             // Close the output stream. Set the os to null before  
>> trying to
>>             // close so that if an exception occurs during the close,  
>> the
>>             // finally clause below will not attempt to close the  stream
>>             // the second time.
>>             OutputStream tmp = os;
>>             os = null;
>>             tmp.close();
>>
>>         } finally {
>>             if (os != null) try { os.close(); } catch (IOException e)  
>> { }
>>         }
>>     }
>>
>>     /** Copy the contents of the file with specified extension into the
>>      *  provided output stream. Use the provided buffer for moving data
>>      *  to reduce memory allocation.
>>      */
>>     private void copyFile(FileEntry source, OutputStream os, byte  
>> buffer[])
>>     throws IOException
>>     {
>>         InputStream is = null;
>>         try {
>>             long startPtr = os.getFilePointer();
>>
>>             is = directory.openFile(source.file);
>>             long length = is.length();
>>             long remainder = length;
>>             int chunk = buffer.length;
>>
>>             while(remainder > 0) {
>>                 int len = (int) Math.min(chunk, remainder);
>>                 is.readBytes(buffer, 0, len);
>>                 os.writeBytes(buffer, len);
>>                 remainder -= len;
>>             }
>>
>>             // Verify that remainder is 0
>>             if (remainder != 0)
>>                 throw new IOException(
>>                     "Non-zero remainder length after copying: " +  
>> remainder
>>                     + " (id: " + source.file + ", length: " + length
>>                     + ", buffer size: " + chunk + ")");
>>
>>             // Verify that the output length diff is equal to 
>> original  file
>>             long endPtr = os.getFilePointer();
>>             long diff = endPtr - startPtr;
>>             if (diff != length)
>>                 throw new IOException(
>>                     "Difference in the output file offsets " + diff
>>                     + " does not match the original file length " +  
>> length);
>>
>>         } finally {
>>             if (is != null) is.close();
>>         }
>>     }
>> }
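
For orientation, here is a minimal sketch of how the directory section written by
CompoundFileWriter.close() above could be read back. This is not the actual
CompoundFileReader; it only assumes the org.apache.lucene.store InputStream calls
already used in these sources (readVInt, readLong, readString), and the class and
method names are made up for the example. Note that entry lengths are not stored:
a reader derives them from the next entry's offset, or from the compound file's
length for the last entry.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.InputStream;

// Sketch only: lists the entries of a compound file produced by
// CompoundFileWriter (hypothetical helper, not part of Lucene).
class CompoundFileLister {

    /** Returns a map from contained file name to its data offset. */
    static Map listEntries(Directory dir, String compoundName) throws IOException {
        Map offsets = new HashMap();
        InputStream in = dir.openFile(compoundName);
        try {
            int count = in.readVInt();            // number of contained files
            for (int i = 0; i < count; i++) {
                long dataOffset = in.readLong();  // start of this file's data
                String name = in.readString();    // name of the original file
                offsets.put(name, new Long(dataOffset));
            }
        } finally {
            in.close();
        }
        return offsets;
    }
}
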
>> Index: src/java/org/apache/lucene/index/SegmentMerger.java
>> ===================================================================
>> RCS file:  
>> /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/ 
>> SegmentMerger.java,v
>> retrieving revision 1.11
>> diff -r1.11 SegmentMerger.java
>> 151c151
>> <     // Perform the merge
>> ---
>>
>>>     // Perform the merge. Files will be deleted within  
>>> CompoundFileWriter.close()
>>
>> 153,158c153
>> <
>> <     // Now delete the source files
>> <     it = files.iterator();
>> <     while (it.hasNext()) {
>> <       directory.deleteFile((String) it.next());
>> <     }
>> ---
>>
>>>
>> package org.apache.lucene.index;
>>
>> /**
>>  * Copyright 2004 The Apache Software Foundation
>>  *
>>  * Licensed under the Apache License, Version 2.0 (the "License");
>>  * you may not use this file except in compliance with the License.
>>  * You may obtain a copy of the License at
>>  *
>>  *     http://www.apache.org/licenses/LICENSE-2.0
>>  *
>>  * Unless required by applicable law or agreed to in writing, software
>>  * distributed under the License is distributed on an "AS IS" BASIS,
>>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
>> implied.
>>  * See the License for the specific language governing permissions and
>>  * limitations under the License.
>>  */
>>
>> import java.util.Vector;
>> import java.util.ArrayList;
>> import java.util.Iterator;
>> import java.io.IOException;
>>
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.OutputStream;
>> import org.apache.lucene.store.RAMOutputStream;
>>
>> /**
>>  * The SegmentMerger class combines two or more Segments, represented  
>> by an IndexReader ({@link #add},
>>  * into a single Segment.  After adding the appropriate readers, call  
>> the merge method to combine the
>>  * segments.
>>  *<P>
>>  * If the compoundFile flag is set, then the segments will be merged  
>> into a compound file.
>>  *
>>  *
>>  * @see #merge
>>  * @see #add
>>  */
>> final class SegmentMerger {
>>   private boolean useCompoundFile;
>>   private Directory directory;
>>   private String segment;
>>
>>   private Vector readers = new Vector();
>>   private FieldInfos fieldInfos;
>>
>>   // File extensions of old-style index files
>>   private static final String COMPOUND_EXTENSIONS[] = new String[] {
>>     "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis"
>>   };
>>   private static final String VECTOR_EXTENSIONS[] = new String[] {
>>     "tvx", "tvd", "tvf"
>>   };
>>
>>   /**
>>    *
>>    * @param dir The Directory to merge the other segments into
>>    * @param name The name of the new segment
>>    * @param compoundFile true if the new segment should use a  
>> compoundFile
>>    */
>>   SegmentMerger(Directory dir, String name, boolean compoundFile) {
>>     directory = dir;
>>     segment = name;
>>     useCompoundFile = compoundFile;
>>   }
>>
>>   /**
>>    * Add an IndexReader to the collection of readers that are to be  
>> merged
>>    * @param reader
>>    */
>>   final void add(IndexReader reader) {
>>     readers.addElement(reader);
>>   }
>>
>>   /**
>>    *
>>    * @param i The index of the reader to return
>>    * @return The ith reader to be merged
>>    */
>>   final IndexReader segmentReader(int i) {
>>     return (IndexReader) readers.elementAt(i);
>>   }
>>
>>   /**
>>    * Merges the readers specified by the {@link #add} method into the  
>> directory passed to the constructor
>>    * @return The number of documents that were merged
>>    * @throws IOException
>>    */
>>   final int merge() throws IOException {
>>     int value;
>>
>>     value = mergeFields();
>>     mergeTerms();
>>     mergeNorms();
>>
>>     if (fieldInfos.hasVectors())
>>       mergeVectors();
>>
>>     if (useCompoundFile)
>>       createCompoundFile();
>>
>>     return value;
>>   }
>>
>>   /**
>>    * close all IndexReaders that have been added.
>>    * Should not be called before merge().
>>    * @throws IOException
>>    */
>>   final void closeReaders() throws IOException {
>>     for (int i = 0; i < readers.size(); i++) {  // close readers
>>       IndexReader reader = (IndexReader) readers.elementAt(i);
>>       reader.close();
>>     }
>>   }
>>
>>   private final void createCompoundFile()
>>           throws IOException {
>>     CompoundFileWriter cfsWriter =
>>             new CompoundFileWriter(directory, segment + ".cfs");
>>
>>     ArrayList files =
>>       new ArrayList(COMPOUND_EXTENSIONS.length + fieldInfos.size());
>>
>>     // Basic files
>>     for (int i = 0; i < COMPOUND_EXTENSIONS.length; i++) {
>>       files.add(segment + "." + COMPOUND_EXTENSIONS[i]);
>>     }
>>
>>     // Field norm files
>>     for (int i = 0; i < fieldInfos.size(); i++) {
>>       FieldInfo fi = fieldInfos.fieldInfo(i);
>>       if (fi.isIndexed) {
>>         files.add(segment + ".f" + i);
>>       }
>>     }
>>
>>     // Vector files
>>     if (fieldInfos.hasVectors()) {
>>       for (int i = 0; i < VECTOR_EXTENSIONS.length; i++) {
>>         files.add(segment + "." + VECTOR_EXTENSIONS[i]);
>>       }
>>     }
>>
>>     // Now merge all added files
>>     Iterator it = files.iterator();
>>     while (it.hasNext()) {
>>       cfsWriter.addFile((String) it.next());
>>     }
>>
>>     // Perform the merge. Files will be deleted within  
>> CompoundFileWriter.close()
>>     cfsWriter.close();
>>
>>   }
>>
>>   /**
>>    *
>>    * @return The number of documents in all of the readers
>>    * @throws IOException
>>    */
>>   private final int mergeFields() throws IOException {
>>     fieldInfos = new FieldInfos();          // merge field names
>>     int docCount = 0;
>>     for (int i = 0; i < readers.size(); i++) {
>>       IndexReader reader = (IndexReader) readers.elementAt(i);
>>       fieldInfos.addIndexed(reader.getIndexedFieldNames(true), true);
>>       fieldInfos.addIndexed(reader.getIndexedFieldNames(false), false);
>>       fieldInfos.add(reader.getFieldNames(false), false);
>>     }
>>     fieldInfos.write(directory, segment + ".fnm");
>>
>>     FieldsWriter fieldsWriter = // merge field values
>>             new FieldsWriter(directory, segment, fieldInfos);
>>     try {
>>       for (int i = 0; i < readers.size(); i++) {
>>         IndexReader reader = (IndexReader) readers.elementAt(i);
>>         int maxDoc = reader.maxDoc();
>>         for (int j = 0; j < maxDoc; j++)
>>           if (!reader.isDeleted(j)) {               // skip deleted  docs
>>             fieldsWriter.addDocument(reader.document(j));
>>             docCount++;
>>           }
>>       }
>>     } finally {
>>       fieldsWriter.close();
>>     }
>>     return docCount;
>>   }
>>
>>   /**
>>    * Merge the TermVectors from each of the segments into the new one.
>>    * @throws IOException
>>    */
>>   private final void mergeVectors() throws IOException {
>>     TermVectorsWriter termVectorsWriter =
>>       new TermVectorsWriter(directory, segment, fieldInfos);
>>
>>     try {
>>       for (int r = 0; r < readers.size(); r++) {
>>         IndexReader reader = (IndexReader) readers.elementAt(r);
>>         int maxDoc = reader.maxDoc();
>>         for (int docNum = 0; docNum < maxDoc; docNum++) {
>>           // skip deleted docs
>>           if (reader.isDeleted(docNum)) {
>>             continue;
>>           }
>>           termVectorsWriter.openDocument();
>>
>>           // get all term vectors
>>           TermFreqVector[] sourceTermVector =
>>             reader.getTermFreqVectors(docNum);
>>
>>           if (sourceTermVector != null) {
>>             for (int f = 0; f < sourceTermVector.length; f++) {
>>               // translate field numbers
>>               TermFreqVector termVector = sourceTermVector[f];
>>               termVectorsWriter.openField(termVector.getField());
>>               String [] terms = termVector.getTerms();
>>               int [] freqs = termVector.getTermFrequencies();
>>
>>               for (int t = 0; t < terms.length; t++) {
>>                 termVectorsWriter.addTerm(terms[t], freqs[t]);
>>               }
>>             }
>>             termVectorsWriter.closeDocument();
>>           }
>>         }
>>       }
>>     } finally {
>>       termVectorsWriter.close();
>>     }
>>   }
>>
>>   private OutputStream freqOutput = null;
>>   private OutputStream proxOutput = null;
>>   private TermInfosWriter termInfosWriter = null;
>>   private int skipInterval;
>>   private SegmentMergeQueue queue = null;
>>
>>   private final void mergeTerms() throws IOException {
>>     try {
>>       freqOutput = directory.createFile(segment + ".frq");
>>       proxOutput = directory.createFile(segment + ".prx");
>>       termInfosWriter =
>>               new TermInfosWriter(directory, segment, fieldInfos);
>>       skipInterval = termInfosWriter.skipInterval;
>>       queue = new SegmentMergeQueue(readers.size());
>>
>>       mergeTermInfos();
>>
>>     } finally {
>>       if (freqOutput != null) freqOutput.close();
>>       if (proxOutput != null) proxOutput.close();
>>       if (termInfosWriter != null) termInfosWriter.close();
>>       if (queue != null) queue.close();
>>     }
>>   }
>>
>>   private final void mergeTermInfos() throws IOException {
>>     int base = 0;
>>     for (int i = 0; i < readers.size(); i++) {
>>       IndexReader reader = (IndexReader) readers.elementAt(i);
>>       TermEnum termEnum = reader.terms();
>>       SegmentMergeInfo smi = new SegmentMergeInfo(base, termEnum,  
>> reader);
>>       base += reader.numDocs();
>>       if (smi.next())
>>         queue.put(smi);                  // initialize queue
>>       else
>>         smi.close();
>>     }
>>
>>     SegmentMergeInfo[] match = new SegmentMergeInfo[readers.size()];
>>
>>     while (queue.size() > 0) {
>>       int matchSize = 0;              // pop matching terms
>>       match[matchSize++] = (SegmentMergeInfo) queue.pop();
>>       Term term = match[0].term;
>>       SegmentMergeInfo top = (SegmentMergeInfo) queue.top();
>>
>>       while (top != null && term.compareTo(top.term) == 0) {
>>         match[matchSize++] = (SegmentMergeInfo) queue.pop();
>>         top = (SegmentMergeInfo) queue.top();
>>       }
>>
>>       mergeTermInfo(match, matchSize);          // add new TermInfo
>>
>>       while (matchSize > 0) {
>>         SegmentMergeInfo smi = match[--matchSize];
>>         if (smi.next())
>>           queue.put(smi);              // restore queue
>>         else
>>           smi.close();                  // done with a segment
>>       }
>>     }
>>   }
>>
>>   private final TermInfo termInfo = new TermInfo(); // minimize consing
>>
>>   /** Merge one term found in one or more segments. The array  
>> <code>smis</code>
>>    *  contains segments that are positioned at the same term.  
>> <code>N</code>
>>    *  is the number of cells in the array actually occupied.
>>    *
>>    * @param smis array of segments
>>    * @param n number of cells in the array actually occupied
>>    */
>>   private final void mergeTermInfo(SegmentMergeInfo[] smis, int n)
>>           throws IOException {
>>     long freqPointer = freqOutput.getFilePointer();
>>     long proxPointer = proxOutput.getFilePointer();
>>
>>     int df = appendPostings(smis, n);          // append posting data
>>
>>     long skipPointer = writeSkip();
>>
>>     if (df > 0) {
>>       // add an entry to the dictionary with pointers to prox and 
>> freq  files
>>       termInfo.set(df, freqPointer, proxPointer, (int) (skipPointer -  
>> freqPointer));
>>       termInfosWriter.add(smis[0].term, termInfo);
>>     }
>>   }
>>
>>   /** Process postings from multiple segments all positioned on the
>>    *  same term. Writes out merged entries into freqOutput and
>>    *  the proxOutput streams.
>>    *
>>    * @param smis array of segments
>>    * @param n number of cells in the array actually occupied
>>    * @return number of documents across all segments where this term  
>> was found
>>    */
>>   private final int appendPostings(SegmentMergeInfo[] smis, int n)
>>           throws IOException {
>>     int lastDoc = 0;
>>     int df = 0;                      // number of docs w/ term
>>     resetSkip();
>>     for (int i = 0; i < n; i++) {
>>       SegmentMergeInfo smi = smis[i];
>>       TermPositions postings = smi.postings;
>>       int base = smi.base;
>>       int[] docMap = smi.docMap;
>>       postings.seek(smi.termEnum);
>>       while (postings.next()) {
>>         int doc = postings.doc();
>>         if (docMap != null)
>>           doc = docMap[doc];                      // map around  
>> deletions
>>         doc += base;                              // convert to 
>> merged  space
>>
>>         if (doc < lastDoc)
>>           throw new IllegalStateException("docs out of order");
>>
>>         df++;
>>
>>         if ((df % skipInterval) == 0) {
>>           bufferSkip(lastDoc);
>>         }
>>
>>         int docCode = (doc - lastDoc) << 1;      // use low bit to 
>> flag  freq=1
>>         lastDoc = doc;
>>
>>         int freq = postings.freq();
>>         if (freq == 1) {
>>           freqOutput.writeVInt(docCode | 1);      // write doc & freq=1
>>         } else {
>>           freqOutput.writeVInt(docCode);      // write doc
>>           freqOutput.writeVInt(freq);          // write frequency in doc
>>         }
>>
>>         int lastPosition = 0;              // write position deltas
>>         for (int j = 0; j < freq; j++) {
>>           int position = postings.nextPosition();
>>           proxOutput.writeVInt(position - lastPosition);
>>           lastPosition = position;
>>         }
>>       }
>>     }
>>     return df;
>>   }
>>
>>   private RAMOutputStream skipBuffer = new RAMOutputStream();
>>   private int lastSkipDoc;
>>   private long lastSkipFreqPointer;
>>   private long lastSkipProxPointer;
>>
>>   private void resetSkip() throws IOException {
>>     skipBuffer.reset();
>>     lastSkipDoc = 0;
>>     lastSkipFreqPointer = freqOutput.getFilePointer();
>>     lastSkipProxPointer = proxOutput.getFilePointer();
>>   }
>>
>>   private void bufferSkip(int doc) throws IOException {
>>     long freqPointer = freqOutput.getFilePointer();
>>     long proxPointer = proxOutput.getFilePointer();
>>
>>     skipBuffer.writeVInt(doc - lastSkipDoc);
>>     skipBuffer.writeVInt((int) (freqPointer - lastSkipFreqPointer));
>>     skipBuffer.writeVInt((int) (proxPointer - lastSkipProxPointer));
>>
>>     lastSkipDoc = doc;
>>     lastSkipFreqPointer = freqPointer;
>>     lastSkipProxPointer = proxPointer;
>>   }
>>
>>   private long writeSkip() throws IOException {
>>     long skipPointer = freqOutput.getFilePointer();
>>     skipBuffer.writeTo(freqOutput);
>>     return skipPointer;
>>   }
>>
>>   private void mergeNorms() throws IOException {
>>     for (int i = 0; i < fieldInfos.size(); i++) {
>>       FieldInfo fi = fieldInfos.fieldInfo(i);
>>       if (fi.isIndexed) {
>>         OutputStream output = directory.createFile(segment + ".f" + i);
>>         try {
>>           for (int j = 0; j < readers.size(); j++) {
>>             IndexReader reader = (IndexReader) readers.elementAt(j);
>>             byte[] input = reader.norms(fi.name);
>>             int maxDoc = reader.maxDoc();
>>             for (int k = 0; k < maxDoc; k++) {
>>               byte norm = input != null ? input[k] : (byte) 0;
>>               if (!reader.isDeleted(k)) {
>>                 output.writeByte(norm);
>>               }
>>             }
>>           }
>>         } finally {
>>           output.close();
>>         }
>>       }
>>     }
>>   }
>>
>> }
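
To show how the pieces above fit together, here is a rough usage sketch of
SegmentMerger following the call sequence described in its class javadoc (add the
readers, call merge, then closeReaders). SegmentMerger is package-private, so this
is purely illustrative; the MergeSketch class, the mergeSegments method and its
parameters are made up for the example.

package org.apache.lucene.index;   // SegmentMerger is package-private

import java.io.IOException;

import org.apache.lucene.store.Directory;

class MergeSketch {

    /** Merges the given segment readers into a new segment named newSeg. */
    static int mergeSegments(Directory dir, String newSeg, IndexReader[] readers)
            throws IOException {
        // true = pack the resulting segment into a single .cfs compound file
        SegmentMerger merger = new SegmentMerger(dir, newSeg, true);
        for (int i = 0; i < readers.length; i++) {
            merger.add(readers[i]);
        }
        int docCount = merger.merge();  // writes .fnm/.frq/.prx/... and the .cfs
        merger.closeReaders();          // close the source segment readers afterwards
        return docCount;
    }
}
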
>> Index: src/test/org/apache/lucene/index/TestCompoundFile.java
>> ===================================================================
>> RCS file:  
>> /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/index/ 
>> TestCompoundFile.java,v
>> retrieving revision 1.5
>> diff -r1.5 TestCompoundFile.java
>> 20a21,24
>>
>>> import java.util.Collection;
>>> import java.util.HashMap;
>>> import java.util.Iterator;
>>> import java.util.Map;
>>
>> 197a202,204
>>
>>>
>>>             InputStream expected = dir.openFile(name);
>>>
>> 203c210
>> <             InputStream expected = dir.openFile(name);
>> ---
>>
>>>
>> 206a214
>>
>>>
>> 220a229,231
>>
>>>         InputStream expected1 = dir.openFile("d1");
>>>         InputStream expected2 = dir.openFile("d2");
>>>
>> 227c238
>> <         InputStream expected = dir.openFile("d1");
>> ---
>>
>>>
>> 229,231c240,242
>> <         assertSameStreams("d1", expected, actual);
>> <         assertSameSeekBehavior("d1", expected, actual);
>> <         expected.close();
>> ---
>>
>>>         assertSameStreams("d1", expected1, actual);
>>>         assertSameSeekBehavior("d1", expected1, actual);
>>>         expected1.close();
>>
>> 234c245
>> <         expected = dir.openFile("d2");
>> ---
>>
>>>
>> 236,238c247,249
>> <         assertSameStreams("d2", expected, actual);
>> <         assertSameSeekBehavior("d2", expected, actual);
>> <         expected.close();
>> ---
>>
>>>         assertSameStreams("d2", expected2, actual);
>>>         assertSameSeekBehavior("d2", expected2, actual);
>>>         expected2.close();
>>
>> 270,271d280
>> <         // Now test
>> <         CompoundFileWriter csw = new CompoundFileWriter(dir,  
>> "test.cfs");
>> 275a285,292
>>
>>>
>>>         InputStream[] check = new InputStream[data.length];
>>>         for (int i=0; i<data.length; i++) {
>>>            check[i] = dir.openFile(segment + data[i]);
>>>         }
>>>
>>>         // Now test
>>>         CompoundFileWriter csw = new CompoundFileWriter(dir,  
>>> "test.cfs");
>>
>> 283d299
>> <             InputStream check = dir.openFile(segment + data[i]);
>> 285,286c301,302
>> <             assertSameStreams(data[i], check, test);
>> <             assertSameSeekBehavior(data[i], check, test);
>> ---
>>
>>>             assertSameStreams(data[i], check[i], test);
>>>             assertSameSeekBehavior(data[i], check[i], test);
>>
>> 288c304
>> <             check.close();
>> ---
>>
>>>             check[i].close();
>>
>> 299c315,316
>> <     private void setUp_2() throws IOException {
>> ---
>>
>>>     private Map setUp_2() throws IOException {
>>>             Map streams = new HashMap(20);
>>
>> 303a321,322
>>
>>>
>>>             streams.put("f" + i, dir.openFile("f" + i));
>>
>> 305a325,326
>>
>>>
>>>         return streams;
>>
>> 308c329,336
>> <
>> ---
>>
>>>     private void closeUp(Map streams) throws IOException {
>>>         Iterator it = streams.values().iterator();
>>>         while (it.hasNext()) {
>>>             InputStream stream = (InputStream)it.next();
>>>             stream.close();
>>>         }
>>>     }
>>>
>> 364c392
>> <         setUp_2();
>> ---
>>
>>>         Map streams = setUp_2();
>>
>> 368c396
>> <         InputStream expected = dir.openFile("f11");
>> ---
>>
>>>         InputStream expected = (InputStream)streams.get("f11");
>>
>> 410c438,439
>> <         expected.close();
>> ---
>>
>>>         closeUp(streams);
>>>
>> 418c447
>> <         setUp_2();
>> ---
>>
>>>         Map streams = setUp_2();
>>
>> 422,423c451,452
>> <         InputStream e1 = dir.openFile("f11");
>> <         InputStream e2 = dir.openFile("f3");
>> ---
>>
>>>         InputStream e1 = (InputStream)streams.get("f11");
>>>         InputStream e2 = (InputStream)streams.get("f3");
>>
>> 426c455
>> <         InputStream a2 = dir.openFile("f3");
>> ---
>>
>>>         InputStream a2 = cr.openFile("f3");
>>
>> 486,487d514
>> <         e1.close();
>> <         e2.close();
>> 490a518,519
>>
>>>
>>>         closeUp(streams);
>>
>> 497c526
>> <         setUp_2();
>> ---
>>
>>>         Map streams = setUp_2();
>>
>> 569a599,600
>>
>>>
>>>         closeUp(streams);
>>
>> 574c605
>> <         setUp_2();
>> ---
>>
>>>         Map streams = setUp_2();
>>
>> 587a619,620
>>
>>>
>>>         closeUp(streams);
>>
>> 592c625
>> <         setUp_2();
>> ---
>>
>>>         Map streams = setUp_2();
>>
>> 617a651,652
>>
>>>
>>>         closeUp(streams);
>>
>> package org.apache.lucene.index;
>>
>> /**
>>  * Copyright 2004 The Apache Software Foundation
>>  *
>>  * Licensed under the Apache License, Version 2.0 (the "License");
>>  * you may not use this file except in compliance with the License.
>>  * You may obtain a copy of the License at
>>  *
>>  *     http://www.apache.org/licenses/LICENSE-2.0
>>  *
>>  * Unless required by applicable law or agreed to in writing, software
>>  * distributed under the License is distributed on an "AS IS" BASIS,
>>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
>> implied.
>>  * See the License for the specific language governing permissions and
>>  * limitations under the License.
>>  */
>>
>> import java.io.IOException;
>> import java.io.File;
>> import java.util.Collection;
>> import java.util.HashMap;
>> import java.util.Iterator;
>> import java.util.Map;
>>
>> import junit.framework.TestCase;
>> import junit.framework.TestSuite;
>> import junit.textui.TestRunner;
>> import org.apache.lucene.store.OutputStream;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.InputStream;
>> import org.apache.lucene.store.FSDirectory;
>> import org.apache.lucene.store.RAMDirectory;
>> import org.apache.lucene.store._TestHelper;
>>
>>
>> /**
>>  * @author dmitrys@earthlink.net
>>  * @version $Id: TestCompoundFile.java,v 1.5 2004/03/29 22:48:06  
>> cutting Exp $
>>  */
>> public class TestCompoundFile extends TestCase
>> {
>>     /** Main for running test case by itself. */
>>     public static void main(String args[]) {
>>         TestRunner.run (new TestSuite(TestCompoundFile.class));
>> //        TestRunner.run (new TestCompoundFile("testSingleFile"));
>> //        TestRunner.run (new TestCompoundFile("testTwoFiles"));
>> //        TestRunner.run (new TestCompoundFile("testRandomFiles"));
>> //        TestRunner.run (new  
>> TestCompoundFile("testClonedStreamsClosing"));
>> //        TestRunner.run (new TestCompoundFile("testReadAfterClose"));
>> //        TestRunner.run (new TestCompoundFile("testRandomAccess"));
>> //        TestRunner.run (new  
>> TestCompoundFile("testRandomAccessClones"));
>> //        TestRunner.run (new TestCompoundFile("testFileNotFound"));
>> //        TestRunner.run (new TestCompoundFile("testReadPastEOF"));
>>
>> //        TestRunner.run (new TestCompoundFile("testIWCreate"));
>>
>>     }
>>
>>
>>     private Directory dir;
>>
>>
>>     public void setUp() throws IOException {
>>         //dir = new RAMDirectory();
>>         dir = FSDirectory.getDirectory(new  
>> File(System.getProperty("tempDir"), "testIndex"), true);
>>     }
>>
>>
>>     /** Creates a file of the specified size with random data. */
>>     private void createRandomFile(Directory dir, String name, int size)
>>     throws IOException
>>     {
>>         OutputStream os = dir.createFile(name);
>>         for (int i=0; i<size; i++) {
>>             byte b = (byte) (Math.random() * 256);
>>             os.writeByte(b);
>>         }
>>         os.close();
>>     }
>>
>>     /** Creates a file of the specified size with sequential data. 
>> The  first
>>      *  byte is written as the start byte provided. All subsequent  
>> bytes are
>>      *  computed as start + offset where offset is the number of the  
>> byte.
>>      */
>>     private void createSequenceFile(Directory dir,
>>                                     String name,
>>                                     byte start,
>>                                     int size)
>>     throws IOException
>>     {
>>         OutputStream os = dir.createFile(name);
>>         for (int i=0; i < size; i++) {
>>             os.writeByte(start);
>>             start ++;
>>         }
>>         os.close();
>>     }
>>
>>
>>     private void assertSameStreams(String msg,
>>                                    InputStream expected,
>>                                    InputStream test)
>>     throws IOException
>>     {
>>         assertNotNull(msg + " null expected", expected);
>>         assertNotNull(msg + " null test", test);
>>         assertEquals(msg + " length", expected.length(),  test.length());
>>         assertEquals(msg + " position", expected.getFilePointer(),
>>                                         test.getFilePointer());
>>
>>         byte expectedBuffer[] = new byte[512];
>>         byte testBuffer[] = new byte[expectedBuffer.length];
>>
>>         long remainder = expected.length() - expected.getFilePointer();
>>         while(remainder > 0) {
>>             int readLen = (int) Math.min(remainder,  
>> expectedBuffer.length);
>>             expected.readBytes(expectedBuffer, 0, readLen);
>>             test.readBytes(testBuffer, 0, readLen);
>>             assertEqualArrays(msg + ", remainder " + remainder,  
>> expectedBuffer,
>>                 testBuffer, 0, readLen);
>>             remainder -= readLen;
>>         }
>>     }
>>
>>
>>     private void assertSameStreams(String msg,
>>                                    InputStream expected,
>>                                    InputStream actual,
>>                                    long seekTo)
>>     throws IOException
>>     {
>>         if(seekTo >= 0 && seekTo < expected.length())
>>         {
>>             expected.seek(seekTo);
>>             actual.seek(seekTo);
>>             assertSameStreams(msg + ", seek(mid)", expected, actual);
>>         }
>>     }
>>
>>
>>
>>     private void assertSameSeekBehavior(String msg,
>>                                         InputStream expected,
>>                                         InputStream actual)
>>     throws IOException
>>     {
>>         // seek to 0
>>         long point = 0;
>>         assertSameStreams(msg + ", seek(0)", expected, actual, point);
>>
>>         // seek to middle
>>         point = expected.length() / 2l;
>>         assertSameStreams(msg + ", seek(mid)", expected, actual,  point);
>>
>>         // seek to end - 2
>>         point = expected.length() - 2;
>>         assertSameStreams(msg + ", seek(end-2)", expected, actual,  
>> point);
>>
>>         // seek to end - 1
>>         point = expected.length() - 1;
>>         assertSameStreams(msg + ", seek(end-1)", expected, actual,  
>> point);
>>
>>         // seek to the end
>>         point = expected.length();
>>         assertSameStreams(msg + ", seek(end)", expected, actual,  point);
>>
>>         // seek past end
>>         point = expected.length() + 1;
>>         assertSameStreams(msg + ", seek(end+1)", expected, actual,  
>> point);
>>     }
>>
>>
>>     private void assertEqualArrays(String msg,
>>                                    byte[] expected,
>>                                    byte[] test,
>>                                    int start,
>>                                    int len)
>>     {
>>         assertNotNull(msg + " null expected", expected);
>>         assertNotNull(msg + " null test", test);
>>
>>         for (int i=start; i<len; i++) {
>>             assertEquals(msg + " " + i, expected[i], test[i]);
>>         }
>>     }
>>
>>
>>     // ===========================================================
>>     //  Tests of the basic CompoundFile functionality
>>     // ===========================================================
>>
>>
>>     /** This test creates a compound file based on a single file.
>>      *  Files of different sizes are tested: 0, 1, 10, 100 bytes.
>>      */
>>     public void testSingleFile() throws IOException {
>>         int data[] = new int[] { 0, 1, 10, 100 };
>>         for (int i=0; i<data.length; i++) {
>>             String name = "t" + data[i];
>>             createSequenceFile(dir, name, (byte) 0, data[i]);
>>
>>             InputStream expected = dir.openFile(name);
>>
>>             CompoundFileWriter csw = new CompoundFileWriter(dir, name  
>> + ".cfs");
>>             csw.addFile(name);
>>             csw.close();
>>
>>             CompoundFileReader csr = new CompoundFileReader(dir, name  
>> + ".cfs");
>>
>>             InputStream actual = csr.openFile(name);
>>             assertSameStreams(name, expected, actual);
>>             assertSameSeekBehavior(name, expected, actual);
>>
>>             expected.close();
>>             actual.close();
>>             csr.close();
>>         }
>>     }
>>
>>
>>     /** This test creates a compound file based on two files.
>>      *
>>      */
>>     public void testTwoFiles() throws IOException {
>>         createSequenceFile(dir, "d1", (byte) 0, 15);
>>         createSequenceFile(dir, "d2", (byte) 0, 114);
>>
>>         InputStream expected1 = dir.openFile("d1");
>>         InputStream expected2 = dir.openFile("d2");
>>
>>         CompoundFileWriter csw = new CompoundFileWriter(dir, "d.csf");
>>         csw.addFile("d1");
>>         csw.addFile("d2");
>>         csw.close();
>>
>>         CompoundFileReader csr = new CompoundFileReader(dir, "d.csf");
>>
>>         InputStream actual = csr.openFile("d1");
>>         assertSameStreams("d1", expected1, actual);
>>         assertSameSeekBehavior("d1", expected1, actual);
>>         expected1.close();
>>         actual.close();
>>
>>
>>         actual = csr.openFile("d2");
>>         assertSameStreams("d2", expected2, actual);
>>         assertSameSeekBehavior("d2", expected2, actual);
>>         expected2.close();
>>         actual.close();
>>         csr.close();
>>     }
>>
>>     /** This test creates a compound file based on a large number of  
>> files of
>>      *  various length. The file content is generated randomly. The  
>> sizes range
>>      *  from 0 to 1Mb. Some of the sizes are selected to test the  
>> buffering
>>      *  logic in the file reading code. For this the chunk variable 
>> is  set to
>>      *  the length of the buffer used internally by the compound file  
>> logic.
>>      */
>>     public void testRandomFiles() throws IOException {
>>         // Setup the test segment
>>         String segment = "test";
>>         int chunk = 1024; // internal buffer size used by the stream
>>         createRandomFile(dir, segment + ".zero", 0);
>>         createRandomFile(dir, segment + ".one", 1);
>>         createRandomFile(dir, segment + ".ten", 10);
>>         createRandomFile(dir, segment + ".hundred", 100);
>>         createRandomFile(dir, segment + ".big1", chunk);
>>         createRandomFile(dir, segment + ".big2", chunk - 1);
>>         createRandomFile(dir, segment + ".big3", chunk + 1);
>>         createRandomFile(dir, segment + ".big4", 3 * chunk);
>>         createRandomFile(dir, segment + ".big5", 3 * chunk - 1);
>>         createRandomFile(dir, segment + ".big6", 3 * chunk + 1);
>>         createRandomFile(dir, segment + ".big7", 1000 * chunk);
>>
>>         // Setup extraneous files
>>         createRandomFile(dir, "onetwothree", 100);
>>         createRandomFile(dir, segment + ".notIn", 50);
>>         createRandomFile(dir, segment + ".notIn2", 51);
>>
>>         final String data[] = new String[] {
>>             ".zero", ".one", ".ten", ".hundred", ".big1", ".big2",  
>> ".big3",
>>             ".big4", ".big5", ".big6", ".big7"
>>         };
>>
>>         InputStream[] check = new InputStream[data.length];
>>         for (int i=0; i<data.length; i++) {
>>            check[i] = dir.openFile(segment + data[i]);
>>         }
>>
>>         // Now test
>>         CompoundFileWriter csw = new CompoundFileWriter(dir,  
>> "test.cfs");
>>         for (int i=0; i<data.length; i++) {
>>             csw.addFile(segment + data[i]);
>>         }
>>         csw.close();
>>
>>         CompoundFileReader csr = new CompoundFileReader(dir,  
>> "test.cfs");
>>         for (int i=0; i<data.length; i++) {
>>             InputStream test = csr.openFile(segment + data[i]);
>>             assertSameStreams(data[i], check[i], test);
>>             assertSameSeekBehavior(data[i], check[i], test);
>>             test.close();
>>             check[i].close();
>>         }
>>         csr.close();
>>     }
>>
>>
>>     /** Setup a larger compound file with a number of components, 
>> each  of
>>      *  which is a sequential file (so that we can easily tell that 
>> we  are
>>      *  reading in the right byte). The method sets up 20 files - f0 to f19,
>>      *  the size of each file is 2000 bytes.
>>      */
>>     private Map setUp_2() throws IOException {
>>             Map streams = new HashMap(20);
>>         CompoundFileWriter cw = new CompoundFileWriter(dir, "f.comp");
>>         for (int i=0; i<20; i++) {
>>             createSequenceFile(dir, "f" + i, (byte) 0, 2000);
>>             cw.addFile("f" + i);
>>
>>             streams.put("f" + i, dir.openFile("f" + i));
>>         }
>>         cw.close();
>>
>>         return streams;
>>     }
>>
>>     private void closeUp(Map streams) throws IOException {
>>         Iterator it = streams.values().iterator();
>>         while (it.hasNext()) {
>>             InputStream stream = (InputStream)it.next();
>>             stream.close();
>>         }
>>     }
>>
>>     public void testReadAfterClose() throws IOException {
>>         demo_FSInputStreamBug((FSDirectory) dir, "test");
>>     }
>>
>>     private void demo_FSInputStreamBug(FSDirectory fsdir, String file)
>>     throws IOException
>>     {
>>         // Setup the test file - we need more than 1024 bytes
>>         OutputStream os = fsdir.createFile(file);
>>         for(int i=0; i<2000; i++) {
>>             os.writeByte((byte) i);
>>         }
>>         os.close();
>>
>>         InputStream in = fsdir.openFile(file);
>>
>>         // This read primes the buffer in InputStream
>>         byte b = in.readByte();
>>
>>         // Close the file
>>         in.close();
>>
>>         // ERROR: this call should fail, but succeeds because the  buffer
>>         // is still filled
>>         b = in.readByte();
>>
>>         // ERROR: this call should fail, but succeeds for some reason  
>> as well
>>         in.seek(1099);
>>
>>         try {
>>             // OK: this call correctly fails. We are now past the 
>> 1024  internal
>>             // buffer, so an actual IO is attempted, which fails
>>             b = in.readByte();
>>         } catch (IOException e) {
>>         }
>>     }
>>
>>
>>     static boolean isCSInputStream(InputStream is) {
>>         return is instanceof CompoundFileReader.CSInputStream;
>>     }
>>
>>     static boolean isCSInputStreamOpen(InputStream is) throws  
>> IOException {
>>         if (isCSInputStream(is)) {
>>             CompoundFileReader.CSInputStream cis =
>>             (CompoundFileReader.CSInputStream) is;
>>
>>             return _TestHelper.isFSInputStreamOpen(cis.base);
>>         } else {
>>             return false;
>>         }
>>     }
>>
>>
>>     public void testClonedStreamsClosing() throws IOException {
>>         Map streams = setUp_2();
>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>
>>         // basic clone
>>         InputStream expected = (InputStream)streams.get("f11");
>>         assertTrue(_TestHelper.isFSInputStreamOpen(expected));
>>
>>         InputStream one = cr.openFile("f11");
>>         assertTrue(isCSInputStreamOpen(one));
>>
>>         InputStream two = (InputStream) one.clone();
>>         assertTrue(isCSInputStreamOpen(two));
>>
>>         assertSameStreams("basic clone one", expected, one);
>>         expected.seek(0);
>>         assertSameStreams("basic clone two", expected, two);
>>
>>         // Now close the first stream
>>         one.close();
>>         assertTrue("Only close when cr is closed",  
>> isCSInputStreamOpen(one));
>>
>>         // The following should really fail since we couldn't expect to
>>         // access a file once close has been called on it (regardless  of
>>         // buffering and/or clone magic)
>>         expected.seek(0);
>>         two.seek(0);
>>         assertSameStreams("basic clone two/2", expected, two);
>>
>>
>>         // Now close the compound reader
>>         cr.close();
>>         assertFalse("Now closed one", isCSInputStreamOpen(one));
>>         assertFalse("Now closed two", isCSInputStreamOpen(two));
>>
>>         // The following may also fail since the compound stream is  
>> closed
>>         expected.seek(0);
>>         two.seek(0);
>>         //assertSameStreams("basic clone two/3", expected, two);
>>
>>
>>         // Now close the second clone
>>         two.close();
>>         expected.seek(0);
>>         two.seek(0);
>>         //assertSameStreams("basic clone two/4", expected, two);
>>
>>         closeUp(streams);
>>
>>     }
>>
>>
>>     /** This test opens two files from a compound stream and verifies  
>> that
>>      *  their file positions are independent of each other.
>>      */
>>     public void testRandomAccess() throws IOException {
>>         Map streams = setUp_2();
>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>
>>         // Open two files
>>         InputStream e1 = (InputStream)streams.get("f11");
>>         InputStream e2 = (InputStream)streams.get("f3");
>>
>>         InputStream a1 = cr.openFile("f11");
>>         InputStream a2 = cr.openFile("f3");
>>
>>         // Seek the first pair
>>         e1.seek(100);
>>         a1.seek(100);
>>         assertEquals(100, e1.getFilePointer());
>>         assertEquals(100, a1.getFilePointer());
>>         byte be1 = e1.readByte();
>>         byte ba1 = a1.readByte();
>>         assertEquals(be1, ba1);
>>
>>         // Now seek the second pair
>>         e2.seek(1027);
>>         a2.seek(1027);
>>         assertEquals(1027, e2.getFilePointer());
>>         assertEquals(1027, a2.getFilePointer());
>>         byte be2 = e2.readByte();
>>         byte ba2 = a2.readByte();
>>         assertEquals(be2, ba2);
>>
>>         // Now make sure the first one didn't move
>>         assertEquals(101, e1.getFilePointer());
>>         assertEquals(101, a1.getFilePointer());
>>         be1 = e1.readByte();
>>         ba1 = a1.readByte();
>>         assertEquals(be1, ba1);
>>
>>         // Now move the first one again, past the buffer length
>>         e1.seek(1910);
>>         a1.seek(1910);
>>         assertEquals(1910, e1.getFilePointer());
>>         assertEquals(1910, a1.getFilePointer());
>>         be1 = e1.readByte();
>>         ba1 = a1.readByte();
>>         assertEquals(be1, ba1);
>>
>>         // Now make sure the second set didn't move
>>         assertEquals(1028, e2.getFilePointer());
>>         assertEquals(1028, a2.getFilePointer());
>>         be2 = e2.readByte();
>>         ba2 = a2.readByte();
>>         assertEquals(be2, ba2);
>>
>>         // Move the second set back, again cross the buffer size
>>         e2.seek(17);
>>         a2.seek(17);
>>         assertEquals(17, e2.getFilePointer());
>>         assertEquals(17, a2.getFilePointer());
>>         be2 = e2.readByte();
>>         ba2 = a2.readByte();
>>         assertEquals(be2, ba2);
>>
>>         // Finally, make sure the first set didn't move
>>         assertEquals(1911, e1.getFilePointer());
>>         assertEquals(1911, a1.getFilePointer());
>>         be1 = e1.readByte();
>>         ba1 = a1.readByte();
>>         assertEquals(be1, ba1);
>>
>>         a1.close();
>>         a2.close();
>>         cr.close();
>>
>>         closeUp(streams);
>>     }
>>
>>     /** This test opens two files from a compound stream and verifies  
>> that
>>      *  their file positions are independent of each other.
>>      */
>>     public void testRandomAccessClones() throws IOException {
>>         Map streams = setUp_2();
>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>
>>         // Open two files
>>         InputStream e1 = cr.openFile("f11");
>>         InputStream e2 = cr.openFile("f3");
>>
>>         InputStream a1 = (InputStream) e1.clone();
>>         InputStream a2 = (InputStream) e2.clone();
>>
>>         // Seek the first pair
>>         e1.seek(100);
>>         a1.seek(100);
>>         assertEquals(100, e1.getFilePointer());
>>         assertEquals(100, a1.getFilePointer());
>>         byte be1 = e1.readByte();
>>         byte ba1 = a1.readByte();
>>         assertEquals(be1, ba1);
>>
>>         // Now seek the second pair
>>         e2.seek(1027);
>>         a2.seek(1027);
>>         assertEquals(1027, e2.getFilePointer());
>>         assertEquals(1027, a2.getFilePointer());
>>         byte be2 = e2.readByte();
>>         byte ba2 = a2.readByte();
>>         assertEquals(be2, ba2);
>>
>>         // Now make sure the first one didn't move
>>         assertEquals(101, e1.getFilePointer());
>>         assertEquals(101, a1.getFilePointer());
>>         be1 = e1.readByte();
>>         ba1 = a1.readByte();
>>         assertEquals(be1, ba1);
>>
>>         // Now move the first one again, past the buffer length
>>         e1.seek(1910);
>>         a1.seek(1910);
>>         assertEquals(1910, e1.getFilePointer());
>>         assertEquals(1910, a1.getFilePointer());
>>         be1 = e1.readByte();
>>         ba1 = a1.readByte();
>>         assertEquals(be1, ba1);
>>
>>         // Now make sure the second set didn't move
>>         assertEquals(1028, e2.getFilePointer());
>>         assertEquals(1028, a2.getFilePointer());
>>         be2 = e2.readByte();
>>         ba2 = a2.readByte();
>>         assertEquals(be2, ba2);
>>
>>         // Move the second set back, again cross the buffer size
>>         e2.seek(17);
>>         a2.seek(17);
>>         assertEquals(17, e2.getFilePointer());
>>         assertEquals(17, a2.getFilePointer());
>>         be2 = e2.readByte();
>>         ba2 = a2.readByte();
>>         assertEquals(be2, ba2);
>>
>>         // Finally, make sure the first set didn't move
>>         assertEquals(1911, e1.getFilePointer());
>>         assertEquals(1911, a1.getFilePointer());
>>         be1 = e1.readByte();
>>         ba1 = a1.readByte();
>>         assertEquals(be1, ba1);
>>
>>         e1.close();
>>         e2.close();
>>         a1.close();
>>         a2.close();
>>         cr.close();
>>
>>         closeUp(streams);
>>     }
>>
>>
>>     public void testFileNotFound() throws IOException {
>>         Map streams = setUp_2();
>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>
>>         // Open two files
>>         try {
>>             InputStream e1 = cr.openFile("bogus");
>>             fail("File not found");
>>
>>         } catch (IOException e) {
>>             /* success */
>>             //System.out.println("SUCCESS: File Not Found: " + e);
>>         }
>>
>>         cr.close();
>>
>>         closeUp(streams);
>>     }
>>
>>
>>     public void testReadPastEOF() throws IOException {
>>         Map streams = setUp_2();
>>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>>         InputStream is = cr.openFile("f2");
>>         is.seek(is.length() - 10);
>>         byte b[] = new byte[100];
>>         is.readBytes(b, 0, 10);
>>
>>         try {
>>             byte test = is.readByte();
>>             fail("Single byte read past end of file");
>>         } catch (IOException e) {
>>             /* success */
>>             //System.out.println("SUCCESS: single byte read past end  
>> of file: " + e);
>>         }
>>
>>         is.seek(is.length() - 10);
>>         try {
>>             is.readBytes(b, 0, 50);
>>             fail("Block read past end of file");
>>         } catch (IOException e) {
>>             /* success */
>>             //System.out.println("SUCCESS: block read past end of  
>> file: " + e);
>>         }
>>
>>         is.close();
>>         cr.close();
>>
>>         closeUp(streams);
>>     }
>> }
>>

-- 
*************************************************************
* Dr. Christoph Goller     Tel. : +49 89 203 45734          *
* Geschäftsführer          Email: goller@detego-software.de *
* Detego Software GmbH     Mail : Keuslinstr. 13,           *
*                                 80798 München, Germany    *
*************************************************************


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: optimized disk usage when creating a compound index

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Bernhard,

Impressive work.  In order to prevent this from being lost in e-mail,  
could you please create a new Bugzilla issue for each of your great  
patches and attach the differences as CVS patches (cvs diff -Nu)?

Many thanks for these contributions.

	Erik

On Aug 6, 2004, at 3:52 AM, Bernhard Messer wrote:

> hi developers,
>
> i made some measurements on lucene disk usage during index creation.  
> It's no surprise that during index creation,  within the index  
> optimization, more disk space is necessary than the final index size  
> will reach. What i didn't expect is such a high difference in disk  
> size usage, switching the compound file option on or off. Using the  
> compound file option, the disk usage during index creation is more  
> than 3 times higher than the final index size. This could be a pain in  
> the neck, running projects like nutch, where huge datasets will be  
> indexed. The grow rate relies on the fact that SegmentMerger creates  
> the fully compound file first, before deleting the original, unused  
> files.
> So i patched SegmentMerger and CompoundFileWriter classes in a way,  
> that they will delete the file immediatly after copying the data  
> within the compound. The result was, that we could reduce the  
> necessary disk space from factor 3 to 2.
> The change forces to make some modifications within the  
> TestCompoundFile class also. In several test methods the original file  
> was compared to it's compound part. Using the modified SegmentMerger  
> and CompoundFileWriter, the file was already deleted and couldn't be  
> opened.
>
> Here are some statistics about disk usage during index creation:
>
> compound option is off:
> final index size: 380 KB           max. diskspace used: 408 KB
> final index size: 11079 KB       max. diskspace used: 11381 KB
> final index size: 204148 KB      max. diskspace used: 20739 KB
>
> using compound index:
> final index size: 380 KB           max. diskspace used: 1145 KB
> final index size: 11079 KB       max. diskspace used: 33544 KB
> final index size: 204148 KB      max. diskspace used: 614977 KB
>
> using compound index with patch:
> final index size: 380 KB           max. diskspace used: 777 KB
> final index size: 11079 KB       max. diskspace used: 22464 KB
> final index size: 204148 KB      max. diskspace used: 410829
>
> The change was tested under windows and linux without any negativ side  
> effects. All JUnit test cases work fine. In the attachment you'll find  
> all the necessary files:
>
> SegmentMerger.java
> CompoundFileWriter.java
> TestCompoundFile.java
>
> SegmentMerger.diff
> CompoundFileWriter.diff
> TestCompoundFile.diff
>
> keep moving
> Bernhard
>
>
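
The change itself is small: while CompoundFileWriter.close() copies each
source file into the compound stream, it now deletes that source file as
soon as its bytes have been written, instead of leaving all deletions to
SegmentMerger after the complete .cfs file exists. A condensed sketch of
that copy loop, pieced together from the attached files (not the literal
patch; error handling omitted), looks like this:

    // Sketch only, condensed from the attached CompoundFileWriter.close().
    // Each registered file is appended to the compound stream and its
    // original is deleted immediately afterwards to free the disk space.
    byte buffer[] = new byte[1024];
    Iterator it = entries.iterator();
    while (it.hasNext()) {
        FileEntry fe = (FileEntry) it.next();
        fe.dataOffset = os.getFilePointer();   // remember where this file's data starts
        copyFile(fe, os, buffer);              // append the file's bytes to the compound stream
        directory.deleteFile(fe.file);         // delete the source right away
    }

Without the patch the peak holds roughly three copies of the data at once:
the source segments being merged, the newly written per-extension files,
and the growing .cfs file. Deleting each per-extension file as soon as it
has been copied keeps that third copy from accumulating, which matches the
drop from roughly 3x to 2x of the final index size in the measurements
quoted above.
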
> Index: src/java/org/apache/lucene/index/CompoundFileWriter.java
> ===================================================================
> RCS file:  
> /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/ 
> CompoundFileWriter.java,v
> retrieving revision 1.3
> diff -r1.3 CompoundFileWriter.java
> 163a164,166
>>
>>                 // immediately delete the copied file to save disk space
>>                 directory.deleteFile((String) fe.file);
> package org.apache.lucene.index;
>
> /**
>  * Copyright 2004 The Apache Software Foundation
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  *     http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
> implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
>
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.OutputStream;
> import org.apache.lucene.store.InputStream;
> import java.util.LinkedList;
> import java.util.HashSet;
> import java.util.Iterator;
> import java.io.IOException;
>
>
> /**
>  * Combines multiple files into a single compound file.
>  * The file format:<br>
>  * <ul>
>  *     <li>VInt fileCount</li>
>  *     <li>{Directory}
>  *         fileCount entries with the following structure:</li>
>  *         <ul>
>  *             <li>long dataOffset</li>
>  *             <li>UTFString extension</li>
>  *         </ul>
>  *     <li>{File Data}
>  *         fileCount entries with the raw data of the corresponding  
> file</li>
>  * </ul>
>  *
>  * The fileCount integer indicates how many files are contained in  
> this compound
>  * file. The {directory} that follows has that many entries. Each  
> directory entry
>  * contains an encoding identifier, an long pointer to the start of  
> this file's
>  * data section, and a UTF String with that file's extension.
>  *
>  * @author Dmitry Serebrennikov
>  * @version $Id: CompoundFileWriter.java,v 1.3 2004/03/29 22:48:02  
> cutting Exp $
>  */
> final class CompoundFileWriter {
>
>     private static final class FileEntry {
>         /** source file */
>         String file;
>
>         /** temporary holder for the start of directory entry for this  
> file */
>         long directoryOffset;
>
>         /** temporary holder for the start of this file's data section  
> */
>         long dataOffset;
>     }
>
>
>     private Directory directory;
>     private String fileName;
>     private HashSet ids;
>     private LinkedList entries;
>     private boolean merged = false;
>
>
>     /** Create the compound stream in the specified file. The file  
> name is the
>      *  entire name (no extensions are added).
>      */
>     public CompoundFileWriter(Directory dir, String name) {
>         if (dir == null)
>             throw new IllegalArgumentException("Missing directory");
>         if (name == null)
>             throw new IllegalArgumentException("Missing name");
>
>         directory = dir;
>         fileName = name;
>         ids = new HashSet();
>         entries = new LinkedList();
>     }
>
>     /** Returns the directory of the compound file. */
>     public Directory getDirectory() {
>         return directory;
>     }
>
>     /** Returns the name of the compound file. */
>     public String getName() {
>         return fileName;
>     }
>
>     /** Add a source stream. If sourceDir is null, it is set to the
>      *  same value as the directory where this compound stream exists.
>      *  The id is the string by which the sub-stream will be known in
> the
>      *  compound stream. The caller must ensure that the ID is unique.  
> If the
>      *  id is null, it is set to the name of the source file.
>      */
>     public void addFile(String file) {
>         if (merged)
>             throw new IllegalStateException(
>                 "Can't add extensions after merge has been called");
>
>         if (file == null)
>             throw new IllegalArgumentException(
>                 "Missing source file");
>
>         if (! ids.add(file))
>             throw new IllegalArgumentException(
>                 "File " + file + " already added");
>
>         FileEntry entry = new FileEntry();
>         entry.file = file;
>         entries.add(entry);
>     }
>
>     /** Merge files with the extensions added up to now.
>      *  All files with these extensions are combined sequentially into  
> the
>      *  compound stream. After successful merge, the source files
>      *  are deleted.
>      */
>     public void close() throws IOException {
>         if (merged)
>             throw new IllegalStateException(
>                 "Merge already performed");
>
>         if (entries.isEmpty())
>             throw new IllegalStateException(
>                 "No entries to merge have been defined");
>
>         merged = true;
>
>         // open the compound stream
>         OutputStream os = null;
>         try {
>             os = directory.createFile(fileName);
>
>             // Write the number of entries
>             os.writeVInt(entries.size());
>
>             // Write the directory with all offsets at 0.
>             // Remember the positions of directory entries so that we  
> can
>             // adjust the offsets later
>             Iterator it = entries.iterator();
>             while(it.hasNext()) {
>                 FileEntry fe = (FileEntry) it.next();
>                 fe.directoryOffset = os.getFilePointer();
>                 os.writeLong(0);    // for now
>                 os.writeString(fe.file);
>             }
>
>             // Open the files and copy their data into the stream.
>             // Remember the locations of each file's data section.
>             byte buffer[] = new byte[1024];
>             it = entries.iterator();
>             while(it.hasNext()) {
>                 FileEntry fe = (FileEntry) it.next();
>                 fe.dataOffset = os.getFilePointer();
>                 copyFile(fe, os, buffer);
>
>                 // immediately delete the copied file to save disk space
>                 directory.deleteFile((String) fe.file);
>             }
>
>             // Write the data offsets into the directory of the  
> compound stream
>             it = entries.iterator();
>             while(it.hasNext()) {
>                 FileEntry fe = (FileEntry) it.next();
>                 os.seek(fe.directoryOffset);
>                 os.writeLong(fe.dataOffset);
>             }
>
>             // Close the output stream. Set the os to null before  
> trying to
>             // close so that if an exception occurs during the close,  
> the
>             // finally clause below will not attempt to close the  
> stream
>             // the second time.
>             OutputStream tmp = os;
>             os = null;
>             tmp.close();
>
>         } finally {
>             if (os != null) try { os.close(); } catch (IOException e)  
> { }
>         }
>     }
>
>     /** Copy the contents of the file with specified extension into the
>      *  provided output stream. Use the provided buffer for moving data
>      *  to reduce memory allocation.
>      */
>     private void copyFile(FileEntry source, OutputStream os, byte  
> buffer[])
>     throws IOException
>     {
>         InputStream is = null;
>         try {
>             long startPtr = os.getFilePointer();
>
>             is = directory.openFile(source.file);
>             long length = is.length();
>             long remainder = length;
>             int chunk = buffer.length;
>
>             while(remainder > 0) {
>                 int len = (int) Math.min(chunk, remainder);
>                 is.readBytes(buffer, 0, len);
>                 os.writeBytes(buffer, len);
>                 remainder -= len;
>             }
>
>             // Verify that remainder is 0
>             if (remainder != 0)
>                 throw new IOException(
>                     "Non-zero remainder length after copying: " +  
> remainder
>                     + " (id: " + source.file + ", length: " + length
>                     + ", buffer size: " + chunk + ")");
>
>             // Verify that the output length diff is equal to original  
> file
>             long endPtr = os.getFilePointer();
>             long diff = endPtr - startPtr;
>             if (diff != length)
>                 throw new IOException(
>                     "Difference in the output file offsets " + diff
>                     + " does not match the original file length " +  
> length);
>
>         } finally {
>             if (is != null) is.close();
>         }
>     }
> }
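
The Javadoc above documents the .cfs layout: a VInt file count, then one
directory entry per file (a long data offset followed by the file name),
then the raw data sections back to back. As a quick illustration, the
directory portion can be walked with a few reads; the following is only a
minimal sketch using the org.apache.lucene.store API (the helper class and
its name are made up for illustration; CompoundFileReader is the real
implementation):

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.InputStream;

    // Hypothetical helper: prints the directory section of a compound file,
    // following the layout documented in the CompoundFileWriter Javadoc
    // (VInt count, then per entry a long data offset and the file name).
    class ListCompoundEntries {
        static void list(Directory dir, String cfsName) throws IOException {
            InputStream is = dir.openFile(cfsName);
            try {
                int count = is.readVInt();            // number of packed files
                for (int i = 0; i < count; i++) {
                    long dataOffset = is.readLong();  // where this file's bytes begin
                    String name = is.readString();    // the original file name
                    System.out.println(name + " starts at " + dataOffset);
                }
            } finally {
                is.close();
            }
        }
    }

An entry's length is not stored explicitly; CompoundFileReader derives it
from the next entry's offset (or from the total stream length for the last
entry).
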
> Index: src/java/org/apache/lucene/index/SegmentMerger.java
> ===================================================================
> RCS file:  
> /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/ 
> SegmentMerger.java,v
> retrieving revision 1.11
> diff -r1.11 SegmentMerger.java
> 151c151
> <     // Perform the merge
> ---
>>     // Perform the merge. Files will be deleted within  
>> CompoundFileWriter.close()
> 153,158c153
> <
> <     // Now delete the source files
> <     it = files.iterator();
> <     while (it.hasNext()) {
> <       directory.deleteFile((String) it.next());
> <     }
> ---
>>
> package org.apache.lucene.index;
>
> /**
>  * Copyright 2004 The Apache Software Foundation
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  *     http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
> implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
>
> import java.util.Vector;
> import java.util.ArrayList;
> import java.util.Iterator;
> import java.io.IOException;
>
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.OutputStream;
> import org.apache.lucene.store.RAMOutputStream;
>
> /**
>  * The SegmentMerger class combines two or more Segments, represented  
> by an IndexReader ({@link #add},
>  * into a single Segment.  After adding the appropriate readers, call  
> the merge method to combine the
>  * segments.
>  *<P>
>  * If the compoundFile flag is set, then the segments will be merged  
> into a compound file.
>  *
>  *
>  * @see #merge
>  * @see #add
>  */
> final class SegmentMerger {
>   private boolean useCompoundFile;
>   private Directory directory;
>   private String segment;
>
>   private Vector readers = new Vector();
>   private FieldInfos fieldInfos;
>
>   // File extensions of old-style index files
>   private static final String COMPOUND_EXTENSIONS[] = new String[] {
>     "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis"
>   };
>   private static final String VECTOR_EXTENSIONS[] = new String[] {
>     "tvx", "tvd", "tvf"
>   };
>
>   /**
>    *
>    * @param dir The Directory to merge the other segments into
>    * @param name The name of the new segment
>    * @param compoundFile true if the new segment should use a  
> compoundFile
>    */
>   SegmentMerger(Directory dir, String name, boolean compoundFile) {
>     directory = dir;
>     segment = name;
>     useCompoundFile = compoundFile;
>   }
>
>   /**
>    * Add an IndexReader to the collection of readers that are to be  
> merged
>    * @param reader
>    */
>   final void add(IndexReader reader) {
>     readers.addElement(reader);
>   }
>
>   /**
>    *
>    * @param i The index of the reader to return
>    * @return The ith reader to be merged
>    */
>   final IndexReader segmentReader(int i) {
>     return (IndexReader) readers.elementAt(i);
>   }
>
>   /**
>    * Merges the readers specified by the {@link #add} method into the  
> directory passed to the constructor
>    * @return The number of documents that were merged
>    * @throws IOException
>    */
>   final int merge() throws IOException {
>     int value;
>
>     value = mergeFields();
>     mergeTerms();
>     mergeNorms();
>
>     if (fieldInfos.hasVectors())
>       mergeVectors();
>
>     if (useCompoundFile)
>       createCompoundFile();
>
>     return value;
>   }
>
>   /**
>    * close all IndexReaders that have been added.
>    * Should not be called before merge().
>    * @throws IOException
>    */
>   final void closeReaders() throws IOException {
>     for (int i = 0; i < readers.size(); i++) {  // close readers
>       IndexReader reader = (IndexReader) readers.elementAt(i);
>       reader.close();
>     }
>   }
>
>   private final void createCompoundFile()
>           throws IOException {
>     CompoundFileWriter cfsWriter =
>             new CompoundFileWriter(directory, segment + ".cfs");
>
>     ArrayList files =
>       new ArrayList(COMPOUND_EXTENSIONS.length + fieldInfos.size());
>
>     // Basic files
>     for (int i = 0; i < COMPOUND_EXTENSIONS.length; i++) {
>       files.add(segment + "." + COMPOUND_EXTENSIONS[i]);
>     }
>
>     // Field norm files
>     for (int i = 0; i < fieldInfos.size(); i++) {
>       FieldInfo fi = fieldInfos.fieldInfo(i);
>       if (fi.isIndexed) {
>         files.add(segment + ".f" + i);
>       }
>     }
>
>     // Vector files
>     if (fieldInfos.hasVectors()) {
>       for (int i = 0; i < VECTOR_EXTENSIONS.length; i++) {
>         files.add(segment + "." + VECTOR_EXTENSIONS[i]);
>       }
>     }
>
>     // Now merge all added files
>     Iterator it = files.iterator();
>     while (it.hasNext()) {
>       cfsWriter.addFile((String) it.next());
>     }
>
>     // Perform the merge. Files will be deleted within  
> CompoundFileWriter.close()
>     cfsWriter.close();
>
>   }
>
>   /**
>    *
>    * @return The number of documents in all of the readers
>    * @throws IOException
>    */
>   private final int mergeFields() throws IOException {
>     fieldInfos = new FieldInfos();		  // merge field names
>     int docCount = 0;
>     for (int i = 0; i < readers.size(); i++) {
>       IndexReader reader = (IndexReader) readers.elementAt(i);
>       fieldInfos.addIndexed(reader.getIndexedFieldNames(true), true);
>       fieldInfos.addIndexed(reader.getIndexedFieldNames(false), false);
>       fieldInfos.add(reader.getFieldNames(false), false);
>     }
>     fieldInfos.write(directory, segment + ".fnm");
>
>     FieldsWriter fieldsWriter = // merge field values
>             new FieldsWriter(directory, segment, fieldInfos);
>     try {
>       for (int i = 0; i < readers.size(); i++) {
>         IndexReader reader = (IndexReader) readers.elementAt(i);
>         int maxDoc = reader.maxDoc();
>         for (int j = 0; j < maxDoc; j++)
>           if (!reader.isDeleted(j)) {               // skip deleted  
> docs
>             fieldsWriter.addDocument(reader.document(j));
>             docCount++;
>           }
>       }
>     } finally {
>       fieldsWriter.close();
>     }
>     return docCount;
>   }
>
>   /**
>    * Merge the TermVectors from each of the segments into the new one.
>    * @throws IOException
>    */
>   private final void mergeVectors() throws IOException {
>     TermVectorsWriter termVectorsWriter =
>       new TermVectorsWriter(directory, segment, fieldInfos);
>
>     try {
>       for (int r = 0; r < readers.size(); r++) {
>         IndexReader reader = (IndexReader) readers.elementAt(r);
>         int maxDoc = reader.maxDoc();
>         for (int docNum = 0; docNum < maxDoc; docNum++) {
>           // skip deleted docs
>           if (reader.isDeleted(docNum)) {
>             continue;
>           }
>           termVectorsWriter.openDocument();
>
>           // get all term vectors
>           TermFreqVector[] sourceTermVector =
>             reader.getTermFreqVectors(docNum);
>
>           if (sourceTermVector != null) {
>             for (int f = 0; f < sourceTermVector.length; f++) {
>               // translate field numbers
>               TermFreqVector termVector = sourceTermVector[f];
>               termVectorsWriter.openField(termVector.getField());
>               String [] terms = termVector.getTerms();
>               int [] freqs = termVector.getTermFrequencies();
>
>               for (int t = 0; t < terms.length; t++) {
>                 termVectorsWriter.addTerm(terms[t], freqs[t]);
>               }
>             }
>             termVectorsWriter.closeDocument();
>           }
>         }
>       }
>     } finally {
>       termVectorsWriter.close();
>     }
>   }
>
>   private OutputStream freqOutput = null;
>   private OutputStream proxOutput = null;
>   private TermInfosWriter termInfosWriter = null;
>   private int skipInterval;
>   private SegmentMergeQueue queue = null;
>
>   private final void mergeTerms() throws IOException {
>     try {
>       freqOutput = directory.createFile(segment + ".frq");
>       proxOutput = directory.createFile(segment + ".prx");
>       termInfosWriter =
>               new TermInfosWriter(directory, segment, fieldInfos);
>       skipInterval = termInfosWriter.skipInterval;
>       queue = new SegmentMergeQueue(readers.size());
>
>       mergeTermInfos();
>
>     } finally {
>       if (freqOutput != null) freqOutput.close();
>       if (proxOutput != null) proxOutput.close();
>       if (termInfosWriter != null) termInfosWriter.close();
>       if (queue != null) queue.close();
>     }
>   }
>
>   private final void mergeTermInfos() throws IOException {
>     int base = 0;
>     for (int i = 0; i < readers.size(); i++) {
>       IndexReader reader = (IndexReader) readers.elementAt(i);
>       TermEnum termEnum = reader.terms();
>       SegmentMergeInfo smi = new SegmentMergeInfo(base, termEnum,  
> reader);
>       base += reader.numDocs();
>       if (smi.next())
>         queue.put(smi);				  // initialize queue
>       else
>         smi.close();
>     }
>
>     SegmentMergeInfo[] match = new SegmentMergeInfo[readers.size()];
>
>     while (queue.size() > 0) {
>       int matchSize = 0;			  // pop matching terms
>       match[matchSize++] = (SegmentMergeInfo) queue.pop();
>       Term term = match[0].term;
>       SegmentMergeInfo top = (SegmentMergeInfo) queue.top();
>
>       while (top != null && term.compareTo(top.term) == 0) {
>         match[matchSize++] = (SegmentMergeInfo) queue.pop();
>         top = (SegmentMergeInfo) queue.top();
>       }
>
>       mergeTermInfo(match, matchSize);		  // add new TermInfo
>
>       while (matchSize > 0) {
>         SegmentMergeInfo smi = match[--matchSize];
>         if (smi.next())
>           queue.put(smi);			  // restore queue
>         else
>           smi.close();				  // done with a segment
>       }
>     }
>   }
>
>   private final TermInfo termInfo = new TermInfo(); // minimize consing
>
>   /** Merge one term found in one or more segments. The array  
> <code>smis</code>
>    *  contains segments that are positioned at the same term.  
> <code>N</code>
>    *  is the number of cells in the array actually occupied.
>    *
>    * @param smis array of segments
>    * @param n number of cells in the array actually occupied
>    */
>   private final void mergeTermInfo(SegmentMergeInfo[] smis, int n)
>           throws IOException {
>     long freqPointer = freqOutput.getFilePointer();
>     long proxPointer = proxOutput.getFilePointer();
>
>     int df = appendPostings(smis, n);		  // append posting data
>
>     long skipPointer = writeSkip();
>
>     if (df > 0) {
>       // add an entry to the dictionary with pointers to prox and freq  
> files
>       termInfo.set(df, freqPointer, proxPointer, (int) (skipPointer -  
> freqPointer));
>       termInfosWriter.add(smis[0].term, termInfo);
>     }
>   }
>
>   /** Process postings from multiple segments all positioned on the
>    *  same term. Writes out merged entries into freqOutput and
>    *  the proxOutput streams.
>    *
>    * @param smis array of segments
>    * @param n number of cells in the array actually occupied
>    * @return number of documents across all segments where this term  
> was found
>    */
>   private final int appendPostings(SegmentMergeInfo[] smis, int n)
>           throws IOException {
>     int lastDoc = 0;
>     int df = 0;					  // number of docs w/ term
>     resetSkip();
>     for (int i = 0; i < n; i++) {
>       SegmentMergeInfo smi = smis[i];
>       TermPositions postings = smi.postings;
>       int base = smi.base;
>       int[] docMap = smi.docMap;
>       postings.seek(smi.termEnum);
>       while (postings.next()) {
>         int doc = postings.doc();
>         if (docMap != null)
>           doc = docMap[doc];                      // map around  
> deletions
>         doc += base;                              // convert to merged  
> space
>
>         if (doc < lastDoc)
>           throw new IllegalStateException("docs out of order");
>
>         df++;
>
>         if ((df % skipInterval) == 0) {
>           bufferSkip(lastDoc);
>         }
>
>         int docCode = (doc - lastDoc) << 1;	  // use low bit to flag  
> freq=1
>         lastDoc = doc;
>
>         int freq = postings.freq();
>         if (freq == 1) {
>           freqOutput.writeVInt(docCode | 1);	  // write doc & freq=1
>         } else {
>           freqOutput.writeVInt(docCode);	  // write doc
>           freqOutput.writeVInt(freq);		  // write frequency in doc
>         }
>
>         int lastPosition = 0;			  // write position deltas
>         for (int j = 0; j < freq; j++) {
>           int position = postings.nextPosition();
>           proxOutput.writeVInt(position - lastPosition);
>           lastPosition = position;
>         }
>       }
>     }
>     return df;
>   }
>
>   private RAMOutputStream skipBuffer = new RAMOutputStream();
>   private int lastSkipDoc;
>   private long lastSkipFreqPointer;
>   private long lastSkipProxPointer;
>
>   private void resetSkip() throws IOException {
>     skipBuffer.reset();
>     lastSkipDoc = 0;
>     lastSkipFreqPointer = freqOutput.getFilePointer();
>     lastSkipProxPointer = proxOutput.getFilePointer();
>   }
>
>   private void bufferSkip(int doc) throws IOException {
>     long freqPointer = freqOutput.getFilePointer();
>     long proxPointer = proxOutput.getFilePointer();
>
>     skipBuffer.writeVInt(doc - lastSkipDoc);
>     skipBuffer.writeVInt((int) (freqPointer - lastSkipFreqPointer));
>     skipBuffer.writeVInt((int) (proxPointer - lastSkipProxPointer));
>
>     lastSkipDoc = doc;
>     lastSkipFreqPointer = freqPointer;
>     lastSkipProxPointer = proxPointer;
>   }
>
>   private long writeSkip() throws IOException {
>     long skipPointer = freqOutput.getFilePointer();
>     skipBuffer.writeTo(freqOutput);
>     return skipPointer;
>   }
>
>   private void mergeNorms() throws IOException {
>     for (int i = 0; i < fieldInfos.size(); i++) {
>       FieldInfo fi = fieldInfos.fieldInfo(i);
>       if (fi.isIndexed) {
>         OutputStream output = directory.createFile(segment + ".f" + i);
>         try {
>           for (int j = 0; j < readers.size(); j++) {
>             IndexReader reader = (IndexReader) readers.elementAt(j);
>             byte[] input = reader.norms(fi.name);
>             int maxDoc = reader.maxDoc();
>             for (int k = 0; k < maxDoc; k++) {
>               byte norm = input != null ? input[k] : (byte) 0;
>               if (!reader.isDeleted(k)) {
>                 output.writeByte(norm);
>               }
>             }
>           }
>         } finally {
>           output.close();
>         }
>       }
>     }
>   }
>
> }
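
As its Javadoc describes, SegmentMerger is fed IndexReaders for the
segments to combine and then merges them with a single merge() call; when
the compound-file flag is set, createCompoundFile() runs at the end of
merge(), and with this patch the per-extension files disappear as they are
copied into the .cfs. The class is package-private and normally driven
from inside IndexWriter, so the following is only an illustrative sketch
of the call sequence (segment name and reader variables are made up), not
application-level API:

    // Illustrative sketch of how SegmentMerger is driven (package-private,
    // normally invoked from IndexWriter rather than from application code).
    SegmentMerger merger = new SegmentMerger(directory, "newsegment", true); // true = build a .cfs
    merger.add(readerA);                   // IndexReaders over the segments to merge
    merger.add(readerB);
    int mergedDocs = merger.merge();       // writes newsegment.* and then newsegment.cfs
    merger.closeReaders();                 // close the source readers afterwards
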
> Index: src/test/org/apache/lucene/index/TestCompoundFile.java
> ===================================================================
> RCS file:  
> /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/index/ 
> TestCompoundFile.java,v
> retrieving revision 1.5
> diff -r1.5 TestCompoundFile.java
> 20a21,24
>> import java.util.Collection;
>> import java.util.HashMap;
>> import java.util.Iterator;
>> import java.util.Map;
> 197a202,204
>>
>>             InputStream expected = dir.openFile(name);
>>
> 203c210
> <             InputStream expected = dir.openFile(name);
> ---
>>
> 206a214
>>
> 220a229,231
>>         InputStream expected1 = dir.openFile("d1");
>>         InputStream expected2 = dir.openFile("d2");
>>
> 227c238
> <         InputStream expected = dir.openFile("d1");
> ---
>>
> 229,231c240,242
> <         assertSameStreams("d1", expected, actual);
> <         assertSameSeekBehavior("d1", expected, actual);
> <         expected.close();
> ---
>>         assertSameStreams("d1", expected1, actual);
>>         assertSameSeekBehavior("d1", expected1, actual);
>>         expected1.close();
> 234c245
> <         expected = dir.openFile("d2");
> ---
>>
> 236,238c247,249
> <         assertSameStreams("d2", expected, actual);
> <         assertSameSeekBehavior("d2", expected, actual);
> <         expected.close();
> ---
>>         assertSameStreams("d2", expected2, actual);
>>         assertSameSeekBehavior("d2", expected2, actual);
>>         expected2.close();
> 270,271d280
> <         // Now test
> <         CompoundFileWriter csw = new CompoundFileWriter(dir,  
> "test.cfs");
> 275a285,292
>>
>>         InputStream[] check = new InputStream[data.length];
>>         for (int i=0; i<data.length; i++) {
>>            check[i] = dir.openFile(segment + data[i]);
>>         }
>>
>>         // Now test
>>         CompoundFileWriter csw = new CompoundFileWriter(dir,  
>> "test.cfs");
> 283d299
> <             InputStream check = dir.openFile(segment + data[i]);
> 285,286c301,302
> <             assertSameStreams(data[i], check, test);
> <             assertSameSeekBehavior(data[i], check, test);
> ---
>>             assertSameStreams(data[i], check[i], test);
>>             assertSameSeekBehavior(data[i], check[i], test);
> 288c304
> <             check.close();
> ---
>>             check[i].close();
> 299c315,316
> <     private void setUp_2() throws IOException {
> ---
>>     private Map setUp_2() throws IOException {
>>     		Map streams = new HashMap(20);
> 303a321,322
>>
>>             streams.put("f" + i, dir.openFile("f" + i));
> 305a325,326
>>
>>         return streams;
> 308c329,336
> <
> ---
>>     private void closeUp(Map streams) throws IOException {
>>     	Iterator it = streams.values().iterator();
>>     	while (it.hasNext()) {
>>     		InputStream stream = (InputStream)it.next();
>>     		stream.close();
>>     	}
>>     }
>>
> 364c392
> <         setUp_2();
> ---
>>         Map streams = setUp_2();
> 368c396
> <         InputStream expected = dir.openFile("f11");
> ---
>>         InputStream expected = (InputStream)streams.get("f11");
> 410c438,439
> <         expected.close();
> ---
>>         closeUp(streams);
>>
> 418c447
> <         setUp_2();
> ---
>>         Map streams = setUp_2();
> 422,423c451,452
> <         InputStream e1 = dir.openFile("f11");
> <         InputStream e2 = dir.openFile("f3");
> ---
>>         InputStream e1 = (InputStream)streams.get("f11");
>>         InputStream e2 = (InputStream)streams.get("f3");
> 426c455
> <         InputStream a2 = dir.openFile("f3");
> ---
>>         InputStream a2 = cr.openFile("f3");
> 486,487d514
> <         e1.close();
> <         e2.close();
> 490a518,519
>>
>>         closeUp(streams);
> 497c526
> <         setUp_2();
> ---
>>         Map streams = setUp_2();
> 569a599,600
>>
>>         closeUp(streams);
> 574c605
> <         setUp_2();
> ---
>>         Map streams = setUp_2();
> 587a619,620
>>
>>         closeUp(streams);
> 592c625
> <         setUp_2();
> ---
>>         Map streams = setUp_2();
> 617a651,652
>>
>>         closeUp(streams);
> package org.apache.lucene.index;
>
> /**
>  * Copyright 2004 The Apache Software Foundation
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  *     http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  
> implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
>
> import java.io.IOException;
> import java.io.File;
> import java.util.Collection;
> import java.util.HashMap;
> import java.util.Iterator;
> import java.util.Map;
>
> import junit.framework.TestCase;
> import junit.framework.TestSuite;
> import junit.textui.TestRunner;
> import org.apache.lucene.store.OutputStream;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.InputStream;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.store._TestHelper;
>
>
> /**
>  * @author dmitrys@earthlink.net
>  * @version $Id: TestCompoundFile.java,v 1.5 2004/03/29 22:48:06  
> cutting Exp $
>  */
> public class TestCompoundFile extends TestCase
> {
>     /** Main for running test case by itself. */
>     public static void main(String args[]) {
>         TestRunner.run (new TestSuite(TestCompoundFile.class));
> //        TestRunner.run (new TestCompoundFile("testSingleFile"));
> //        TestRunner.run (new TestCompoundFile("testTwoFiles"));
> //        TestRunner.run (new TestCompoundFile("testRandomFiles"));
> //        TestRunner.run (new  
> TestCompoundFile("testClonedStreamsClosing"));
> //        TestRunner.run (new TestCompoundFile("testReadAfterClose"));
> //        TestRunner.run (new TestCompoundFile("testRandomAccess"));
> //        TestRunner.run (new  
> TestCompoundFile("testRandomAccessClones"));
> //        TestRunner.run (new TestCompoundFile("testFileNotFound"));
> //        TestRunner.run (new TestCompoundFile("testReadPastEOF"));
>
> //        TestRunner.run (new TestCompoundFile("testIWCreate"));
>
>     }
>
>
>     private Directory dir;
>
>
>     public void setUp() throws IOException {
>         //dir = new RAMDirectory();
>         dir = FSDirectory.getDirectory(new  
> File(System.getProperty("tempDir"), "testIndex"), true);
>     }
>
>
>     /** Creates a file of the specified size with random data. */
>     private void createRandomFile(Directory dir, String name, int size)
>     throws IOException
>     {
>         OutputStream os = dir.createFile(name);
>         for (int i=0; i<size; i++) {
>             byte b = (byte) (Math.random() * 256);
>             os.writeByte(b);
>         }
>         os.close();
>     }
>
>     /** Creates a file of the specified size with sequential data. The  
> first
>      *  byte is written as the start byte provided. All subsequent  
> bytes are
>      *  computed as start + offset where offset is the number of the  
> byte.
>      */
>     private void createSequenceFile(Directory dir,
>                                     String name,
>                                     byte start,
>                                     int size)
>     throws IOException
>     {
>         OutputStream os = dir.createFile(name);
>         for (int i=0; i < size; i++) {
>             os.writeByte(start);
>             start ++;
>         }
>         os.close();
>     }
>
>
>     private void assertSameStreams(String msg,
>                                    InputStream expected,
>                                    InputStream test)
>     throws IOException
>     {
>         assertNotNull(msg + " null expected", expected);
>         assertNotNull(msg + " null test", test);
>         assertEquals(msg + " length", expected.length(),  
> test.length());
>         assertEquals(msg + " position", expected.getFilePointer(),
>                                         test.getFilePointer());
>
>         byte expectedBuffer[] = new byte[512];
>         byte testBuffer[] = new byte[expectedBuffer.length];
>
>         long remainder = expected.length() - expected.getFilePointer();
>         while(remainder > 0) {
>             int readLen = (int) Math.min(remainder,  
> expectedBuffer.length);
>             expected.readBytes(expectedBuffer, 0, readLen);
>             test.readBytes(testBuffer, 0, readLen);
>             assertEqualArrays(msg + ", remainder " + remainder,  
> expectedBuffer,
>                 testBuffer, 0, readLen);
>             remainder -= readLen;
>         }
>     }
>
>
>     private void assertSameStreams(String msg,
>                                    InputStream expected,
>                                    InputStream actual,
>                                    long seekTo)
>     throws IOException
>     {
>         if(seekTo >= 0 && seekTo < expected.length())
>         {
>             expected.seek(seekTo);
>             actual.seek(seekTo);
>             assertSameStreams(msg + ", seek(mid)", expected, actual);
>         }
>     }
>
>
>
>     private void assertSameSeekBehavior(String msg,
>                                         InputStream expected,
>                                         InputStream actual)
>     throws IOException
>     {
>         // seek to 0
>         long point = 0;
>         assertSameStreams(msg + ", seek(0)", expected, actual, point);
>
>         // seek to middle
>         point = expected.length() / 2l;
>         assertSameStreams(msg + ", seek(mid)", expected, actual,  
> point);
>
>         // seek to end - 2
>         point = expected.length() - 2;
>         assertSameStreams(msg + ", seek(end-2)", expected, actual,  
> point);
>
>         // seek to end - 1
>         point = expected.length() - 1;
>         assertSameStreams(msg + ", seek(end-1)", expected, actual,  
> point);
>
>         // seek to the end
>         point = expected.length();
>         assertSameStreams(msg + ", seek(end)", expected, actual,  
> point);
>
>         // seek past end
>         point = expected.length() + 1;
>         assertSameStreams(msg + ", seek(end+1)", expected, actual,  
> point);
>     }
>
>
>     private void assertEqualArrays(String msg,
>                                    byte[] expected,
>                                    byte[] test,
>                                    int start,
>                                    int len)
>     {
>         assertNotNull(msg + " null expected", expected);
>         assertNotNull(msg + " null test", test);
>
>         for (int i=start; i<len; i++) {
>             assertEquals(msg + " " + i, expected[i], test[i]);
>         }
>     }
>
>
>     // ===========================================================
>     //  Tests of the basic CompoundFile functionality
>     // ===========================================================
>
>
>     /** This test creates compound file based on a single file.
>      *  Files of different sizes are tested: 0, 1, 10, 100 bytes.
>      */
>     public void testSingleFile() throws IOException {
>         int data[] = new int[] { 0, 1, 10, 100 };
>         for (int i=0; i<data.length; i++) {
>             String name = "t" + data[i];
>             createSequenceFile(dir, name, (byte) 0, data[i]);
>
>             InputStream expected = dir.openFile(name);
>
>             CompoundFileWriter csw = new CompoundFileWriter(dir, name  
> + ".cfs");
>             csw.addFile(name);
>             csw.close();
>
>             CompoundFileReader csr = new CompoundFileReader(dir, name  
> + ".cfs");
>
>             InputStream actual = csr.openFile(name);
>             assertSameStreams(name, expected, actual);
>             assertSameSeekBehavior(name, expected, actual);
>
>             expected.close();
>             actual.close();
>             csr.close();
>         }
>     }
>
>
>     /** This test creates compound file based on two files.
>      *
>      */
>     public void testTwoFiles() throws IOException {
>         createSequenceFile(dir, "d1", (byte) 0, 15);
>         createSequenceFile(dir, "d2", (byte) 0, 114);
>
>         InputStream expected1 = dir.openFile("d1");
>         InputStream expected2 = dir.openFile("d2");
>
>         CompoundFileWriter csw = new CompoundFileWriter(dir, "d.csf");
>         csw.addFile("d1");
>         csw.addFile("d2");
>         csw.close();
>
>         CompoundFileReader csr = new CompoundFileReader(dir, "d.csf");
>
>         InputStream actual = csr.openFile("d1");
>         assertSameStreams("d1", expected1, actual);
>         assertSameSeekBehavior("d1", expected1, actual);
>         expected1.close();
>         actual.close();
>
>
>         actual = csr.openFile("d2");
>         assertSameStreams("d2", expected2, actual);
>         assertSameSeekBehavior("d2", expected2, actual);
>         expected2.close();
>         actual.close();
>         csr.close();
>     }
>
>     /** This test creates a compound file based on a large number of  
> files of
>      *  various length. The file content is generated randomly. The  
> sizes range
>      *  from 0 to 1Mb. Some of the sizes are selected to test the  
> buffering
>      *  logic in the file reading code. For this the chunk variable is  
> set to
>      *  the length of the buffer used internally by the compound file  
> logic.
>      */
>     public void testRandomFiles() throws IOException {
>         // Setup the test segment
>         String segment = "test";
>         int chunk = 1024; // internal buffer size used by the stream
>         createRandomFile(dir, segment + ".zero", 0);
>         createRandomFile(dir, segment + ".one", 1);
>         createRandomFile(dir, segment + ".ten", 10);
>         createRandomFile(dir, segment + ".hundred", 100);
>         createRandomFile(dir, segment + ".big1", chunk);
>         createRandomFile(dir, segment + ".big2", chunk - 1);
>         createRandomFile(dir, segment + ".big3", chunk + 1);
>         createRandomFile(dir, segment + ".big4", 3 * chunk);
>         createRandomFile(dir, segment + ".big5", 3 * chunk - 1);
>         createRandomFile(dir, segment + ".big6", 3 * chunk + 1);
>         createRandomFile(dir, segment + ".big7", 1000 * chunk);
>
>         // Setup extraneous files
>         createRandomFile(dir, "onetwothree", 100);
>         createRandomFile(dir, segment + ".notIn", 50);
>         createRandomFile(dir, segment + ".notIn2", 51);
>
>         final String data[] = new String[] {
>             ".zero", ".one", ".ten", ".hundred", ".big1", ".big2",  
> ".big3",
>             ".big4", ".big5", ".big6", ".big7"
>         };
>
>         InputStream[] check = new InputStream[data.length];
>         for (int i=0; i<data.length; i++) {
>            check[i] = dir.openFile(segment + data[i]);
>         }
>
>         // Now test
>         CompoundFileWriter csw = new CompoundFileWriter(dir,  
> "test.cfs");
>         for (int i=0; i<data.length; i++) {
>             csw.addFile(segment + data[i]);
>         }
>         csw.close();
>
>         CompoundFileReader csr = new CompoundFileReader(dir, "test.cfs");
>         for (int i=0; i<data.length; i++) {
>             InputStream test = csr.openFile(segment + data[i]);
>             assertSameStreams(data[i], check[i], test);
>             assertSameSeekBehavior(data[i], check[i], test);
>             test.close();
>             check[i].close();
>         }
>         csr.close();
>     }
>
>
>     /** Set up a larger compound file with a number of components, each of
>      *  which is a sequential file (so that we can easily tell that we are
>      *  reading in the right byte). The method sets up 20 files - f0 to f19;
>      *  the size of each file is 2000 bytes.
>      */
>     private Map setUp_2() throws IOException {
>         Map streams = new HashMap(20);
>         CompoundFileWriter cw = new CompoundFileWriter(dir, "f.comp");
>         for (int i=0; i<20; i++) {
>             createSequenceFile(dir, "f" + i, (byte) 0, 2000);
>             cw.addFile("f" + i);
>
>             streams.put("f" + i, dir.openFile("f" + i));
>         }
>         cw.close();
>
>         return streams;
>     }
>
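>     /** Closes all streams held in the given map (as produced by setUp_2). */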
>     private void closeUp(Map streams) throws IOException {
>         Iterator it = streams.values().iterator();
>         while (it.hasNext()) {
>             InputStream stream = (InputStream) it.next();
>             stream.close();
>         }
>     }
>
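>     /** Exposes the buffering quirk shown in demo_FSInputStreamBug: reads from a
>      *  closed FSInputStream still succeed while they are served from the
>      *  internal buffer.
>      */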
>     public void testReadAfterClose() throws IOException {
>         demo_FSInputStreamBug((FSDirectory) dir, "test");
>     }
>
>     private void demo_FSInputStreamBug(FSDirectory fsdir, String file)
>     throws IOException
>     {
>         // Setup the test file - we need more than 1024 bytes
>         OutputStream os = fsdir.createFile(file);
>         for(int i=0; i<2000; i++) {
>             os.writeByte((byte) i);
>         }
>         os.close();
>
>         InputStream in = fsdir.openFile(file);
>
>         // This read primes the buffer in InputStream
>         byte b = in.readByte();
>
>         // Close the file
>         in.close();
>
>         // ERROR: this call should fail, but succeeds because the buffer
>         // is still filled
>         b = in.readByte();
>
>         // ERROR: this call should fail, but succeeds for some reason as well
>         in.seek(1099);
>
>         try {
>             // OK: this call correctly fails. We are now past the 1024-byte internal
>             // buffer, so an actual IO is attempted, which fails
>             b = in.readByte();
>         } catch (IOException e) {
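>             /* expected failure */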
>         }
>     }
>
>
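>     /** Returns true if the given stream was produced by a CompoundFileReader. */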
>     static boolean isCSInputStream(InputStream is) {
>         return is instanceof CompoundFileReader.CSInputStream;
>     }
>
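>     /** Returns true if the FSInputStream underlying the given CSInputStream is
>      *  still open; returns false for streams that are not CSInputStreams.
>      */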
>     static boolean isCSInputStreamOpen(InputStream is) throws IOException {
>         if (isCSInputStream(is)) {
>             CompoundFileReader.CSInputStream cis =
>             (CompoundFileReader.CSInputStream) is;
>
>             return _TestHelper.isFSInputStreamOpen(cis.base);
>         } else {
>             return false;
>         }
>     }
>
>
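>     /** Verifies that closing a stream obtained from the compound reader leaves
>      *  its clones usable, and that the underlying stream is only closed once
>      *  the CompoundFileReader itself is closed.
>      */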
>     public void testClonedStreamsClosing() throws IOException {
>         Map streams = setUp_2();
>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>
>         // basic clone
>         InputStream expected = (InputStream)streams.get("f11");
>         assertTrue(_TestHelper.isFSInputStreamOpen(expected));
>
>         InputStream one = cr.openFile("f11");
>         assertTrue(isCSInputStreamOpen(one));
>
>         InputStream two = (InputStream) one.clone();
>         assertTrue(isCSInputStreamOpen(two));
>
>         assertSameStreams("basic clone one", expected, one);
>         expected.seek(0);
>         assertSameStreams("basic clone two", expected, two);
>
>         // Now close the first stream
>         one.close();
>         assertTrue("Only close when cr is closed", isCSInputStreamOpen(one));
>
>         // The following should really fail, since we should not be able to
>         // access a file once close has been called on it (regardless of
>         // buffering and/or clone magic)
>         expected.seek(0);
>         two.seek(0);
>         assertSameStreams("basic clone two/2", expected, two);
>
>
>         // Now close the compound reader
>         cr.close();
>         assertFalse("Now closed one", isCSInputStreamOpen(one));
>         assertFalse("Now closed two", isCSInputStreamOpen(two));
>
>         // The following may also fail since the compound stream is closed
>         expected.seek(0);
>         two.seek(0);
>         //assertSameStreams("basic clone two/3", expected, two);
>
>
>         // Now close the second clone
>         two.close();
>         expected.seek(0);
>         two.seek(0);
>         //assertSameStreams("basic clone two/4", expected, two);
>
>         closeUp(streams);
>
>     }
>
>
>     /** This test opens two files from a compound stream and verifies that
>      *  their file positions are independent of each other.
>      */
>     public void testRandomAccess() throws IOException {
>         Map streams = setUp_2();
>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>
>         // Open two files
>         InputStream e1 = (InputStream)streams.get("f11");
>         InputStream e2 = (InputStream)streams.get("f3");
>
>         InputStream a1 = cr.openFile("f11");
>         InputStream a2 = cr.openFile("f3");
>
>         // Seek the first pair
>         e1.seek(100);
>         a1.seek(100);
>         assertEquals(100, e1.getFilePointer());
>         assertEquals(100, a1.getFilePointer());
>         byte be1 = e1.readByte();
>         byte ba1 = a1.readByte();
>         assertEquals(be1, ba1);
>
>         // Now seek the second pair
>         e2.seek(1027);
>         a2.seek(1027);
>         assertEquals(1027, e2.getFilePointer());
>         assertEquals(1027, a2.getFilePointer());
>         byte be2 = e2.readByte();
>         byte ba2 = a2.readByte();
>         assertEquals(be2, ba2);
>
>         // Now make sure the first one didn't move
>         assertEquals(101, e1.getFilePointer());
>         assertEquals(101, a1.getFilePointer());
>         be1 = e1.readByte();
>         ba1 = a1.readByte();
>         assertEquals(be1, ba1);
>
>         // Now move the first one again, past the buffer length
>         e1.seek(1910);
>         a1.seek(1910);
>         assertEquals(1910, e1.getFilePointer());
>         assertEquals(1910, a1.getFilePointer());
>         be1 = e1.readByte();
>         ba1 = a1.readByte();
>         assertEquals(be1, ba1);
>
>         // Now make sure the second set didn't move
>         assertEquals(1028, e2.getFilePointer());
>         assertEquals(1028, a2.getFilePointer());
>         be2 = e2.readByte();
>         ba2 = a2.readByte();
>         assertEquals(be2, ba2);
>
>         // Move the second set back, again crossing the buffer size
>         e2.seek(17);
>         a2.seek(17);
>         assertEquals(17, e2.getFilePointer());
>         assertEquals(17, a2.getFilePointer());
>         be2 = e2.readByte();
>         ba2 = a2.readByte();
>         assertEquals(be2, ba2);
>
>         // Finally, make sure the first set didn't move
>         assertEquals(1911, e1.getFilePointer());
>         assertEquals(1911, a1.getFilePointer());
>         be1 = e1.readByte();
>         ba1 = a1.readByte();
>         assertEquals(be1, ba1);
>
>         a1.close();
>         a2.close();
>         cr.close();
>
>         closeUp(streams);
>     }
>
>     /** This test opens two files from a compound stream, clones them, and
>      *  verifies that the file positions of originals and clones are
>      *  independent of each other.
>      */
>     public void testRandomAccessClones() throws IOException {
>         Map streams = setUp_2();
>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>
>         // Open two files
>         InputStream e1 = cr.openFile("f11");
>         InputStream e2 = cr.openFile("f3");
>
>         InputStream a1 = (InputStream) e1.clone();
>         InputStream a2 = (InputStream) e2.clone();
>
>         // Seek the first pair
>         e1.seek(100);
>         a1.seek(100);
>         assertEquals(100, e1.getFilePointer());
>         assertEquals(100, a1.getFilePointer());
>         byte be1 = e1.readByte();
>         byte ba1 = a1.readByte();
>         assertEquals(be1, ba1);
>
>         // Now seek the second pair
>         e2.seek(1027);
>         a2.seek(1027);
>         assertEquals(1027, e2.getFilePointer());
>         assertEquals(1027, a2.getFilePointer());
>         byte be2 = e2.readByte();
>         byte ba2 = a2.readByte();
>         assertEquals(be2, ba2);
>
>         // Now make sure the first one didn't move
>         assertEquals(101, e1.getFilePointer());
>         assertEquals(101, a1.getFilePointer());
>         be1 = e1.readByte();
>         ba1 = a1.readByte();
>         assertEquals(be1, ba1);
>
>         // Now move the first one again, past the buffer length
>         e1.seek(1910);
>         a1.seek(1910);
>         assertEquals(1910, e1.getFilePointer());
>         assertEquals(1910, a1.getFilePointer());
>         be1 = e1.readByte();
>         ba1 = a1.readByte();
>         assertEquals(be1, ba1);
>
>         // Now make sure the second set didn't move
>         assertEquals(1028, e2.getFilePointer());
>         assertEquals(1028, a2.getFilePointer());
>         be2 = e2.readByte();
>         ba2 = a2.readByte();
>         assertEquals(be2, ba2);
>
>         // Move the second set back, again crossing the buffer size
>         e2.seek(17);
>         a2.seek(17);
>         assertEquals(17, e2.getFilePointer());
>         assertEquals(17, a2.getFilePointer());
>         be2 = e2.readByte();
>         ba2 = a2.readByte();
>         assertEquals(be2, ba2);
>
>         // Finally, make sure the first set didn't move
>         assertEquals(1911, e1.getFilePointer());
>         assertEquals(1911, a1.getFilePointer());
>         be1 = e1.readByte();
>         ba1 = a1.readByte();
>         assertEquals(be1, ba1);
>
>         e1.close();
>         e2.close();
>         a1.close();
>         a2.close();
>         cr.close();
>
>         closeUp(streams);
>     }
>
>
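>     /** Verifies that opening a file that is not part of the compound file
>      *  throws an IOException.
>      */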
>     public void testFileNotFound() throws IOException {
>         Map streams = setUp_2();
>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>
>         // Try to open a file that is not part of the compound file
>         try {
>             InputStream e1 = cr.openFile("bogus");
>             fail("File not found");
>
>         } catch (IOException e) {
>             /* success */
>             //System.out.println("SUCCESS: File Not Found: " + e);
>         }
>
>         cr.close();
>
>         closeUp(streams);
>     }
>
>
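>     /** Verifies that single-byte and block reads past the end of a compound
>      *  file entry throw an IOException.
>      */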
>     public void testReadPastEOF() throws IOException {
>         Map streams = setUp_2();
>         CompoundFileReader cr = new CompoundFileReader(dir, "f.comp");
>         InputStream is = cr.openFile("f2");
>         is.seek(is.length() - 10);
>         byte b[] = new byte[100];
>         is.readBytes(b, 0, 10);
>
>         try {
>             byte test = is.readByte();
>             fail("Single byte read past end of file");
>         } catch (IOException e) {
>             /* success */
>             //System.out.println("SUCCESS: single byte read past end of file: " + e);
>         }
>
>         is.seek(is.length() - 10);
>         try {
>             is.readBytes(b, 0, 50);
>             fail("Block read past end of file");
>         } catch (IOException e) {
>             /* success */
>             //System.out.println("SUCCESS: block read past end of file: " + e);
>         }
>
>         is.close();
>         cr.close();
>
>         closeUp(streams);
>     }
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org