You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Christoph Goller <go...@detego-software.de> on 2003/10/16 21:02:19 UTC

CompoundFileReader

Dear Dmitry,

I finally found time to look into your compound file implementation and
to try it out. It works and I like it. However, I wondered, why you are
using clones of the base input stream in CompoundFileReader.CSInputStream.
The "problem" (actually it is not a real problem) I see is that you are
double buffering your input. Note that the clone has a buffer of 1024 bytes and
that CompoundFileReader.CSInputStream, which extends InputStream, also has
such a buffer. I think in the way you implemented it your input gets copied
into the base buffer and from there into the CompoundFileReader.CSInputStream
buffer before it is actually used. Please have a look at my patch. At first
one might think that my implementation is simpler but less efficient. This
is not the case. Actually it is even a little bit more efficient. Points to
think about:

*) stream is private to CompoundFileReader, so nobody else can use it, if not
explicitly granted access. CompoundFileReader uses stream only in its
constructor and when calling openFile.

*) Therefore, CompoundFileReader.CSInputStream can share the same instance of
stream if they take care of synchronization and seek.

*) An InputStream not necessarily has to implement seekInternal if readInternal
takes care of seeking on the real stream (see FSInputStream).

*) Synchronization of CompoundFileReader.CSInputStream.readInternal on base
does no harm since it is called only when the real file has to be accessed
and thi is synchronized anyway. However, this synchronizytion is necessary!

I tested my patch with your TestCompoundFile and everything seems fine.
I also tried it on one of my indices and searched on this index. It also
works. My impression from my tests is that my patch is a little bit faster
than your original version but there seems to be not much difference in
efficiency.

Christoph

Re: CompoundFileReader

Posted by Christoph Goller <go...@detego-software.de>.
I am not too concerned about the index bound checks. They are an additional
safety and I don´t think they matter in terms of efficiency. The only point
is that I forgot them in my patch and without them the tests fail. So if you
decide to apply my patch, the checks have to be added. If you give your ok,
I can do the commit too.

Christoph

Dmitry Serebrennikov schrieb:
> I put those in mostly to assure myself that I got things right. I think 
> the key question is whether it possible to read part of another file. If 
> not, I think that's fine. If yes, I think that's a problem.
> 
> Dmitry.
> 
> Christoph Goller wrote:
> 
>> Hi Dmitry,
>>
>> Now I tried all test cases. They all work except for Russian 
>> analyser/stemmer
>> and occational fails of TestIndexReader (the timestamp problem). So I 
>> think
>> it should be ok as far as CoumpoundFile is concerned. Off course we still
>> have to find a good solution for the timestamp problem.
>>
>> However,I stumbled over a problem that I had missed last time. 
>> TestCompoundFile
>> only succedds with your index bound tests in CSInputStream.seekInternal.
>> On Thursday I had deleted them after trying your test cases because 
>> the other
>> implementations don´t do these tests either. I did not go too deep 
>> into your
>> tests, but do you think the bahaviour of throwing an exception if the 
>> seek
>> index is out of bound is required? Its not part of the contract of the 
>> other
>> implementations of InputStream. Maybe I am missing something here.
>>
>> Dmitry Serebrennikov schrieb:
>>
>>> Dear Christoph,
>>>
>>> Sounds like an excellent enhancement. From a quick look, it appears 
>>> that you are right and everything should work just fine but use less 
>>> memory. One question: have you tried the other test cases also or 
>>> just the TestCompoundFile. There are quite a few conditions that 
>>> TestCompoundFile does not cover.
>>>
>>> At first I thought that the synchronization around readBytes would 
>>> cause too much performance degradation when a lot of concurrent 
>>> queries were executing. But after I looked at it some more, I 
>>> convinced myself that it should be ok. Have you ran any 
>>> multi-threaded tests / benchmarks? I think it might also be a good 
>>> idea before making this change.
>>>
>>> Christoph, do you think it is possible to just call readInternal on 
>>> the base stream instead of the readBytes? The main difference is that 
>>> we would bypass the buffering in the base stream. I think the 
>>> buffering done by the superclass of the CSInputStream would be quite 
>>> enough (which is your point to begin with, right)? Perhaps it would 
>>> be worthwhile to make InputStream.readInternal() public instead of 
>>> protected?
>>
>>
>>
>> In CSInputStream.readInternal I call:
>>
>> synchronized (base) {
>>   base.seek(fileOffset + getFilePointer());
>>   base.readBytes(b, offset, len);
>> }
>>
>> Calling base.seek does nothing more than setting the file pointer
>> (bufferStart + bufferPosition) of base correctly.
>>
>> base.readBytes(b, offset, len) in this case does not use the buffer of
>> base (at least in most cases). Look into InputStream.readBytes.
>> If len >= BUFFER_SIZE the base buffer is skipped and the buffer b is
>> used directly.
>>
>> I think synchronized in our case does not much more than synchronizing
>> on the actual file in FSInputStream.readInternal.
>>
>> Christoph
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>>
>>
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 

-- 
*************************************************************
* Dr. Christoph Goller     Tel. : +49 89 203 45734          *
* Managing Director        Email: goller@detego-software.de *
* Detego Software GmbH     Mail : Keuslinstr. 13,           *
*                                 80798 München, Germany    *
*************************************************************


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: CompoundFileReader

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
+1
Looks like a great optimization.
Dmitry.


Christoph Goller wrote:

> Dmitry Serebrennikov schrieb:
>
>> I put those in mostly to assure myself that I got things right. I 
>> think the key question is whether it possible to read part of another 
>> file. If not, I think that's fine. If yes, I think that's a problem.
>>
>> Dmitry.
>>
>
> Hi Dmitry,
>
> I have been using my patch for a couple of weaks now and I think it is 
> ok.
> I kept the index bound checks, so that the original tests go through.
> I would like to commit these changes. Did I answer all your questions
> in my last email? Do you have any objections against commiting the patch?
>
> Christoph
>
> PS: patch is attached again.
>
>------------------------------------------------------------------------
>
>Index: CompoundFileReader.java
>===================================================================
>RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/CompoundFileReader.java,v
>retrieving revision 1.2
>diff -u -r1.2 CompoundFileReader.java
>--- CompoundFileReader.java	13 Oct 2003 14:18:04 -0000	1.2
>+++ CompoundFileReader.java	19 Nov 2003 13:06:40 -0000
>@@ -240,10 +240,9 @@
>                       final long length)
>         throws IOException
>         {
>-            this.base = (InputStream) base.clone();
>+            this.base = base;
>             this.fileOffset = fileOffset;
>             this.length = length;   // variable in the superclass
>-            seekInternal(0);        // position to the adjusted 0th byte
>         }
> 
>         /** Expert: implements buffer refill.  Reads bytes from the current
>@@ -255,7 +254,10 @@
>         protected void readInternal(byte[] b, int offset, int len)
>         throws IOException
>         {
>-            base.readBytes(b, offset, len);
>+            synchronized (base) {
>+              base.seek(fileOffset + getFilePointer());
>+              base.readBytes(b, offset, len);
>+            }
>         }
> 
>         /** Expert: implements seek.  Sets current position in this file, where
>@@ -269,29 +271,11 @@
> 
>             if (pos < 0)
>                 throw new IOException("Seek to a negative offset");
>-
>-            base.seek(fileOffset + pos);
>         }
> 
>         /** Closes the stream to futher operations. */
>         public void close() throws IOException
>-        {
>-            base.close();
>-        }
>+        {}
> 
>-        /** Returns a clone of this stream.
>-         *
>-         * <p>Clones of a stream access the same data, and are positioned at the same
>-         * point as the stream they were cloned from.
>-         *
>-         * <p>Expert: Subclasses must ensure that clones may be positioned at
>-         * different points in the input from each other and from the stream they
>-         * were cloned from.
>-         */
>-        public Object clone() {
>-            CSInputStream other = (CSInputStream) super.clone();
>-            other.base = (InputStream) base.clone();
>-            return other;
>-        }
>     }
> }
>
>  
>
>------------------------------------------------------------------------
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: CompoundFileReader

Posted by Doug Cutting <cu...@lucene.com>.
Christoph Goller wrote:
> I would like to commit these changes.

+1

Thanks,

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: CompoundFileReader

Posted by Christoph Goller <go...@detego-software.de>.
Otis Gospodnetic schrieb:
> Hello Christoph,
> 
> I don't know enough to judge all changes you made with confidence, and
> don't have the code on this machine to apply the patch and look at it
> properly.  However, it looks like seekInternal(long) should now be left
> with only the conditionals that can cause IOException, and nothing
> else.  Why do we need to keep the body of the method then?

I discussed that already with Dmitry and he was not very passionate
regarding these conditionals. Finally I kept them since they were required
by Dmitry's test. But throwing an exception on seek is
not a behavior implemented by the other subclasses of InputStream.
I decided to remove the conditionals and slightly changed the unit test.

If more index checks are desirable, I would put them into InputStream.
Despite that I added a "read past EOF" check to CSInputStream.readInternal
to guarantee the same behavior as FSInputStream.

> 
> I also see getFilePointer() call, but can't tell where that method is. 
> I'll assume it's from the parent InputStream.

yes.






---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: CompoundFileReader

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello Christoph,

I don't know enough to judge all changes you made with confidence, and
don't have the code on this machine to apply the patch and look at it
properly.  However, it looks like seekInternal(long) should now be left
with only the conditionals that can cause IOException, and nothing
else.  Why do we need to keep the body of the method then?

I also see getFilePointer() call, but can't tell where that method is. 
I'll assume it's from the parent InputStream.

Otis


--- Christoph Goller <go...@detego-software.de> wrote:
> Dmitry Serebrennikov schrieb:
> > I put those in mostly to assure myself that I got things right. I
> think 
> > the key question is whether it possible to read part of another
> file. If 
> > not, I think that's fine. If yes, I think that's a problem.
> > 
> > Dmitry.
> > 
> 
> Hi Dmitry,
> 
> I have been using my patch for a couple of weaks now and I think it
> is ok.
> I kept the index bound checks, so that the original tests go through.
> I would like to commit these changes. Did I answer all your questions
> in my last email? Do you have any objections against commiting the
> patch?
> 
> Christoph
> 
> PS: patch is attached again.
> > Index: CompoundFileReader.java
> ===================================================================
> RCS file:
>
/home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/CompoundFileReader.java,v
> retrieving revision 1.2
> diff -u -r1.2 CompoundFileReader.java
> --- CompoundFileReader.java	13 Oct 2003 14:18:04 -0000	1.2
> +++ CompoundFileReader.java	19 Nov 2003 13:06:40 -0000
> @@ -240,10 +240,9 @@
>                        final long length)
>          throws IOException
>          {
> -            this.base = (InputStream) base.clone();
> +            this.base = base;
>              this.fileOffset = fileOffset;
>              this.length = length;   // variable in the superclass
> -            seekInternal(0);        // position to the adjusted 0th
> byte
>          }
>  
>          /** Expert: implements buffer refill.  Reads bytes from the
> current
> @@ -255,7 +254,10 @@
>          protected void readInternal(byte[] b, int offset, int len)
>          throws IOException
>          {
> -            base.readBytes(b, offset, len);
> +            synchronized (base) {
> +              base.seek(fileOffset + getFilePointer());
> +              base.readBytes(b, offset, len);
> +            }
>          }
>  
>          /** Expert: implements seek.  Sets current position in this
> file, where
> @@ -269,29 +271,11 @@
>  
>              if (pos < 0)
>                  throw new IOException("Seek to a negative offset");
> -
> -            base.seek(fileOffset + pos);
>          }
>  
>          /** Closes the stream to futher operations. */
>          public void close() throws IOException
> -        {
> -            base.close();
> -        }
> +        {}
>  
> -        /** Returns a clone of this stream.
> -         *
> -         * <p>Clones of a stream access the same data, and are
> positioned at the same
> -         * point as the stream they were cloned from.
> -         *
> -         * <p>Expert: Subclasses must ensure that clones may be
> positioned at
> -         * different points in the input from each other and from
> the stream they
> -         * were cloned from.
> -         */
> -        public Object clone() {
> -            CSInputStream other = (CSInputStream) super.clone();
> -            other.base = (InputStream) base.clone();
> -            return other;
> -        }
>      }
>  }
> 
> >
---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


__________________________________
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: CompoundFileReader

Posted by Christoph Goller <go...@detego-software.de>.
Dmitry Serebrennikov schrieb:
> I put those in mostly to assure myself that I got things right. I think 
> the key question is whether it possible to read part of another file. If 
> not, I think that's fine. If yes, I think that's a problem.
> 
> Dmitry.
> 

Hi Dmitry,

I have been using my patch for a couple of weaks now and I think it is ok.
I kept the index bound checks, so that the original tests go through.
I would like to commit these changes. Did I answer all your questions
in my last email? Do you have any objections against commiting the patch?

Christoph

PS: patch is attached again.

Re: CompoundFileReader

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
I put those in mostly to assure myself that I got things right. I think 
the key question is whether it possible to read part of another file. If 
not, I think that's fine. If yes, I think that's a problem.

Dmitry.

Christoph Goller wrote:

> Hi Dmitry,
>
> Now I tried all test cases. They all work except for Russian 
> analyser/stemmer
> and occational fails of TestIndexReader (the timestamp problem). So I 
> think
> it should be ok as far as CoumpoundFile is concerned. Off course we still
> have to find a good solution for the timestamp problem.
>
> However,I stumbled over a problem that I had missed last time. 
> TestCompoundFile
> only succedds with your index bound tests in CSInputStream.seekInternal.
> On Thursday I had deleted them after trying your test cases because 
> the other
> implementations don´t do these tests either. I did not go too deep 
> into your
> tests, but do you think the bahaviour of throwing an exception if the 
> seek
> index is out of bound is required? Its not part of the contract of the 
> other
> implementations of InputStream. Maybe I am missing something here.
>
> Dmitry Serebrennikov schrieb:
>
>> Dear Christoph,
>>
>> Sounds like an excellent enhancement. From a quick look, it appears 
>> that you are right and everything should work just fine but use less 
>> memory. One question: have you tried the other test cases also or 
>> just the TestCompoundFile. There are quite a few conditions that 
>> TestCompoundFile does not cover.
>>
>> At first I thought that the synchronization around readBytes would 
>> cause too much performance degradation when a lot of concurrent 
>> queries were executing. But after I looked at it some more, I 
>> convinced myself that it should be ok. Have you ran any 
>> multi-threaded tests / benchmarks? I think it might also be a good 
>> idea before making this change.
>>
>> Christoph, do you think it is possible to just call readInternal on 
>> the base stream instead of the readBytes? The main difference is that 
>> we would bypass the buffering in the base stream. I think the 
>> buffering done by the superclass of the CSInputStream would be quite 
>> enough (which is your point to begin with, right)? Perhaps it would 
>> be worthwhile to make InputStream.readInternal() public instead of 
>> protected?
>
>
> In CSInputStream.readInternal I call:
>
> synchronized (base) {
>   base.seek(fileOffset + getFilePointer());
>   base.readBytes(b, offset, len);
> }
>
> Calling base.seek does nothing more than setting the file pointer
> (bufferStart + bufferPosition) of base correctly.
>
> base.readBytes(b, offset, len) in this case does not use the buffer of
> base (at least in most cases). Look into InputStream.readBytes.
> If len >= BUFFER_SIZE the base buffer is skipped and the buffer b is
> used directly.
>
> I think synchronized in our case does not much more than synchronizing
> on the actual file in FSInputStream.readInternal.
>
> Christoph
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: TestRussianAnalyzer TestRussianStem

Posted by Christoph Goller <go...@detego-software.de>.
Thank you Erik,

TestRussianAnalyzer now works from within Eclipse.
I made/committed similar changes to TestRussianStem
and it now works from within Eclipse too.

Erik Hatcher schrieb:
> 
> On Monday, October 20, 2003, at 01:10  PM, Christoph Goller wrote:
> 
>> Sorry fo all the mess. I found out it works for me from the command 
>> line, too.
>> My IDE fooled me. The file not found errors for TestRussianAnalyzer 
>> TestRussianStem
>> only occur if I call ant from within Eclipse, even if I explicitly set 
>> the base
>> directory????? Well, maybe I should look into the manual :-)
> 
> 
> I just made a change that will hopefully make this test work fine in 
> your IDE.  Set the system VM property "dataDir" to the absolute path to 
> your src/test directory and it should work.  There are other ways to 
> make JUnit test cases resilient to their environment, but this was the 
> quickest/easiest one.
> 
> Let me know if there are still issues with TestRussianAnalyzer.
> 
>     Erik
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: TestRussianAnalyzer TestRussianStem

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Monday, October 20, 2003, at 01:10  PM, Christoph Goller wrote:
> Sorry fo all the mess. I found out it works for me from the command 
> line, too.
> My IDE fooled me. The file not found errors for TestRussianAnalyzer 
> TestRussianStem
> only occur if I call ant from within Eclipse, even if I explicitly set 
> the base
> directory????? Well, maybe I should look into the manual :-)

I just made a change that will hopefully make this test work fine in 
your IDE.  Set the system VM property "dataDir" to the absolute path to 
your src/test directory and it should work.  There are other ways to 
make JUnit test cases resilient to their environment, but this was the 
quickest/easiest one.

Let me know if there are still issues with TestRussianAnalyzer.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: TestRussianAnalyzer TestRussianStem

Posted by Christoph Goller <go...@detego-software.de>.
Sorry fo all the mess. I found out it works for me from the command line, too.
My IDE fooled me. The file not found errors for TestRussianAnalyzer TestRussianStem
only occur if I call ant from within Eclipse, even if I explicitly set the base
directory????? Well, maybe I should look into the manual :-)

Christoph

Erik Hatcher schrieb:
> TestRussionAnalyzer works fine for me on Mac OS X running 'ant test' 
> from the command-line.
> 
> Could you make sure you have the latest from CVS and let me know the 
> details of the error you're seeing?
> 
> Thanks,
>     Erik
> 
> 
> 
> On Monday, October 20, 2003, at 12:02  PM, Christoph Goller wrote:
> 
>> Its not an encoding problem. I now checked that. What happens is that 
>> the ASCII
>> files for the tests are not found if I run the test from ant using 
>> build.xml.
>> Could you try if you can run the tests from ant using Erik´s latest 
>> build.xml?
>> I am not an ant expert. Either I somewhere have a wrong global 
>> base/root directory
>> setting, or some setting has to be changed in build.xml or within the 
>> tests.
>>
>> Christoph
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 

-- 
*************************************************************
* Dr. Christoph Goller     Tel. : +49 89 203 45734          *
* Managing Director        Email: goller@detego-software.de *
* Detego Software GmbH     Mail : Keuslinstr. 13,           *
*                                 80798 München, Germany    *
*************************************************************


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: TestRussianAnalyzer TestRussianStem

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
TestRussionAnalyzer works fine for me on Mac OS X running 'ant test' 
from the command-line.

Could you make sure you have the latest from CVS and let me know the 
details of the error you're seeing?

Thanks,
	Erik



On Monday, October 20, 2003, at 12:02  PM, Christoph Goller wrote:
> Its not an encoding problem. I now checked that. What happens is that 
> the ASCII
> files for the tests are not found if I run the test from ant using 
> build.xml.
> Could you try if you can run the tests from ant using Erik´s latest 
> build.xml?
> I am not an ant expert. Either I somewhere have a wrong global 
> base/root directory
> setting, or some setting has to be changed in build.xml or within the 
> tests.
>
> Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: TestRussianAnalyzer TestRussianStem

Posted by Christoph Goller <go...@detego-software.de>.
Its not an encoding problem. I now checked that. What happens is that the ASCII
files for the tests are not found if I run the test from ant using build.xml.
Could you try if you can run the tests from ant using Erik´s latest build.xml?
I am not an ant expert. Either I somewhere have a wrong global base/root directory
setting, or some setting has to be changed in build.xml or within the tests.

Christoph

Hani Suleiman schrieb:
> What platform are you running on? The russian stuff was failing 
> previously due to encoding issues, whereby the tests assumed the the 
> default file encoding is always iso-8859-1, which isn't the case on all 
> OS's.
> 
> The reader and writer tests both fail occasionally on linux without the 
> timestamp patch. Would be nice if you could test out my patch and see if 
> those errors go away ;)
> 
> Christoph Goller wrote:
> 
>>
>> Hani Suleiman schrieb:
>>
>>> What Russian stemmer problems? I've submitted a fix for that a few 
>>> weeks ago which was checked in, so there shouldn't be any more 
>>> problems with it.
>>>
>>
>> If I run these two tests separately from my eclipse, everything is fine.
>> If I run all tests with ant from the current build.xml these two tests
>> and TestIndexReader fail. I have checked and I am sure I have the latest
>> sources. I have not idea whats happening. Do you think failure of
>> TestRussianAnalyzer and TestRussianStem could be caused by the timestamp
>> problem? I am working with Linux and TestIndexReader occasionally fails
>> if FSDirectory is used.
>>
>> Christoph
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>>
>>
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 

-- 
*************************************************************
* Dr. Christoph Goller     Tel. : +49 89 203 45734          *
* Managing Director        Email: goller@detego-software.de *
* Detego Software GmbH     Mail : Keuslinstr. 13,           *
*                                 80798 München, Germany    *
*************************************************************


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: TestRussianAnalyzer TestRussianStem

Posted by Hani Suleiman <ha...@formicary.net>.
What platform are you running on? The russian stuff was failing 
previously due to encoding issues, whereby the tests assumed the the 
default file encoding is always iso-8859-1, which isn't the case on all 
OS's.

The reader and writer tests both fail occasionally on linux without the 
timestamp patch. Would be nice if you could test out my patch and see if 
those errors go away ;)

Christoph Goller wrote:

> 
> Hani Suleiman schrieb:
> 
>> What Russian stemmer problems? I've submitted a fix for that a few 
>> weeks ago which was checked in, so there shouldn't be any more 
>> problems with it.
>>
> 
> If I run these two tests separately from my eclipse, everything is fine.
> If I run all tests with ant from the current build.xml these two tests
> and TestIndexReader fail. I have checked and I am sure I have the latest
> sources. I have not idea whats happening. Do you think failure of
> TestRussianAnalyzer and TestRussianStem could be caused by the timestamp
> problem? I am working with Linux and TestIndexReader occasionally fails
> if FSDirectory is used.
> 
> Christoph
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


TestRussianAnalyzer TestRussianStem

Posted by Christoph Goller <go...@detego-software.de>.
Hani Suleiman schrieb:
> What Russian stemmer problems? I've submitted a fix for that a few weeks 
> ago which was checked in, so there shouldn't be any more problems with it.
> 

If I run these two tests separately from my eclipse, everything is fine.
If I run all tests with ant from the current build.xml these two tests
and TestIndexReader fail. I have checked and I am sure I have the latest
sources. I have not idea whats happening. Do you think failure of
TestRussianAnalyzer and TestRussianStem could be caused by the timestamp
problem? I am working with Linux and TestIndexReader occasionally fails
if FSDirectory is used.

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: CompoundFileReader

Posted by Hani Suleiman <ha...@formicary.net>.
What Russian stemmer problems? I've submitted a fix for that a few 
weeks ago which was checked in, so there shouldn't be any more problems 
with it.

On Saturday, October 18, 2003, at 01:07 PM, Christoph Goller wrote:

> Hi Dmitry,
>
> Now I tried all test cases. They all work except for Russian 
> analyser/stemmer
> and occational fails of TestIndexReader (the timestamp problem). So I 
> think
> it should be ok as far as CoumpoundFile is concerned. Off course we 
> still
> have to find a good solution for the timestamp problem.
>
> However,I stumbled over a problem that I had missed last time. 
> TestCompoundFile
> only succedds with your index bound tests in 
> CSInputStream.seekInternal.
> On Thursday I had deleted them after trying your test cases because 
> the other
> implementations don´t do these tests either. I did not go too deep 
> into your
> tests, but do you think the bahaviour of throwing an exception if the 
> seek
> index is out of bound is required? Its not part of the contract of the 
> other
> implementations of InputStream. Maybe I am missing something here.
>
> Dmitry Serebrennikov schrieb:
>> Dear Christoph,
>> Sounds like an excellent enhancement. From a quick look, it appears 
>> that you are right and everything should work just fine but use less 
>> memory. One question: have you tried the other test cases also or 
>> just the TestCompoundFile. There are quite a few conditions that 
>> TestCompoundFile does not cover.
>> At first I thought that the synchronization around readBytes would 
>> cause too much performance degradation when a lot of concurrent 
>> queries were executing. But after I looked at it some more, I 
>> convinced myself that it should be ok. Have you ran any 
>> multi-threaded tests / benchmarks? I think it might also be a good 
>> idea before making this change.
>> Christoph, do you think it is possible to just call readInternal on 
>> the base stream instead of the readBytes? The main difference is that 
>> we would bypass the buffering in the base stream. I think the 
>> buffering done by the superclass of the CSInputStream would be quite 
>> enough (which is your point to begin with, right)? Perhaps it would 
>> be worthwhile to make InputStream.readInternal() public instead of 
>> protected?
>
> In CSInputStream.readInternal I call:
>
> synchronized (base) {
>   base.seek(fileOffset + getFilePointer());
>   base.readBytes(b, offset, len);
> }
>
> Calling base.seek does nothing more than setting the file pointer
> (bufferStart + bufferPosition) of base correctly.
>
> base.readBytes(b, offset, len) in this case does not use the buffer of
> base (at least in most cases). Look into InputStream.readBytes.
> If len >= BUFFER_SIZE the base buffer is skipped and the buffer b is
> used directly.
>
> I think synchronized in our case does not much more than synchronizing
> on the actual file in FSInputStream.readInternal.
>
> Christoph
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: CompoundFileReader

Posted by Christoph Goller <go...@detego-software.de>.
Hi Dmitry,

Now I tried all test cases. They all work except for Russian analyser/stemmer
and occational fails of TestIndexReader (the timestamp problem). So I think
it should be ok as far as CoumpoundFile is concerned. Off course we still
have to find a good solution for the timestamp problem.

However,I stumbled over a problem that I had missed last time. TestCompoundFile
only succedds with your index bound tests in CSInputStream.seekInternal.
On Thursday I had deleted them after trying your test cases because the other
implementations don´t do these tests either. I did not go too deep into your
tests, but do you think the bahaviour of throwing an exception if the seek
index is out of bound is required? Its not part of the contract of the other
implementations of InputStream. Maybe I am missing something here.

Dmitry Serebrennikov schrieb:
> Dear Christoph,
> 
> Sounds like an excellent enhancement. From a quick look, it appears that 
> you are right and everything should work just fine but use less memory. 
> One question: have you tried the other test cases also or just the 
> TestCompoundFile. There are quite a few conditions that TestCompoundFile 
> does not cover.
> 
> At first I thought that the synchronization around readBytes would cause 
> too much performance degradation when a lot of concurrent queries were 
> executing. But after I looked at it some more, I convinced myself that 
> it should be ok. Have you ran any multi-threaded tests / benchmarks? I 
> think it might also be a good idea before making this change.
> 
> Christoph, do you think it is possible to just call readInternal on the 
> base stream instead of the readBytes? The main difference is that we 
> would bypass the buffering in the base stream. I think the buffering 
> done by the superclass of the CSInputStream would be quite enough (which 
> is your point to begin with, right)? Perhaps it would be worthwhile to 
> make InputStream.readInternal() public instead of protected?

In CSInputStream.readInternal I call:

synchronized (base) {
   base.seek(fileOffset + getFilePointer());
   base.readBytes(b, offset, len);
}

Calling base.seek does nothing more than setting the file pointer
(bufferStart + bufferPosition) of base correctly.

base.readBytes(b, offset, len) in this case does not use the buffer of
base (at least in most cases). Look into InputStream.readBytes.
If len >= BUFFER_SIZE the base buffer is skipped and the buffer b is
used directly.

I think synchronized in our case does not much more than synchronizing
on the actual file in FSInputStream.readInternal.

Christoph



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: CompoundFileReader

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Dear Christoph,

Sounds like an excellent enhancement. From a quick look, it appears that 
you are right and everything should work just fine but use less memory. 
One question: have you tried the other test cases also or just the 
TestCompoundFile. There are quite a few conditions that TestCompoundFile 
does not cover.

At first I thought that the synchronization around readBytes would cause 
too much performance degradation when a lot of concurrent queries were 
executing. But after I looked at it some more, I convinced myself that 
it should be ok. Have you ran any multi-threaded tests / benchmarks? I 
think it might also be a good idea before making this change.

Christoph, do you think it is possible to just call readInternal on the 
base stream instead of the readBytes? The main difference is that we 
would bypass the buffering in the base stream. I think the buffering 
done by the superclass of the CSInputStream would be quite enough (which 
is your point to begin with, right)? Perhaps it would be worthwhile to 
make InputStream.readInternal() public instead of protected?

Thanks.
Dmitry.


Christoph Goller wrote:

> Dear Dmitry,
>
> I finally found time to look into your compound file implementation and
> to try it out. It works and I like it. However, I wondered, why you are
> using clones of the base input stream in 
> CompoundFileReader.CSInputStream.
> The "problem" (actually it is not a real problem) I see is that you are
> double buffering your input. Note that the clone has a buffer of 1024 
> bytes and
> that CompoundFileReader.CSInputStream, which extends InputStream, also 
> has
> such a buffer. I think in the way you implemented it your input gets 
> copied
> into the base buffer and from there into the 
> CompoundFileReader.CSInputStream
> buffer before it is actually used. Please have a look at my patch. At 
> first
> one might think that my implementation is simpler but less efficient. 
> This
> is not the case. Actually it is even a little bit more efficient. 
> Points to
> think about:
>
> *) stream is private to CompoundFileReader, so nobody else can use it, 
> if not
> explicitly granted access. CompoundFileReader uses stream only in its
> constructor and when calling openFile.
>
> *) Therefore, CompoundFileReader.CSInputStream can share the same 
> instance of
> stream if they take care of synchronization and seek.
>
> *) An InputStream not necessarily has to implement seekInternal if 
> readInternal
> takes care of seeking on the real stream (see FSInputStream).
>
> *) Synchronization of CompoundFileReader.CSInputStream.readInternal on 
> base
> does no harm since it is called only when the real file has to be 
> accessed
> and thi is synchronized anyway. However, this synchronizytion is 
> necessary!
>
> I tested my patch with your TestCompoundFile and everything seems fine.
> I also tried it on one of my indices and searched on this index. It also
> works. My impression from my tests is that my patch is a little bit 
> faster
> than your original version but there seems to be not much difference in
> efficiency.
>
> Christoph
>
>------------------------------------------------------------------------
>
>Index: CompoundFileReader.java
>===================================================================
>RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/index/CompoundFileReader.java,v
>retrieving revision 1.2
>diff -u -r1.2 CompoundFileReader.java
>--- CompoundFileReader.java	13 Oct 2003 14:18:04 -0000	1.2
>+++ CompoundFileReader.java	16 Oct 2003 18:58:02 -0000
>@@ -240,10 +240,9 @@
>                       final long length)
>         throws IOException
>         {
>-            this.base = (InputStream) base.clone();
>+            this.base = base;
>             this.fileOffset = fileOffset;
>             this.length = length;   // variable in the superclass
>-            seekInternal(0);        // position to the adjusted 0th byte
>         }
> 
>         /** Expert: implements buffer refill.  Reads bytes from the current
>@@ -255,7 +254,10 @@
>         protected void readInternal(byte[] b, int offset, int len)
>         throws IOException
>         {
>-            base.readBytes(b, offset, len);
>+            synchronized (base) {
>+              base.seek(fileOffset + getFilePointer());
>+              base.readBytes(b, offset, len);
>+            }
>         }
> 
>         /** Expert: implements seek.  Sets current position in this file, where
>@@ -263,35 +265,11 @@
>          * @see #readInternal(byte[],int,int)
>          */
>         protected void seekInternal(long pos) throws IOException
>-        {
>-            if (pos > 0 && pos >= length)
>-                throw new IOException("Seek past the end of file");
>-
>-            if (pos < 0)
>-                throw new IOException("Seek to a negative offset");
>-
>-            base.seek(fileOffset + pos);
>-        }
>+        {}
> 
>         /** Closes the stream to futher operations. */
>         public void close() throws IOException
>-        {
>-            base.close();
>-        }
>+        {}
> 
>-        /** Returns a clone of this stream.
>-         *
>-         * <p>Clones of a stream access the same data, and are positioned at the same
>-         * point as the stream they were cloned from.
>-         *
>-         * <p>Expert: Subclasses must ensure that clones may be positioned at
>-         * different points in the input from each other and from the stream they
>-         * were cloned from.
>-         */
>-        public Object clone() {
>-            CSInputStream other = (CSInputStream) super.clone();
>-            other.base = (InputStream) base.clone();
>-            return other;
>-        }
>     }
> }
>
>  
>
>------------------------------------------------------------------------
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org