Posted to dev@lucene.apache.org by Roman Rokytskyy <rr...@yahoo.co.uk> on 2002/04/22 16:00:58 UTC

InputStream handling problem

Hi,

I just joined the list, and I'm not sure if this is the correct place to ask,
but I do believe that my problem is a development issue. I searched the
archives but didn't find any reference to it being raised before.

I am creating a Directory implementation for the JDataStore database from
Borland, and I have the following problem: Lucene tries to delete a file that
is still held open by some of the InputStreams.

JDataStore has direct support for streams and files, so my stream
implementation does not do much: it opens the underlying stream and
delegates calls to it. But JDataStore does not allow you to delete a file
while there is at least one open stream on it.

I wrote code that monitors all open streams on a file and closes them before
the file is deleted. Then I get another problem: after the file is deleted
there is still some activity (readInternal(...)) on the streams that were
closed before the deletion. This seems to be a bug.

Also, I modified RAMDirectory and implemented a mechanism that counts the
references to each RAMFile (+1 when a stream is opened, -1 when a stream is
closed). In RAMDirectory.deleteFile(String) I check the number of references
and, if it is > 0, throw a java.lang.Error (when you throw an IOException in
Directory.deleteFile(String), Lucene seems not to notice it). I do get these
errors when running ThreadSafetyTest (I had to make a small modification so
that it uses one instance of the directory instead of creating them when
needed).
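
The counting check described above can be sketched roughly as follows. This
is only an illustration and not the attached patch: RefCountingDirectory and
its methods are hypothetical names, and an unchecked IllegalStateException
stands in for the java.lang.Error mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a directory that counts open streams per file.
// None of these names are Lucene API; they only illustrate the counting idea.
class RefCountingDirectory {
    private final Map<String, byte[]> files = new HashMap<>();
    private final Map<String, Integer> openStreams = new HashMap<>();

    synchronized void createFile(String name, byte[] data) {
        files.put(name, data);
    }

    // Called when a stream on the file is opened (+1).
    synchronized void incInterest(String name) {
        openStreams.merge(name, 1, Integer::sum);
    }

    // Called when a stream on the file is closed (-1).
    synchronized void decInterest(String name) {
        Integer count = openStreams.get(name);
        if (count == null) {
            throw new IllegalStateException("No open streams on " + name);
        }
        if (count == 1) {
            openStreams.remove(name);
        } else {
            openStreams.put(name, count - 1);
        }
    }

    synchronized int interest(String name) {
        return openStreams.getOrDefault(name, 0);
    }

    // Refuses to delete while at least one stream is still open.
    synchronized void deleteFile(String name) {
        if (interest(name) > 0) {
            throw new IllegalStateException(
                "Cannot delete file while there's interest in it: " + name);
        }
        files.remove(name);
    }

    synchronized boolean fileExists(String name) {
        return files.containsKey(name);
    }
}
```

With a check like this in place, a delete issued while any stream is open
fails loudly instead of passing unnoticed.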

Should I post my changes that show the problem here?

Thank you in advance.

Best regards,
Roman Rokytskyy


_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: InputStream handling problem

Posted by Roman Rokytskyy <rr...@yahoo.co.uk>.
> I'm sorry, I should have been more specific. The file handle is only in
> the picture when FSInputStream is cloned. From what I can tell after a
> quick look, InputStream is responsible for buffering and it delegates to
> subclasses (via a call to readInternal) to refill the buffer from the
> underlying data store. When cloned, the InputStream clones the buffer
> (in the hope that the next read will still hit the buffered data I
> suppose), but after that it has its own seek position and its own
> buffer. In the case of FSInputStream, there is a Descriptor object that
> is shared between the clones. In the case of RAMInputStream - RAMFile is
> the shared object.

What is the reason to have a buffer with RAMInputStream? To have another copy
of the same data?

> Perhaps a factory pattern would be more flexible, but it looks like the
> existing code does a pretty good job for the RAM and FS cases. Would the
> factory pattern allow a better database implementation?

It might. If you use an embedded database like JDataStore, you should not
cache data internally; the database does this. So the buffer and cache simply
introduce additional memory consumption.

> I don't know, I have not heard many complaints about that code recently.

Ok, I will try it "as is" with JDataStore, and if it works - fine.

> There is activity in terms of creating a crawler / content handler
> framework. There is also a need to handle "update" better, I think. For
> example, I think it would be great to have deletes go through
> IndexWriter and get "cached" in the new segment, to be later applied to
> the prior segments during optimization. This would make deletes and adds
> transactional.

Ok, I will have a look, but I have almost no experience with Lucene.

> Another thing on my wish / todo list is to reduce the number of OS files
> that must be open. Once you get a lot of indexes, with a number of
> stored fields, and keep re-indexing them, the number of open files grows
> rather quickly. And if Lucene is part of another program that already
> has other file IO needs, you end up quickly pushing into the max open
> files limit of the OS. The idea I have for this one is to implement a
> different kind of segment - one that is composed of a single file. Once
> a segment is created by IndexWriter, it never changes (besides the
> deletes), so it could easily be stored as a single file.

I will check this with JDataStore. Maybe we could borrow a couple of
ideas from them (like the built-in file system)... This would simplify life -
one file for all indices, tx support?, backup, etc.

Thanks!
Roman Rokytskyy


Re: InputStream handling problem

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Roman Rokytskyy wrote:

>>Yes, I forgot about that one. It's even more interesting than that! The
>>stream objects that Doug coded are not java.io streams. They are
>>wrappers on top of those. Each clone maintains its own seek offset.
>>Essentially, they share the same OS file handle but present an
>>abstraction of multiple independent streams into the same file.
>>
>
>Sorry, but isn't file handle sharing something specific to FSInputStream?
>Why do we force that at the abstract class level?
>
I'm sorry, I should have been more specific. The file handle is only in 
the picture when FSInputStream is cloned. From what I can tell after a 
quick look, InputStream is responsible for buffering and it delegates to 
subclasses (via a call to readInternal) to refill the buffer from the 
underlying data store. When cloned, the InputStream clones the buffer 
(in the hope that the next read will still hit the buffered data I 
suppose), but after that it has its own seek position and its own 
buffer. In the case of FSInputStream, there is a Descriptor object that 
is shared between the clones. In the case of RAMInputStream - RAMFile is 
the shared object.
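
As a rough illustration of the cloning behaviour described above, here is a
minimal sketch. It is not Lucene's InputStream class: BufferedInput,
ByteArrayInput and the tiny buffer size are assumptions made for the example.
Only the idea is taken from the description: the base class buffers and calls
readInternal, a clone gets its own buffer and position, and the backing data
is shared.

```java
// Hypothetical names throughout; this is not Lucene's InputStream class.
abstract class BufferedInput implements Cloneable {
    private static final int BUFFER_SIZE = 4; // tiny, to make refills visible
    private byte[] buffer;
    private long bufferStart = 0;   // offset in the file of buffer[0]
    private int bufferLength = 0;   // number of valid bytes in the buffer
    private int bufferPosition = 0; // next byte to hand out

    // Subclasses refill the buffer from the underlying data store.
    protected abstract int readInternal(byte[] dest, long filePosition, int length);

    byte readByte() {
        if (bufferPosition >= bufferLength) {
            refill();
        }
        return buffer[bufferPosition++];
    }

    void seek(long pos) {
        if (pos >= bufferStart && pos < bufferStart + bufferLength) {
            bufferPosition = (int) (pos - bufferStart); // still inside the buffer
        } else {
            bufferStart = pos; // force a refill on the next read
            bufferLength = 0;
            bufferPosition = 0;
        }
    }

    private void refill() {
        long start = bufferStart + bufferPosition;
        if (buffer == null) {
            buffer = new byte[BUFFER_SIZE];
        }
        int n = readInternal(buffer, start, BUFFER_SIZE);
        if (n <= 0) {
            throw new IllegalStateException("read past end of input");
        }
        bufferStart = start;
        bufferLength = n;
        bufferPosition = 0;
    }

    @Override
    public BufferedInput clone() {
        try {
            BufferedInput copy = (BufferedInput) super.clone();
            if (buffer != null) {
                copy.buffer = buffer.clone(); // each clone owns its buffer
            }
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}

// The byte array plays the role of the shared object (RAMFile or Descriptor).
class ByteArrayInput extends BufferedInput {
    private final byte[] data; // shared by all clones, never copied

    ByteArrayInput(byte[] data) {
        this.data = data;
    }

    @Override
    protected int readInternal(byte[] dest, long filePosition, int length) {
        int n = (int) Math.min(length, data.length - filePosition);
        if (n <= 0) {
            return 0;
        }
        System.arraycopy(data, (int) filePosition, dest, 0, n);
        return n;
    }
}
```

Seeking or reading on a clone does not move the original, even though both
read from the same underlying data.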

>
>
>I would suggest a factory pattern, where an input stream is created for a file
>and how this is handled is up to the implementation. FSDirectory will share
>handles, RAMDirectory will hold references to the same RAMFile object, and my
>JDataStoreDirectory will rely on JDataStore to manage it effectively.
>
Perhaps a factory pattern would be more flexible, but it looks like the 
existing code does a pretty good job for the RAM and FS cases. Would the 
factory pattern allow a better database implementation?

>
>
>Should I try to rewrite it? (I would also appreciate your opinion on whether I
>should touch that code at all.)
>
I don't know, I have not heard many complaints about that code recently. 
There is activity in terms of creating a crawler / content handler 
framework. There is also a need to handle "update" better, I think. For 
example, I think it would be great to have deletes go through 
IndexWriter and get "cached" in the new segment, to be later applied to 
the prior segments during optimization. This would make deletes and adds 
transactional.

Another thing on my wish / todo list is to reduce the number of OS files 
that must be open. Once you get a lot of indexes, with a number of 
stored fields, and keep re-indexing them, the number of open files grows 
rather quickly. And if Lucene is part of another program that already 
has other file IO needs, you end up quickly pushing into the max open 
files limit of the OS. The idea I have for this one is to implement a 
different kind of segment - one that is composed of a single file. Once 
a segment is created by IndexWriter, it never changes (besides the 
deletes), so it could easily be stored as a single file.
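
One way to picture the single-file segment idea is a table of contents that
maps each logical file name to an offset and length inside one blob. The
sketch below is purely illustrative: SingleFileSegment is a hypothetical
name, the blob lives in memory, and no particular on-disk layout is implied.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: several logical files packed into one immutable blob.
class SingleFileSegment {
    private final byte[] blob;
    // logical file name -> {offset, length} inside the blob
    private final Map<String, int[]> entries = new LinkedHashMap<>();

    SingleFileSegment(Map<String, byte[]> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, byte[]> part : parts.entrySet()) {
            byte[] bytes = part.getValue();
            entries.put(part.getKey(), new int[] {out.size(), bytes.length});
            out.write(bytes, 0, bytes.length);
        }
        blob = out.toByteArray();
    }

    // Returns a copy of the bytes that belong to one logical file.
    byte[] read(String name) {
        int[] entry = entries.get(name);
        if (entry == null) {
            throw new IllegalArgumentException("No such entry: " + name);
        }
        return Arrays.copyOfRange(blob, entry[0], entry[0] + entry[1]);
    }

    int totalLength() {
        return blob.length;
    }
}
```

Because the segment never changes after it is written, the table of contents
can be built once and only one OS file needs to stay open per segment.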

These are just a few areas that are my favorites... But then again, if 
you see another problem that's in your way, chances are that there are 
other people out there with the same issue. 

In any case, good luck!
Dmitry.


RE: InputStream handling problem

Posted by Roman Rokytskyy <rr...@yahoo.co.uk>.
> Yes, I forgot about that one. It's even more interesting than that! The
> stream objects that Doug coded are not java.io streams. They are
> wrappers on top of those. Each clone maintains its own seek offset.
> Essentially, they share the same OS file handle but present an
> abstraction of multiple independent streams into the same file.

Sorry, but isn't file handle sharing something specific to FSInputStream?
Why do we force that at the abstract class level?

I would suggest a factory pattern, where an input stream is created for a file
and how this is handled is up to the implementation. FSDirectory will share
handles, RAMDirectory will hold references to the same RAMFile object, and my
JDataStoreDirectory will rely on JDataStore to manage it effectively.
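
A minimal sketch of what such a factory could look like, under stated
assumptions: Input, InputFactory and RamInputFactory are hypothetical names
invented for this example, and the in-memory variant simply hands out views
on the same byte array without any extra buffer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal stream: returns the byte at a position, or -1 past the end.
interface Input {
    int read(long position);
}

// Hypothetical factory: each implementation decides how streams are created.
interface InputFactory {
    Input open(String name);
}

// In-memory variant: all streams for a name refer to the same array.
class RamInputFactory implements InputFactory {
    private final Map<String, byte[]> files = new HashMap<>();

    void put(String name, byte[] data) {
        files.put(name, data);
    }

    @Override
    public Input open(String name) {
        byte[] data = files.get(name);
        if (data == null) {
            throw new IllegalArgumentException("No such file: " + name);
        }
        // No buffering here: the array itself is the shared storage.
        return position ->
            (position >= 0 && position < data.length) ? (data[(int) position] & 0xFF) : -1;
    }
}
```

A file-system or database-backed factory would implement the same interface
and keep its handle sharing or caching policy to itself.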

Should I try to rewrite it? (I would also appreciate your opinion on whether I
should touch that code at all.)

Thanks,
Roman Rokytskyy


Re: InputStream handling problem

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Roman Rokytskyy wrote:

>>So, I think Otis is right - it's really not a "problem", besides being
>>an interesting design problem that is. There is an issue of whether it
>>is a good practice to make use of OS-specific behavior in this way.
>>Obviously, the portability suffers. I'm not sure if there are
>>performance arguments one way or another (Doug?).
>>
>
>My main concern here was that I get exceptions from JDataStore, and I had to
>add checks for whether there is still an open stream. If you say that the file
>will be deleted, I have no problem with it. :) I will hope that eventually all
>streams will be closed and the file will be deleted.
>
I'm pretty sure it's ok, but you may want to run a few tests to make sure.

>
>Another interesting design issue is cloneable input streams (this is where
>the problem comes from). Since I have never come across this before, it would
>be great to add some Javadoc that explains what state a cloned input stream is
>supposed to be in. Thanks to JDataStore, the input stream there is random
>access and I can seek wherever I want, but what if some stream does not
>provide such functionality?
>
Yes, I forgot about that one. It's even more interesting than that! The
stream objects that Doug coded are not java.io streams. They are
wrappers on top of those. Each clone maintains its own seek offset.
Essentially, they share the same OS file handle but present an
abstraction of multiple independent streams into the same file.

Dmitry.



RE: InputStream handling problem

Posted by Roman Rokytskyy <rr...@yahoo.co.uk>.
> So, I think Otis is right - it's really not a "problem", besides being
> an interesting design problem that is. There is an issue of whether it
> is a good practice to make use of OS-specific behavior in this way.
> Obviously, the portability suffers. I'm not sure if there are
> performance arguments one way or another (Doug?).

My main concern here was that I get exceptions from JDataStore, and I had to
add checks for whether there is still an open stream. If you say that the file
will be deleted, I have no problem with it. :) I will hope that eventually all
streams will be closed and the file will be deleted.

Another interesting design issue is cloneable input streams (this is where
the problem comes from). Since I have never come across this before, it would
be great to add some Javadoc that explains what state a cloned input stream is
supposed to be in. Thanks to JDataStore, the input stream there is random
access and I can seek wherever I want, but what if some stream does not
provide such functionality?

Best regards,
Roman Rokytskyy


Re: InputStream handling problem

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Otis Gospodnetic wrote:

>
>I don't know, are you sure that what you are seeing really is a
>problem, that it is wrong to get rid of a file for which there is
>interest?
>It sounds logical, but maybe Doug wrote something that we can't find
>that makes this an okay thing to do.
>If this is a bug I wonder how come more people haven't complained about
>it...
>
>Sorry I couldn't help more, maybe somebody else can find the problem.
>
>Otis
>
I think originally the code was written to work on Unix, where deleting
an open file is ok - it simply removes the directory entry so that no one
else can open the file any more. The OS keeps track of the file's readers and
cleans up after the last reader has closed the file (or died). Later
on, when the code was moved to DOS, there was a problem with this
because DOS will refuse to delete a file if it is open. So the delete
code would fail (I think File.delete() just returns false, no exceptions
are thrown). To work around this, there is the "deleted" file which lists
the segments that could be deleted (but weren't due to the fact that
they were still open). Periodically, this file is checked and the deletes
are attempted again.
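
The retry scheme described above can be sketched as follows. This is a
minimal illustration and not Lucene's code: DeferredDeleter, Store and
retryPending are hypothetical names, and the list of pending deletes is kept
in memory here rather than in a file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "remember failed deletes and retry them later".
class DeferredDeleter {
    // Stand-in for a store whose delete can fail while a file is open,
    // in the same way that File.delete() reports failure by returning false.
    interface Store {
        boolean delete(String name);
    }

    private final Store store;
    private final List<String> deletable = new ArrayList<>();

    DeferredDeleter(Store store) {
        this.store = store;
    }

    // Try to delete now; remember the name if the store refuses.
    synchronized void delete(String name) {
        if (!store.delete(name) && !deletable.contains(name)) {
            deletable.add(name);
        }
    }

    // Retry every remembered delete; names that succeed leave the list.
    synchronized void retryPending() {
        deletable.removeIf(store::delete);
    }

    synchronized List<String> pending() {
        return new ArrayList<>(deletable);
    }
}
```

On a platform where open files can be deleted, the pending list simply stays
empty and the retry step does nothing.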

So, I think Otis is right - it's really not a "problem", besides being 
an interesting design problem that is. There is an issue of whether it 
is a good practice to make use of OS-specific behavior in this way. 
Obviously, the portability suffers. I'm not sure if there are 
performance arguments one way or another (Doug?).

Dmitry.



RE: InputStream handling problem

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Ah, sorry.  After I hit send I realized that I should have cleared my
CLASSPATH, as it was an old jar, with the old RAMDirectory, that was being
used.

I do get the error now:
java.lang.Error: Cannot delete file while there's interest in it
	at org.apache.lucene.store.RAMDirectory.deleteFile(RAMDirectory.java:145)
	at org.apache.lucene.index.IndexWriter.deleteFiles(IndexWriter.java:364)
	at org.apache.lucene.index.IndexWriter.deleteSegments(IndexWriter.java:345)
	at org.apache.lucene.index.IndexWriter.access$200(IndexWriter.java:87)
	at org.apache.lucene.index.IndexWriter$2.doBody(IndexWriter.java:325)
	at org.apache.lucene.store.Lock$With.run(Lock.java:116)
	at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:322)
	at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:283)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:185)
	at org.apache.lucene.ThreadSafetyTest$IndexerThread.run(ThreadSafetyTest.java:112)


Well, you may be onto something, although I don't see the culprit yet.
At first I thought you forgot to add this in RAMDirectory:

  public final OutputStream createFile(String name) {
    RAMFile file = new RAMFile();
    files.put(name, file);
    // OG
    incInterest(name);     // this
    return new RAMOutputStream(file, name);
  }

However, that causes the error to happen right away, which I didn't
expect, as I am increasing 'interest'.  Oh, I see, you are increasing
it in RAMOutputStream constructor...

Hm, well, it looks like I am only able to confirm your observations:

[otis@linux2 classes]$ java org.apache.lucene.ThreadSafetyTest &> o.log
[otis@linux2 classes]$ grep Error o.log 
java.lang.Error: Cannot delete file while there's interest in it: _b.fdx
[otis@linux2 classes]$ 
[otis@linux2 classes]$ grep _b.fdx o.log 
Increased interest in _b.fdx to 1
Decreased interest in _b.fdx to 0
Increased interest in _b.fdx to 1
Increased interest in _b.fdx to 2
Increased interest in _b.fdx to 3
Decreased interest in _b.fdx to 2
Increased interest in _b.fdx to 3
Increased interest in _b.fdx to 4
Increased interest in _b.fdx to 5
Increased interest in _b.fdx to 6
Decreased interest in _b.fdx to 5
Increased interest in _b.fdx to 6
Increased interest in _b.fdx to 7
Increased interest in _b.fdx to 8
Decreased interest in _b.fdx to 7
Increased interest in _b.fdx to 8
java.lang.Error: Cannot delete file while there's interest in it: _b.fdx
Decreased interest in _b.fdx to 7

It looks like the number of places where you increase interest, and
where you decrease or drop it, is balanced.  The super class of
RAMDirectory has all abstract methods, so nothing is happening there.

I don't know, are you sure that what you are seeing really is a
problem, that it is wrong to get rid of a file for which there is
interest?
It sounds logical, but maybe Doug wrote something that we can't find
that makes this an okay thing to do.
If this is a bug I wonder how come more people haven't complained about
it...

Sorry I couldn't help more, maybe somebody else can find the problem.

Otis



RE: InputStream handling problem

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

I just used your classes (I picked the ones that looked right) and ran
ThreadSafetyTest, but I can't get it to throw any Errors/Exceptions.

I can send you the 2 .java files I picked from the 4 that you sent,
perhaps I picked the wrong ones...

Otis


--- Roman Rokytskyy <rr...@yahoo.co.uk> wrote:
> > Please send a test case, that will be really helpful.
> 
> Thanks for quick reply.
> 
> In the attachment you will find two modified classes from the latest CVS
> update. Please comment out/remove references to JDSDirectory, if any (they
> are copies from my working environment, and JDSDirectory depends on
> JDataStore classes, so it is not included).
> 
> Best regards,
> Roman Rokytskyy
> 

> ATTACHMENT part 2 application/x-javascript name=ThreadSafetyTest.java


> ATTACHMENT part 3 application/x-javascript name=RAMDirectory.java


> ATTACHMENT part 4 application/x-javascript name=RAMDirectory.java


> ATTACHMENT part 5 application/x-javascript name=ThreadSafetyTest.java


RE: InputStream handling problem

Posted by Roman Rokytskyy <rr...@yahoo.co.uk>.
Sorry for the double copies of each file. This was an Outlook error. I will be
more careful next time.

> -----Original Message-----
> From: Roman Rokytskyy [mailto:rrokytskyy@yahoo.co.uk]
> Sent: Monday, 22 April 2002 16:56
> To: Lucene Developers List
> Subject: RE: InputStream handling problem
>
>
> > Please send a test case, that will be really helpful.
>
> Thanks for quick reply.
>
> In the attachment you will find two modified classes from the latest CVS
> update. Please comment out/remove references to JDSDirectory, if any (they
> are copies from my working environment, and JDSDirectory depends on
> JDataStore classes, so it is not included).
>
> Best regards,
> Roman Rokytskyy
>


RE: InputStream handling problem

Posted by Roman Rokytskyy <rr...@yahoo.co.uk>.
> Please send a test case, that will be really helpful.

Thanks for the quick reply.

In the attachment you will find two modified classes from the latest CVS
update. Please comment out/remove references to JDSDirectory, if any (they
are copies from my working environment, and JDSDirectory depends on
JDataStore classes, so it is not included).

Best regards,
Roman Rokytskyy

Re: InputStream handling problem

Posted by Peter Carlson <ca...@bookandhammer.com>.
Thanks for the work Roman,

Please send a test case, that will be really helpful.

--Peter
