Posted to dev@lucene.apache.org by Doron Cohen <DO...@il.ibm.com> on 2007/05/11 20:22:17 UTC

IndexReader.isCurrent in presence of many files

If this really turns out to be related to having many files in the index
dir, could we maintain the SEGMENTS_N files in a sub-directory?

Doron

-- Forward --
-- http://www.mail-archive.com/java-user@lucene.apache.org/msg14398.html

Chris Hostetter <ho...@fucit.org> wrote on 11/05/2007 11:02:50:

>
> : Is there a large number of files in your index directory?
>
> and is there any correlation between the number of files matching segment*
> and the time isCurrent takes?
>
> it would also be handy to know what filesystem you use as well ...
> directory listings may be more expensive on some filesystems than others.
>
>
>
> -Hoss
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: IndexReader.isCurrent in presence of many files

Posted by Yonik Seeley <yo...@apache.org>.
On 5/11/07, Doron Cohen <DO...@il.ibm.com> wrote:
> If this really turns out to be related to having many files in the index
> dir, could we maintain the SEGMENTS_N files in a sub-directory?

1) There might be slight incompatibilities with tools that assume a
lucene index is a bunch of files in the index directory.

2) I don't think we should bend over backward to speed up NFS
access... it's a *slow* way to distribute an index anyway.

-Yonik



Re: IndexReader.isCurrent in presence of many files

Posted by Doron Cohen <DO...@il.ibm.com>.
yseeley@gmail.com wrote on 11/05/2007 20:07:11:

> However, is there a way to portably stat a directory? That could lead
> to a fast-path if no new files were added.

Do you mean something like File.lastModified() on the index dir?
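
A minimal Java sketch of the directory-stamp fast path being discussed,
using only java.io.File. The class name DirStampCheck and its method are
hypothetical, not part of Lucene. It assumes the filesystem updates the
directory's modification time when entries are added or removed, which is
common but not guaranteed (timestamp granularity can be a second or
coarser, and NFS clients may cache attributes), so an unchanged stamp is
only a hint.

```java
import java.io.File;

/**
 * Hypothetical helper (not a Lucene class): remembers the index
 * directory's modification time and reports whether it has changed.
 */
final class DirStampCheck {
    private final File dir;
    private long lastSeen;

    DirStampCheck(File dir) {
        this.dir = dir;
        // File.lastModified() returns 0L if the path does not exist.
        this.lastSeen = dir.lastModified();
    }

    /**
     * Returns false if the directory stamp is unchanged since the last
     * call (the fast path), true if it differs and a full check is needed.
     */
    synchronized boolean mayHaveChanged() {
        long now = dir.lastModified();
        if (now == lastSeen) {
            return false;
        }
        lastSeen = now;
        return true;
    }
}
```

A caller would fall back to the full directory listing only when
mayHaveChanged() returns true. Two commits landing within one timestamp
tick would be missed, so this could only ever be an optimization on top of
the existing check, not a replacement for it.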




Re: IndexReader.isCurrent in presence of many files

Posted by Yonik Seeley <yo...@apache.org>.
On 5/13/07, Nadav Har'El <ny...@math.technion.ac.il> wrote:
> > >However, isCurrent() may be called before every query.
> >
> > That's never going to be a high performance architecture.
>
> Why is that so?
> Potentially, isCurrent() could do a couple of disk accesses (which are usually
> cached by the operating system), which is much faster than running the
> actual query, which needs to read a lot more from disk (of course, that
> is often cached as well) and do a lot more processing.

I was thinking about the scenario of a moderate to high update rate that would
cause the IndexReader to be re-opened multiple times per second (presumably
that's why someone would want to call isCurrent()).

-Yonik



Re: IndexReader.isCurrent in presence of many files

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Fri, May 11, 2007, Yonik Seeley wrote about "Re: IndexReader.isCurrent in presence of many files":
> On 5/11/07, Doron Cohen <DO...@il.ibm.com> wrote:
> >However, isCurrent() may be called before every query.
> 
> That's never going to be a high performance architecture.

Why is that so?
Potentially, isCurrent() could do a couple of disk accesses (which are usually
cached by the operating system), which is much faster than running the
actual query, which needs to read a lot more from disk (of course, that
is often cached as well) and do a lot more processing.

If isCurrent() takes 50ms (and I'm just inventing a number here), one way
to look at it is to say that it limits the number of calls to 20 each
second, which rules out high performance. The other way to look at it is to
say that each query takes 200ms (say), so that the added 50ms is more-or-less
negligible.

Of course, the application could, and probably should, have its own mechanism
for specifying when the index needs to be reopened after a background process
modified it. But very often, just using isCurrent() is very convenient, and
works well.

So if isCurrent() can be kept reasonably quick (at least on local disks),
it would be great.
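
One way an application might bound the cost, sketched here as an
assumption rather than anything Lucene provides: run the expensive check
at most once per interval and reuse the last answer in between. The class
ThrottledCheck is hypothetical; the supplied check stands in for a call
such as reader.isCurrent(), which in practice throws IOException and would
need wrapping. The clock value is passed in so the behaviour is easy to
test.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical wrapper (not a Lucene class): invokes an expensive
 * "is the index current?" check at most once per minIntervalMillis.
 */
final class ThrottledCheck {
    private final BooleanSupplier check;
    private final long minIntervalMillis;
    private long lastCheckedAt;
    private boolean lastResult;
    private boolean checkedOnce;

    ThrottledCheck(BooleanSupplier check, long minIntervalMillis) {
        this.check = check;
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Returns the cached answer if the last real check happened less than
     * minIntervalMillis before nowMillis; otherwise runs the check again.
     */
    synchronized boolean isCurrent(long nowMillis) {
        if (!checkedOnce || nowMillis - lastCheckedAt >= minIntervalMillis) {
            lastResult = check.getAsBoolean();
            lastCheckedAt = nowMillis;
            checkedOnce = true;
        }
        return lastResult;
    }
}
```

With an interval of one second, queries would call
isCurrent(System.currentTimeMillis()) freely and the real check would run
at most once per second. The trade-off is that searches may see an index
that is stale by up to one interval.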

-- 
Nadav Har'El                        |       Sunday, May 13 2007, 25 Iyyar 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |How do you get holy water? Boil the hell
http://nadav.harel.org.il           |out of it.



Re: IndexReader.isCurrent in presence of many files

Posted by Yonik Seeley <yo...@apache.org>.
On 5/11/07, Doron Cohen <DO...@il.ibm.com> wrote:
> However, isCurrent() may be called before every query.

That's never going to be a high performance architecture.

However, is there a way to portably stat a directory? That could lead
to a fast-path if no new files were added.

-Yonik



Re: IndexReader.isCurrent in presence of many files

Posted by Doron Cohen <DO...@il.ibm.com>.
Chris Hostetter <ho...@fucit.org> wrote on 11/05/2007 17:10:54:

>
> : If this really turns out to be related to having many files in the index
> : dir, could we maintain the SEGMENTS_N files in a sub-directory?
>
> I haven't done much experimenting / performance testing of File
> operations in Java, but just from looking at the java1.4.2 javadocs it
> seems like it *might* be possible to make the FindSegmentsFile class
> faster if line 504 (of SegmentInfos.java r518529) used a FilenameFilter.
>
> I say might because:
>   1) i have no idea how Java uses the filter under the covers when dealing
>      with the filesystem .. worst case it would help keep the array size
>      down.
That's right - AFAIK Java's File.list(filter) just iterates over all the
files in the filesystem directory and adds those accepted by the filter to
its result. So the filter should not make the listing itself any faster.

>   2) it's not immediately clear to me if that code path is ever used ... i
>      suspect most clients are using the lucene Directory constructor and
>      not the java.io.File constructor ... Directory.list does not support
>      FilenameFilters. so we can't try the same thing on that code path
>      without making some other changes.

I'm aware that anything like this would call for an API addition to
o.a.l.s.Directory.

Also, Yonik has a point about backwards compatibility
vs. speed on NFS, though I don't know if this only shows on NFS.

However, isCurrent() may be called before every query.
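
For reference, a minimal sketch of the FilenameFilter variant discussed
above, using only java.io.File. The class SegmentsFileLister is
hypothetical and is not the code in SegmentInfos.java; the "segments" name
prefix is an assumption based on the segment* / SEGMENTS_N naming in this
thread. As noted, the JDK still enumerates every directory entry, so the
filter only keeps the returned array small.

```java
import java.io.File;
import java.io.FilenameFilter;

/** Hypothetical helper (not a Lucene class) for listing segments files. */
final class SegmentsFileLister {

    /** Accepts file names that start with the assumed "segments" prefix. */
    static final FilenameFilter SEGMENTS_FILTER = new FilenameFilter() {
        public boolean accept(File dir, String name) {
            return name.startsWith("segments");
        }
    };

    /**
     * Lists matching names in indexDir. File.list(filter) returns null if
     * the path is not a directory or an I/O error occurs; that case is
     * mapped to an empty array here.
     */
    static String[] listSegmentsFiles(File indexDir) {
        String[] names = indexDir.list(SEGMENTS_FILTER);
        return names == null ? new String[0] : names;
    }
}
```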




Re: IndexReader.isCurrent in presence of many files

Posted by Chris Hostetter <ho...@fucit.org>.
: If this really turns out to be related to having many files in the index
: dir, could we maintain the SEGMENTS_N files in a sub-directory?

I haven't done much experimenting / performance testing of File
operations in Java, but just from looking at the java1.4.2 javadocs it
seems like it *might* be possible to make the FindSegmentsFile class
faster if line 504 (of SegmentInfos.java r518529) used a FilenameFilter.

I say might because:
  1) i have no idea how Java uses the filter under the covers when dealing
     with the filesystem .. worst case it would help keep the array size
     down.
  2) it's not immediately clear to me if that code path is ever used ... i
     suspect most clients are using the lucene Directory constructor and
     not the java.io.File constructor ... Directory.list does not support
     FilenameFilters. so we can't try the same thing on that code path
     without making some other changes.




-Hoss

