Posted to dev@lucene.apache.org by Chris D <br...@gmail.com> on 2006/09/22 00:44:49 UTC

Distributed Indexes, Searches and HDFS

Hi List,

As a bit of an experiment, I'm redoing some of our indexing and searching
code to try to make it easier to manage and distribute. The system has to
modify its indexes frequently, sometimes in huge batches, and the documents
in the indexes are frequently changed (deleted, modified and re-added). For
a sense of scale, we want the system to be capable of searching a terabyte
or so of data.

Currently we have a bunch of index machines indexing to a local file system.
Every hour or so they merge into a group of indexes stored on NFS or a
similar common filesystem, and the search nodes retrieve the new indexes and
search those. The merge can take about as long as the original indexing did,
because the "contents" field isn't stored and has to be re-indexed.

After reading this thread:
http://www.gossamer-threads.com/lists/lucene/java-user/13803#13803
there were several good suggestions, but I'm curious: is there a generally
accepted best practice for distributing Lucene? The cronjob/link solution,
which is quite clean, doesn't work well in a Windows environment. While it's
my favorite, no dice... Rats.

So I decided to experiment with a couple different ideas, and I have some
questions.

1) Indexing and Searching Directly from HDFS

Indexing to HDFS is possible with a patch if we don't use CFS. While not
ideal performance-wise, it's reliable, takes care of data redundancy and
component failure, and means that I can have cheap small drives instead of a
large expensive NAS. It's also quite simple to implement (see Nutch's
indexer.FsDirectory for the Directory implementation).
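
To make that concrete, the sort of thing I mean looks roughly like the sketch
below. It's only a sketch: the FsDirectory constructor arguments are my
assumption (they have varied between Nutch versions) and the index path is
made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.nutch.indexer.FsDirectory;

public class HdfsIndexSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Wrap an HDFS directory as a Lucene Directory (create = true for a fresh index).
        FsDirectory dir = new FsDirectory(fs, new Path("/index/shard-00"), true, conf);

        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false); // CFS is the part the HDFS patch doesn't handle

        // ... writer.addDocument(...) calls go here ...

        writer.close();
    }
}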

So I would have several indexes (e.g. 16) and the same number of indexers,
plus a searcher for each index (possibly in the same process) that searches
each one directly from HDFS. One problem I'm having is an occasional
FileNotFound exception (probably locking related):

org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open
filename /index/_3.f0
        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:178)
        at sun.reflect.GeneratedMethodAccessor41.invoke (Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call (RPC.java:332)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:468)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:245)

It comes out of the Searcher when I try to do a search while things are
being indexed. I'd be interested to know what exactly is happening when this
exception is thrown; maybe I can design around it (do synchronization at
the appropriate times, or similar).
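
For completeness, the searching side I have in mind is one IndexSearcher per
shard combined through a MultiSearcher, roughly as sketched below (the shard
count and paths are illustrative, and it leans on the same FsDirectory
assumption as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.nutch.indexer.FsDirectory;

public class ShardedSearchSketch {
    public static MultiSearcher openSearchers(int shards) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Searchable[] searchers = new Searchable[shards];
        for (int i = 0; i < shards; i++) {
            // create = false: we only read existing shards
            FsDirectory dir = new FsDirectory(fs, new Path("/index/shard-" + i), false, conf);
            searchers[i] = new IndexSearcher(dir);
        }
        // MultiSearcher merges hits from all the shards into one result set.
        return new MultiSearcher(searchers);
    }
}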

2) Index Locally, Search in HDFS

I haven't implemented this, but I was thinking of something along the lines
of merging periodically and having the searchers refresh once the merge has
finished. I still have the problem of the merge taking a fairly long time,
and if a node fails before its documents are merged, we lose what was stored
locally in that index.
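
A sketch of the kind of merge I mean (it assumes the same FsDirectory wrapper
as above, and the paths are made up); Lucene's addIndexes(Directory[]) merges
existing segments rather than re-analyzing the documents:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.nutch.indexer.FsDirectory;

public class MergeToHdfsSketch {
    public static void merge(String[] localIndexPaths) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Writer on the shared (HDFS) index that the searchers read.
        FsDirectory shared = new FsDirectory(fs, new Path("/index/merged"), false, conf);
        IndexWriter writer = new IndexWriter(shared, new StandardAnalyzer(), false);
        writer.setUseCompoundFile(false);

        Directory[] locals = new Directory[localIndexPaths.length];
        for (int i = 0; i < locals.length; i++) {
            locals[i] = FSDirectory.getDirectory(localIndexPaths[i], false);
        }

        // Fold the local shards into the shared index by merging their segments.
        writer.addIndexes(locals);
        writer.close();
    }
}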

3) Index HDFS, Search Locally

The system indexes to HDFS, and the searchers ask the indexers to pause
while they retrieve the indexes from the store. The indexes are then
searched locally and the indexers continue trucking along. This, in my head,
seems to work all right, at least until the indexes get very large and
copying them becomes prohibitive. (Is there a Java rsync?) I'll have to
investigate how much of a performance hit indexing to the network actually
is. If anyone has any numbers I would be interested in seeing them.
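
The copy itself could just use Hadoop's FileSystem API rather than any rsync
machinery, although that is a full copy, not an incremental one. A rough
sketch (the paths are made up; a real version would copy into a fresh
directory and then swap searchers):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.search.IndexSearcher;

public class FetchShardSketch {
    public static IndexSearcher fetchAndOpen(String shard) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path remote = new Path("/index/" + shard);
        Path local = new Path("/data/search/" + shard);

        // Copies the whole shard out of HDFS; nothing incremental here.
        fs.copyToLocalFile(remote, local);

        // Search the local copy while the indexers keep writing to HDFS.
        return new IndexSearcher(local.toString());
    }
}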

4) Map/Reduce

I don't know a lot about this and haven't been able to find much on applying
map/reduce to Lucene indexing, apart from the Nutch source code, which is
rather difficult to sort through for an overview. So if anyone has a snippet
or a good overview I could look at, I would be grateful. Even just pointing
at the critical part of Nutch would be quite helpful.
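
From my (possibly wrong) skim of Nutch, the pattern seems to be reduce-side
indexing: each reducer builds a Lucene shard on local disk and copies the
finished shard out to the shared filesystem when it finishes. Below is a
rough sketch of that idea; it is written against a newer Hadoop reducer API
purely for readability, and the field names, key/value types and output
paths are illustrative assumptions, not Nutch's actual code.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexingReducer extends Reducer<Text, Text, Text, Text> {
    private IndexWriter writer;
    private File localIndex;

    @Override
    protected void setup(Context context) throws IOException {
        // Build the shard on fast local disk first.
        int shard = context.getTaskAttemptID().getTaskID().getId();
        localIndex = new File("index-" + shard);
        writer = new IndexWriter(localIndex, new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);
    }

    @Override
    protected void reduce(Text docId, Iterable<Text> contents, Context context)
            throws IOException {
        Document doc = new Document();
        doc.add(new Field("id", docId.toString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        for (Text chunk : contents) {
            doc.add(new Field("contents", chunk.toString(), Field.Store.NO, Field.Index.TOKENIZED));
        }
        writer.addDocument(doc);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        writer.optimize();
        writer.close();
        // Push the finished shard up to the shared filesystem (HDFS).
        FileSystem fs = FileSystem.get(context.getConfiguration());
        int shard = context.getTaskAttemptID().getTaskID().getId();
        fs.copyFromLocalFile(new Path(localIndex.getAbsolutePath()),
                             new Path("/indexes/shard-" + shard));
    }
}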

5) Anything else

I would appreciate any insight anyone has on distributing indexes, either on
list or off.

Many Thanks,
Chris

PS. Sorry if this got double posted; it didn't seem to get through the first time.

Re: Distributed Indexes, Searches and HDFS

Posted by Michael McCandless <lu...@mikemccandless.com>.
I think this is a great question ("what's the best way to really scale
up Lucene?").  I don't have a lot of experience in that area, so I'll
defer to others (and I'm eager to learn myself!).

I think understanding Solr's overall approach (whose design I believe
came out of the thread you've referenced) is also a good step here.
Even if you can't re-use the hard links trick, you might be able to
reuse its snapshotting & index distribution protocol.
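
For anyone who hasn't seen the trick: a snapshot is just a directory of hard
links to the current index files, so taking one is nearly free and the
snapshot stays intact even as the live index deletes or replaces files. A
rough illustration of the idea (it uses a newer Java file API than existed
at the time of this thread, purely to keep the example short; this is the
concept, not Solr's actual scripts):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SnapshotSketch {
    public static Path snapshot(Path indexDir) throws IOException {
        Path snapDir = indexDir.resolveSibling("snapshot." + System.currentTimeMillis());
        Files.createDirectory(snapDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path f : files) {
                // Hard link, not a copy: the snapshot keeps these files alive
                // even after the live index deletes or replaces them.
                Files.createLink(snapDir.resolve(f.getFileName()), f);
            }
        }
        return snapDir;
    }
}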

However, I have been working on some bottom-up improvements to
Lucene (getting native OS locking working and [separate but related]
"lock-less commits") that I think could be related to some of the
issues you're seeing with HDFS -- see below:

 > > The cronjob/link solution which is quite clean, doesn't work well in
 > > a windows environment. While it's my favorite, no dice... Rats.
 >
 > There may be hope yet for that on Windows.  Hard links work on
 > Windows, but the only problem is that you can't rename/delete any
 > links when the file is open. Michael McCandless is working on a
 > patch that would eliminate all renames (and deletes can be handled
 > by deferring them).

Right, with the "lock-less commits" patch we never rename a file and we
never re-use a file name (ie, making Lucene's use of the filesystem
"write once").

 > 1) Indexing and Searching Directly from HDFS
 >
 > Indexing to HDFS is possible with a patch if we don't use CFS. While
 > not ideal performance-wise, it's reliable and takes care of data
 > redundancy, component failure and means that I can have cheap small
 > drives instead of a large expensive NAS. It's also quite simple to
 > implement (see Nutch's indexer.FsDirectory for the Directory
 > implementation)

This is very interesting!  I don't know enough about HDFS (yet!).  On a
very quick read, I like that it's a "write once" filesystem, because
that's a good match for lock-less commits.

 > So I would have several indexes (ie 16) and the same number of
 > indexers, and a searcher for each index (possibly in the same
 > process) that searches each one directly from HDFS. One problem I'm
 > having is an occasional filenotfound exception. (Probably locking
 > related)
 >
 > It comes out of the Searcher when I try to do a search while things
 > are being indexed. I'd be interested to know what exactly is
 > happening when this exception is thrown, maybe I can design around
 > it. (Do synchronization at the appropriate times or similar)

That exception looks disturbingly similar to the ones Lucene hits on
NFS.  See here for gory details:

     http://issues.apache.org/jira/browse/LUCENE-673

The summary of that [long] issue is that these exceptions seem to be
due to cache staleness of Lucene's "segments" file (because of how the NFS
client does caching, even with an NFS V4 client and server) and not in fact
due to locking (as had previously been assumed/expected).  The good news is
that the lock-less commits fixes resolve this, at least in my testing so far
(ie, they make it possible to share a single index over NFS).

I wonder if a similar cause is at work in HDFS?  HDFS is "write once",
but the current Lucene isn't (not until we can get lock-less commits
in).  For example, it re-uses the "segments" file.

I think that even if lock-less commits ("write once") enable sharing a
single copy of the index over remote filesystems like HDFS, NFS, or
SMB/CIFS, whether that's performant enough would still be a big open
question, versus replicating copies to local filesystems that are presumably
quite a bit faster at IO, at the expense of the local storage consumed.

Mike



Re: Distributed Indexes, Searches and HDFS

Posted by Yonik Seeley <yo...@apache.org>.
On 9/21/06, Chris D <br...@gmail.com> wrote:
> The cronjob/link solution which is
> quite clean, doesn't work well in a windows environment. While it's my
> favorite, no dice... Rats.

There may be hope yet for that on Windows.
Hard links work on Windows, but the only problem is that you can't
rename/delete any links when the file is open.  Michael McCandless is
working on a patch that would eliminate all renames (and deletes can
be handled by deferring them).

http://www.nabble.com/Re%3A--Solr-Wiki--Update-of-%22TaskList%22-by-YonikSeeley-tf2081816.html#a5736265
http://www.nabble.com/-jira--Created%3A-%28LUCENE-665%29-temporary-file-access-denied-on-Windows-tf2167540.html#a6295771


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
