Posted to solr-user@lucene.apache.org by Jed Glazner <jg...@adobe.com> on 2013/03/21 19:25:21 UTC
Writing new indexes from index readers slow!
Hey Hey Everybody!
I'm not sure if I should have posted this to the developers list... if
I'm totally barking up the wrong tree here, please let me know!
Anywho, I've developed a command line utility based on the
MultiPassIndexSplitter class from the Lucene library, but I'm finding
that on our large index (350GB), it's taking WAY too long to write the
newly split indexes! It took 20.5 hours for execution to finish. I
should note that Solr is not running while I'm splitting the index.
Because Solr can't be running while I run this tool, performance is
critical: our service is down the entire time!
I am aware that there is an API currently under development on trunk in
SolrCloud (https://issues.apache.org/jira/browse/SOLR-3755), but I need
something now, as our large index is wreaking havoc on our service.
Here is some basic context info:
The Index:
==============
Solr/Lucene 4.1
Index Size: 350GB
Documents: 185,194,528
The Hardware (http://aws.amazon.com/ec2/instance-types/):
===============
AWS High-Memory Extra Large (m2.xlarge) instance
CPU: 2 virtual cores with 3.25 EC2 Compute Units each
17.1 GB RAM
1.2 TB EBS RAID
The Process (splitting 1 index into 8):
===============
I'm trying to split this index into 8 separate indexes using this tool.
To do this I create 8 worker threads. Each thread gets a new
FakeDeleteIndexReader object, loops over every document, and uses a
hash algorithm to decide whether to keep or delete the document. Note
that the documents are not actually deleted at this point because (as I
understand it) the FakeDeleteIndexReader emulates deletes without
actually modifying the underlying index.
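For context, the hash routing I'm describing can be sketched in plain Java. This is only an illustrative helper, not the actual tool's code; the class name, the use of String unique keys, and hashCode() as the hash function are all my assumptions here:

```java
// Sketch of hash-based shard routing: a document belongs to shard N when
// hash(uniqueKey) mod numShards == N. Illustrative only.
public class ShardRouter {

    // Returns the shard index (0..numShards-1) that should keep this document.
    static int targetShard(String uniqueKey, int numShards) {
        // Double-mod keeps the result non-negative even when hashCode() is negative.
        return ((uniqueKey.hashCode() % numShards) + numShards) % numShards;
    }

    public static void main(String[] args) {
        System.out.println(targetShard("doc-42", 8));
    }
}
```

Each worker thread would then "fake delete" every document whose target shard is not its own.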
After each worker has determined which documents it should keep, I
create a new Directory object, instantiate a new IndexWriter, and pass
the FakeDeleteIndexReader object to the addIndexes method. (This is the
part that takes forever!)
It only takes about an hour for all of the threads to hash/delete the
documents they don't want. However, it takes 19+ hours to write all of
the new indexes! Watching iowait, the disk doesn't look to be
overworked (about 85% idle), so I'm baffled as to why it would take
that long! I've tried running the write operations inside the worker
threads, and serially, with no real difference!
Here is the relevant code that I'm using to write the indexes:
/**
 * Creates/merges a new index with a FakeDeleteIndexReader. The reader
 * should have marked/deleted all of the documents that should not be
 * included in this new index. When the index is written/committed
 * these documents will be removed.
 *
 * @param directory the directory object of the new index
 * @param version   the Lucene version of the index
 * @param reader    a FakeDeleteIndexReader that contains lots of
 *                  uncommitted deletes
 * @throws IOException
 */
private void writeToDisk(Directory directory, Version version,
        FakeDeleteIndexReader reader) throws IOException
{
    IndexWriterConfig cfg = new IndexWriterConfig(version,
            new WhitespaceAnalyzer(version));
    cfg.setOpenMode(OpenMode.CREATE);

    IndexWriter w = new IndexWriter(directory, cfg);
    w.addIndexes(reader);
    w.commit();
    w.close();
    reader.close();
}
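One thing I've been wondering about: addIndexes(IndexReader...) merges through the writer rather than copying segment files, so writer configuration may matter more than raw disk speed here. A configuration sketch of the knobs I'm considering (the values are guesses, not tested settings, and I'm assuming the Lucene 4.1 API names are right):

```java
IndexWriterConfig cfg = new IndexWriterConfig(version,
        new WhitespaceAnalyzer(version));
cfg.setOpenMode(OpenMode.CREATE);

// Larger RAM buffer -> fewer, larger flushes (the default is 16 MB).
cfg.setRAMBufferSizeMB(512.0);

// Disable compound-file creation so each segment isn't rewritten a
// second time into a .cfs file during the merge.
TieredMergePolicy mp = new TieredMergePolicy();
mp.setNoCFSRatio(0.0);
cfg.setMergePolicy(mp);

IndexWriter w = new IndexWriter(directory, cfg);
```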
Any ideas? I'm happy to share more snippets of source code if that
would be helpful.
--
Jed Glazner
Sr. Software Engineer
Adobe Social
385.221.1072 (tel)
801.360.0181 (cell)
jglazner@adobe.com
550 East Timpanogus Circle
Orem, UT 84097-6215, USA
www.adobe.com
Re: Writing new indexes from index readers slow!
Posted by Otis Gospodnetic <ot...@gmail.com>.
Jed,
While this is something completely different, have you considered using
SolrEntityProcessor instead? (assuming all your fields are stored)
http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
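For reference, a minimal data-config.xml using SolrEntityProcessor might look like the fragment below. The URL, query, and rows values are placeholders, not a recommendation; see the wiki page above for the full parameter list:

```xml
<dataConfig>
  <document>
    <!-- Reads stored fields from a running Solr instance and
         re-indexes them into this one. -->
    <entity name="source"
            processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/collection1"
            query="*:*"
            rows="1000"/>
  </document>
</dataConfig>
```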
Otis
--
Solr & ElasticSearch Support
http://sematext.com/
On Thu, Mar 21, 2013 at 2:25 PM, Jed Glazner <jg...@adobe.com> wrote:
> [...]