Posted to solr-user@lucene.apache.org by Britske <gb...@gmail.com> on 2009/11/02 22:40:04 UTC
manually creating indices to speed up indexing with app-knowledge
This may seem like a strange question, but here goes anyway.
I'm considering low-level construction of indices for about
20.000 indexed fields (type sInt), if at all possible. (By indices in this
context I mean the inverted indices from term to document id, just to be 100%
complete.)
These indices have to be recreated each night, along with the normal
reindex.
Globally, it should go something like this (each night):
- Documents (consisting of about 20 stored fields and about 10 stored &
indexed fields) are indexed through the normal code path (SolrJ in my
case).
- After all docs are persisted (max 200.000), I want to extract the mapping
from 'lucene docid' --> 'stored/indexed product key'.
I believe this should work because, after all docs are persisted, the
internal docids aren't altered, so the relationship 'lucene docid'
--> 'stored/indexed product key' is invariant from that point forward.
(Please correct me if I'm wrong.)
- Construct the 20.000 inverted indices at a low enough level that I do
not have to go through IndexWriter, if possible. That way I would not need to
construct Documents; I would only need to construct the native format of the
indices themselves. Ideally this should work on multiple servers, so that the
indices can be created in parallel and the index files later simply copied
to the index directory of the master.
Basically, what it boils down to is that indexing time (a reindex has to be
done each night) is a big show-stopper at the moment, even though we've tried
and tested all the more standard optimization tricks & techniques, as well
as having built a home-grown shard-like indexing strategy which uses 20
pretty big servers in parallel. The 20.000 indexed fields are still simply
killing us.
At the same time, the app has a lot of knowledge of the 20.000 indices:
- All indices consist of prices (ints) between 0 and 10.000.
- Most important: as part of the document-construction process, the
ordering of each of the 20.000 indices is known for all documents that are
processed by the document-construction server in question. (This part is
needed, and is already performing at light speed.)
For the sake of argument, say we have 5 document-construction servers. Each
server processes 40.000 documents. Each server has 20.000 ordered indices in
its own format readily available for the 40.000 documents it's processing.
Something like: LinkedHashMap<Integer,Set<Integer>> -->
<price,{productids}>
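A minimal sketch of what such a per-server ordered sub-index could look like, using a TreeMap rather than a LinkedHashMap so that ascending price order is maintained automatically (the class and method names are illustrative, not from the actual app):

```java
import java.util.Collections;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

public class OrderedSubIndex {
    // price -> product ids at that price; TreeMap keeps keys in ascending price order
    private final TreeMap<Integer, Set<Integer>> byPrice = new TreeMap<Integer, Set<Integer>>();

    public void add(int price, int productId) {
        Set<Integer> ids = byPrice.get(price);
        if (ids == null) {
            ids = new TreeSet<Integer>();
            byPrice.put(price, ids);
        }
        ids.add(productId);
    }

    // iterating this map yields prices in ascending order, which is the
    // precondition for the mergesort step described further below
    public SortedMap<Integer, Set<Integer>> entries() {
        return Collections.unmodifiableSortedMap(byPrice);
    }
}
```

A LinkedHashMap would work just as well if the construction code already emits entries in price order; the TreeMap simply makes the ordering guarantee explicit.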
Say we have 20 indexing servers. Each server has to calculate 1.000 indices
(totalling the 20.000).
The 5 doc-construction servers distribute the ordered sub-indices to
the correct indexing servers.
Each indexing server then constructs an index from 5 ordered sub-indices
coming from the 5 different construction servers. This can be done
efficiently using a mergesort (since the sub-indices are already sorted).
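The mergesort step above can be sketched as a k-way merge over the already-sorted sub-indices, here modeled as lists of (price, productId) postings; this is a hypothetical stand-alone illustration, not code from the actual system:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SubIndexMerger {

    // one (price, productId) posting from a sub-index
    public static final class Posting {
        public final int price;
        public final int productId;
        public Posting(int price, int productId) {
            this.price = price;
            this.productId = productId;
        }
    }

    // k-way merge of sub-indices that are already sorted by price:
    // a min-heap of {streamIndex, position} cursors repeatedly yields
    // the stream whose current head has the lowest price
    public static List<Posting> merge(final List<List<Posting>> sortedSubIndices) {
        PriorityQueue<int[]> heap = new PriorityQueue<int[]>(
            (x, y) -> Integer.compare(
                sortedSubIndices.get(x[0]).get(x[1]).price,
                sortedSubIndices.get(y[0]).get(y[1]).price));
        for (int s = 0; s < sortedSubIndices.size(); s++) {
            if (!sortedSubIndices.get(s).isEmpty()) heap.add(new int[]{s, 0});
        }
        List<Posting> out = new ArrayList<Posting>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();   // {streamIndex, positionInStream}
            List<Posting> stream = sortedSubIndices.get(head[0]);
            out.add(stream.get(head[1]));
            if (head[1] + 1 < stream.size()) heap.add(new int[]{head[0], head[1] + 1});
        }
        return out;
    }
}
```

With 5 sub-indices per field this runs in O(n log 5) per field, i.e. effectively linear in the number of postings.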
All that is missing (oversimplifying here) is going from the ordered
indices in application format to the index format of Lucene (substituting
the productids with the Lucene docids along the way) and streaming it to disk.
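The substitution step could look roughly like this, assuming the docid mapping extracted after the normal indexing pass has been inverted into a productId --> Lucene docid map (again a hypothetical sketch; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

public class DocIdSubstitution {
    // replace each application-level product id with its Lucene docid,
    // using the product-key -> docid mapping extracted after the normal indexing pass
    public static SortedMap<Integer, List<Integer>> toDocIds(
            SortedMap<Integer, Set<Integer>> priceToProductIds,
            Map<Integer, Integer> productIdToDocId) {
        TreeMap<Integer, List<Integer>> out = new TreeMap<Integer, List<Integer>>();
        for (Map.Entry<Integer, Set<Integer>> e : priceToProductIds.entrySet()) {
            List<Integer> docIds = new ArrayList<Integer>();
            for (int productId : e.getValue()) {
                Integer docId = productIdToDocId.get(productId);
                if (docId == null) {
                    throw new IllegalStateException("no docid for product " + productId);
                }
                docIds.add(docId);
            }
            // postings lists are stored in ascending docid order
            Collections.sort(docIds);
            out.put(e.getKey(), docIds);
        }
        return out;
    }
}
```

Writing the result out in Lucene's native on-disk format is the part this sketch deliberately leaves open, since that is exactly the question posed here.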
I believe this would quite possibly give a really big indexing improvement.
Is my thinking correct in the steps involved?
Do you believe that this would indeed give a big speedup for this specific
situation?
Where would I hook into the Solr/Lucene code to construct the native format?
Thanks in advance (and for making it to here)
Geert-Jan
--
View this message in context: http://old.nabble.com/manually-creating-indices-to-speed-up-indexing-with-app-knowledge-tp26157851p26157851.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: manually creating indices to speed up indexing with app-knowledge
Posted by Britske <gb...@gmail.com>.
Thanks Otis,
The fileformat-info seems almost 100% accurate. The different Writer-classes
indeed seem the way to go.
I'll post to lucene-user for follow-ups if/when needed.
Geert-Jan
--
View this message in context: http://old.nabble.com/manually-creating-indices-to-speed-up-indexing-with-app-knowledge-tp26157851p26230260.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: manually creating indices to speed up indexing with app-knowledge
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Britske,
The place to ask is on java-user@lucene if you want to go low-level. Look at IndexWriter and even DocumentsWriter classes.
I'm not sure how up to date it is, but look at http://lucene.apache.org/java/2_9_0/fileformats.html
You should also try streaming your data directly into Solr; it's the fastest way to index. Info is on the Wiki.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR