Posted to java-user@lucene.apache.org by Denis Bazhenov <do...@gmail.com> on 2017/03/30 07:45:50 UTC

Document serializable representation

Hi.

We have an in-house distributed Lucene setup: 40 dual-socket servers with approximately 700 cores, divided into 7 partitions. Those machines do index search only. Indexes are prepared on several isolated machines (so-called Index Masters) and distributed over the cluster with plain rsync.

The search speed is great, but we need more indexing throughput. The Index Masters have become CPU-bound lately. The reason is that we use a fairly complicated analysis pipeline based on a morphological dictionary (rather than stemming) plus some NER elements. Right now indexing throughput is about 1-1.5K documents per second. With a corpus of 140 million documents, a full reindex takes about a day. We want better: our current target is more than 10K documents per second. It seems Lucene itself can handle this requirement; it's just that our comparatively slow analysis pipeline can't.

So we have a Plan.

The idea is to move the analysis step off the Index Masters onto dedicated boxes, where it can be scaled easily since it is stateless. The problem we are facing is that Lucene currently has no serializable Document representation that can be used for communication over the network.

We are planning to implement this kind of representation. The question is: are there any pitfalls or problems we'd better know about before starting? :)

Denis.


Re: Document serializable representation

Posted by Denis Bazhenov <do...@gmail.com>.
Hi.

Thanks for the reply. Of course, each document goes into exactly one shard.

> On Mar 31, 2017, at 15:01, Erick Erickson <er...@gmail.com> wrote:
> 
> I don't believe addIndexes does much except rewrite the
> segments file (i.e. the file that tells Lucene what
> the current segments are).
> 
> That said, if you're desperate you can optimize/force-merge.
> 
> Do note, though, that no deduplication is done. So if the
> indexes you're merging have docs with the same
> <uniqueKey> you'll have duplicate documents. That's not
> a problem if you merge shards properly. "Properly" here means
> that the hash ranges of the merged shards exactly span the
> ranges of the merged segments.
> 
> And if you're merging them all down to one segment the ranges
> don't matter.
> 
> Best,
> Erick
> 
> On Thu, Mar 30, 2017 at 6:09 PM, Denis Bazhenov <do...@gmail.com> wrote:
>> Yeah, I definitely will look into PreAnalyzedField as you and Mikhail suggest.
>> 
>> Thank you.
>> 
>>> On Mar 30, 2017, at 19:15, Uwe Schindler <uw...@thetaphi.de> wrote:
>>> 
>>> But that's hard to implement. I'd go for Solr instead of doing that on your own!
>> 
>> ---
>> Denis Bazhenov <do...@gmail.com>
>> 
>> 
>> 
>> 
>> 
> 
> 

---
Denis Bazhenov <do...@gmail.com>








Re: Document serializable representation

Posted by Erick Erickson <er...@gmail.com>.
I don't believe addIndexes does much except rewrite the
segments file (i.e. the file that tells Lucene what
the current segments are).

That said, if you're desperate you can optimize/force-merge.

Do note, though, that no deduplication is done. So if the
indexes you're merging have docs with the same
<uniqueKey> you'll have duplicate documents. That's not
a problem if you merge shards properly. "Properly" here means
that the hash ranges of the merged shards exactly span the
ranges of the merged segments.

And if you're merging them all down to one segment the ranges
don't matter.
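
As a rough sketch (the paths, the analyzer choice and the two incoming parts below are just placeholders, not from the thread), that merge step could look like this in plain Lucene:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeParts {
      public static void main(String[] args) throws Exception {
        try (Directory target = FSDirectory.open(Paths.get("/data/index/shard-0"));
             Directory partA  = FSDirectory.open(Paths.get("/data/incoming/part-a"));
             Directory partB  = FSDirectory.open(Paths.get("/data/incoming/part-b"));
             IndexWriter writer = new IndexWriter(target,
                 new IndexWriterConfig(new StandardAnalyzer()))) {
          // Copies the incoming segments into the target index; nothing is
          // re-analyzed and nothing is deduplicated, so the parts must not
          // contain documents with the same uniqueKey.
          writer.addIndexes(partA, partB);
          // Optional and expensive: collapse everything into one segment.
          writer.forceMerge(1);
          writer.commit();
        }
      }
    }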

Best,
Erick

On Thu, Mar 30, 2017 at 6:09 PM, Denis Bazhenov <do...@gmail.com> wrote:
> Yeah, I definitely will look into PreAnalyzedField as you and Mikhail suggest.
>
> Thank you.
>
>> On Mar 30, 2017, at 19:15, Uwe Schindler <uw...@thetaphi.de> wrote:
>>
>> But that's hard to implement. I'd go for Solr instead of doing that on your own!
>
> ---
> Denis Bazhenov <do...@gmail.com>
>
>
>
>
>



Re: Document serializable representation

Posted by Denis Bazhenov <do...@gmail.com>.
Yeah, I definitely will look into PreAnalyzedField as you and Mikhail suggest.

Thank you.

> On Mar 30, 2017, at 19:15, Uwe Schindler <uw...@thetaphi.de> wrote:
> 
> But that's hard to implement. I'd go for Solr instead of doing that on your own!

---
Denis Bazhenov <do...@gmail.com>






RE: Document serializable representation

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

there is no easy way to do this with Lucene. The analysis part is tightly bound to IndexWriter. There are ways to decouple this, but you have to write your own Analyzer and some network protocol.

Solr has something like this, called PreAnalyzedField: a field type with a special analyzer behind it that does not analyze text in the conventional way, but instead treats the indexed content as JSON, with all the tokens and their attributes represented as a JSON array. On the indexing node, the IndexWriter just uses this JSON analyzer and creates the tokens to be indexed from it. On the other side, you have several machines that parse and analyze your documents; but instead of creating Lucene documents, they create JSON objects containing all the analyzed tokens (those analyzed tokens carry token text, position and offset information, NLP stuff, keyword markers - all the attributes a normal Lucene TokenStream would have). Those JSON objects are transferred over the network, and the IndexWriter parses them using the "special analyzer".
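
To make the producer side concrete, here is a rough sketch in plain Lucene (the JSON key names and class name below are made up for illustration; Solr's PreAnalyzedField defines its own exact format): the analysis box runs the TokenStream once and serializes the per-token attributes instead of indexing them.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public class TokenExporter {
      // Runs on the analysis boxes: analyze once, emit one JSON-like record per
      // token. No JSON escaping here - this is only a sketch of the shape.
      static String exportTokens(Analyzer analyzer, String field, String text) throws Exception {
        StringBuilder out = new StringBuilder("[");
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
          PositionIncrementAttribute inc = ts.addAttribute(PositionIncrementAttribute.class);
          ts.reset();
          boolean first = true;
          while (ts.incrementToken()) {
            if (!first) out.append(',');
            first = false;
            out.append(String.format("{\"t\":\"%s\",\"s\":%d,\"e\":%d,\"i\":%d}",
                term.toString(), off.startOffset(), off.endOffset(),
                inc.getPositionIncrement()));
          }
          ts.end();
        }
        return out.append(']').toString();
      }
    }

The receiving side then needs the inverse: an analyzer that parses these records back into a token stream for IndexWriter - which is exactly the part PreAnalyzedField already ships.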

But that's hard to implement. I'd go for Solr instead of doing that on your own! 😊

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Denis Bazhenov [mailto:dotsid@gmail.com]
> Sent: Thursday, March 30, 2017 11:02 AM
> To: java-user@lucene.apache.org
> Subject: Re: Document serializable representation
> 
> We have already done this, many years ago :)
> 
> At the moment we have 7 shards. The problem with adding more shards is that
> search becomes less cost-effective (in terms of cluster CPU time per
> request) as you split the index into more shards. Considering that response
> time is good enough and that the search nodes account for ~90% of the
> cluster's hardware budget, it is much more cost-effective to split analysis
> off from IndexWriter than to split the index into more shards. The latter
> would simply require us to put disproportionately more hardware into the
> cluster.
> 
> > On Mar 30, 2017, at 18:36, Uwe Schindler <uw...@thetaphi.de> wrote:
> >
> > What you would be better off doing is splitting your index into multiple
> > shards and having separate IndexWriter instances on different machines;
> > those can act on their own. This is what Elasticsearch and Solr do: they
> > accept the document, decide which shard it should be located on, and
> > transfer the plain fieldname:value pairs over the network. Each node then
> > creates Lucene IndexableDocuments out of them and passes them to its own
> > IndexWriter.
> 
> ---
> Denis Bazhenov <do...@gmail.com>
> 
> 
> 
> 





Re: Document serializable representation

Posted by Denis Bazhenov <do...@gmail.com>.
Interesting. In the case of addIndexes(), does Lucene perform any optimization on the segments before searching, or are those indexes searched "as is"?

> On Mar 30, 2017, at 19:09, Mikhail Khludnev <mk...@apache.org> wrote:
> 
> I believe you can have more shards for indexing and then merge them (not
> literally, but just via addIndexes() or so) down to a smaller number for
> search. Transferring indices (scp -C) is more efficient than sending
> separate tokens and their attributes over the wire.
> 
> On Thu, Mar 30, 2017 at 12:02 PM, Denis Bazhenov <do...@gmail.com> wrote:
> 
>> We have already done this, many years ago :)
>> 
>> At the moment we have 7 shards. The problem with adding more shards is
>> that search becomes less cost-effective (in terms of cluster CPU time per
>> request) as you split the index into more shards. Considering that
>> response time is good enough and that the search nodes account for ~90% of
>> the cluster's hardware budget, it is much more cost-effective to split
>> analysis off from IndexWriter than to split the index into more shards.
>> The latter would simply require us to put disproportionately more hardware
>> into the cluster.
>> 
>>> On Mar 30, 2017, at 18:36, Uwe Schindler <uw...@thetaphi.de> wrote:
>>> 
>>> What you would be better off doing is splitting your index into multiple
>>> shards and having separate IndexWriter instances on different machines;
>>> those can act on their own. This is what Elasticsearch and Solr do: they
>>> accept the document, decide which shard it should be located on, and
>>> transfer the plain fieldname:value pairs over the network. Each node then
>>> creates Lucene IndexableDocuments out of them and passes them to its own
>>> IndexWriter.
>> 
>> ---
>> Denis Bazhenov <do...@gmail.com>
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev

---
Denis Bazhenov <do...@gmail.com>








Re: Document serializable representation

Posted by Mikhail Khludnev <mk...@apache.org>.
I believe you can have more shards for indexing and then merge them (not
literally, but just via addIndexes() or so) down to a smaller number for
search. Transferring indices (scp -C) is more efficient than sending
separate tokens and their attributes over the wire.

On Thu, Mar 30, 2017 at 12:02 PM, Denis Bazhenov <do...@gmail.com> wrote:

> We have already done this, many years ago :)
>
> At the moment we have 7 shards. The problem with adding more shards is that
> search becomes less cost-effective (in terms of cluster CPU time per
> request) as you split the index into more shards. Considering that response
> time is good enough and that the search nodes account for ~90% of the
> cluster's hardware budget, it is much more cost-effective to split analysis
> off from IndexWriter than to split the index into more shards. The latter
> would simply require us to put disproportionately more hardware into the
> cluster.
>
> > On Mar 30, 2017, at 18:36, Uwe Schindler <uw...@thetaphi.de> wrote:
> >
> > What you would be better off doing is splitting your index into multiple
> > shards and having separate IndexWriter instances on different machines;
> > those can act on their own. This is what Elasticsearch and Solr do: they
> > accept the document, decide which shard it should be located on, and
> > transfer the plain fieldname:value pairs over the network. Each node then
> > creates Lucene IndexableDocuments out of them and passes them to its own
> > IndexWriter.
>
> ---
> Denis Bazhenov <do...@gmail.com>
>
>
>
>
>
>


-- 
Sincerely yours
Mikhail Khludnev

Re: Document serializable representation

Posted by Denis Bazhenov <do...@gmail.com>.
We have already done this, many years ago :)

At the moment we have 7 shards. The problem with adding more shards is that search becomes less cost-effective (in terms of cluster CPU time per request) as you split the index into more shards. Considering that response time is good enough and that the search nodes account for ~90% of the cluster's hardware budget, it is much more cost-effective to split analysis off from IndexWriter than to split the index into more shards. The latter would simply require us to put disproportionately more hardware into the cluster.

> On Mar 30, 2017, at 18:36, Uwe Schindler <uw...@thetaphi.de> wrote:
> 
> What you would be better off doing is splitting your index into multiple shards and having separate IndexWriter instances on different machines; those can act on their own. This is what Elasticsearch and Solr do: they accept the document, decide which shard it should be located on, and transfer the plain fieldname:value pairs over the network. Each node then creates Lucene IndexableDocuments out of them and passes them to its own IndexWriter.

---
Denis Bazhenov <do...@gmail.com>






RE: Document serializable representation

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

the document does not contain the analyzed tokens. The Lucene Analyzers are called inside the IndexWriter *during* indexing, so there is no way to run them somewhere else. Lucene's IndexableDocument instances are just iterables of IndexableField that contain the unparsed fulltext as passed to their constructors. You don't even need to transfer whole documents; a bunch of IndexableField instances per document is perfectly fine to represent a Lucene document. If the field types are already known to the indexer, it would be enough to transfer key-value pairs over the network.
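
As a rough sketch of that last point (the class name, field names and field types below are only illustrative), the indexing node could rebuild a document from plain key-value pairs like this:

    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    public class RemoteDocIndexer {
      // Rebuild a Lucene document from the plain fieldname:value pairs received
      // over the wire. Analysis still happens inside IndexWriter on this node.
      static void index(IndexWriter writer, String id, Map<String, String> fields) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES)); // hypothetical unique key field
        for (Map.Entry<String, String> e : fields.entrySet()) {
          doc.add(new TextField(e.getKey(), e.getValue(), Field.Store.NO)); // analyzed at addDocument time
        }
        writer.addDocument(doc);
      }
    }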

But, as said before, that would not help you do the Lucene Analyzer stuff on another machine: analysis is done inside IndexWriter.

What you would be better off doing is splitting your index into multiple shards and having separate IndexWriter instances on different machines; those can act on their own. This is what Elasticsearch and Solr do: they accept the document, decide which shard it should be located on, and transfer the plain fieldname:value pairs over the network. Each node then creates Lucene IndexableDocuments out of them and passes them to its own IndexWriter.
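
The routing part is just a stable hash over the unique key, so that each document always lands on exactly one shard; for example (the modulo scheme below is only an assumption, any stable hash works):

    public class ShardRouter {
      // Map a document's unique key to one of numShards shards. The same key
      // must always resolve to the same shard.
      static int shardFor(String uniqueKey, int numShards) {
        return Math.floorMod(uniqueKey.hashCode(), numShards);
      }
    }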

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Denis Bazhenov [mailto:dotsid@gmail.com]
> Sent: Thursday, March 30, 2017 9:46 AM
> To: java-user@lucene.apache.org
> Subject: Document serializable representation
> 
> Hi.
> 
> We have an in-house distributed Lucene setup: 40 dual-socket servers with
> approximately 700 cores, divided into 7 partitions. Those machines do
> index search only. Indexes are prepared on several isolated machines
> (so-called Index Masters) and distributed over the cluster with plain rsync.
> 
> The search speed is great, but we need more indexing throughput. The Index
> Masters have become CPU-bound lately. The reason is that we use a fairly
> complicated analysis pipeline based on a morphological dictionary (rather
> than stemming) plus some NER elements. Right now indexing throughput is
> about 1-1.5K documents per second. With a corpus of 140 million documents,
> a full reindex takes about a day. We want better: our current target is
> more than 10K documents per second. It seems Lucene itself can handle this
> requirement; it's just that our comparatively slow analysis pipeline can't.
> 
> So we have a Plan.
> 
> The idea is to move the analysis step off the Index Masters onto dedicated
> boxes, where it can be scaled easily since it is stateless. The problem we
> are facing is that Lucene currently has no serializable Document
> representation that can be used for communication over the network.
> 
> We are planning to implement this kind of representation. The question is:
> are there any pitfalls or problems we'd better know about before starting? :)
> 
> Denis.

