Posted to solr-user@lucene.apache.org by Simone Gianni <si...@apache.org> on 2012/11/19 15:22:26 UTC

SolrCloud and external file fields

Hi all,
I'm planning to move a quite big Solr index to SolrCloud. However, in this
index, an external file field is used for popularity ranking.

Does SolrCloud support external file fields? How does it cope with
sharding and replication? Where should the external file be placed now that
the index folder is not local but in the cloud?

Otherwise, are there other best practices for the use cases that
external file fields served, such as popularity ranking, in SolrCloud?
Custom ValueSources backed by something external?

Thanks in advance,
Simone

Re: SolrCloud and external file fields

Posted by Simone Gianni <si...@apache.org>.

Hi Gopal,
the post you linked is interesting. It takes a different approach than mine:
it implements a codec for Lucene, so it works at a lower level than my
solution, which works at the Solr UpdateHandler level, before the document
reaches Lucene.

The Lucene codec approach should offer a few advantages: the field is
exposed "normally" in the document, and as such is carried along by SolrCloud
when creating new replicas (which is the part I'm not yet sure my solution
handles correctly). On the other hand, it limits flexibility somewhat; I'm
already planning at least atomic addition to support popularity ranking.

My post on lucene-dev has received no feedback so far. I'll keep working on
it, but I'm still far from a submittable patch, and help from the dev
community would be greatly appreciated.

Simone

2012/11/24 Gopal Patwa <go...@gmail.com>

> Hi, I am also very much interested in this, since we use Solr 4 with NRT,
> where we update the index every second, but most of the time an update
> touches only a stored field.
> If Solr/Lucene could provide an external datastore that needs no
> re-indexing, even for stored fields only, it would be very beneficial for
> the frequent-update use case: cache invalidation would not happen for
> stored-field updates, and indexing performance would improve due to the
> smaller index size.
>
> Here is a link to similar work.
>
>
> http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/
>
>
> On Fri, Nov 23, 2012 at 11:42 AM, Simone Gianni <si...@apache.org>
> wrote:
>
> > Posted,
> > see it here
> >
> >
> > http://lucene.472066.n3.nabble.com/Possible-sharded-and-replicated-replacement-for-ExternalFileFields-in-SolrCloud-td4022108.html
> >
> > Simone
> >
> >
> > 2012/11/23 Simone Gianni <si...@apache.org>
> >
> > > 2012/11/22 Martin Koch <ma...@issuu.com>
> > >
> > >> IMO it would be ideal if the lucene/solr community could come up with a
> > >> good way of updating fields in a document without reindexing. This could be
> > >> by linking to some external data store, or in the lucene/solr internals. If
> > >> it would make things easier, a good first step would be to have dynamically
> > >> updateable numerical fields only.
> > >>
> > >
> > > Hi Martin,
> > > I'm working on implementing exactly this, and I have a working prototype
> > > right now. I'm going to write to lucene-dev about the details and ask for
> > > advice there. I'll contribute the code, so anyone interested can follow up on
> > > dev.
> > >
> > > Simone
> > >
> > >
> >
>

Re: SolrCloud and external file fields

Posted by Gopal Patwa <go...@gmail.com>.

Hi, I am also very much interested in this, since we use Solr 4 with NRT,
where we update the index every second, but most of the time an update
touches only a stored field.
If Solr/Lucene could provide an external datastore that needs no
re-indexing, even for stored fields only, it would be very beneficial for
the frequent-update use case: cache invalidation would not happen for
stored-field updates, and indexing performance would improve due to the
smaller index size.

Here is a link to similar work.

http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/

On Fri, Nov 23, 2012 at 11:42 AM, Simone Gianni <si...@apache.org> wrote:

> Posted,
> see it here
>
> http://lucene.472066.n3.nabble.com/Possible-sharded-and-replicated-replacement-for-ExternalFileFields-in-SolrCloud-td4022108.html
>
> Simone
>
>
> 2012/11/23 Simone Gianni <si...@apache.org>
>
> > 2012/11/22 Martin Koch <ma...@issuu.com>
> >
> >> IMO it would be ideal if the lucene/solr community could come up with a
> >> good way of updating fields in a document without reindexing. This could be
> >> by linking to some external data store, or in the lucene/solr internals. If
> >> it would make things easier, a good first step would be to have dynamically
> >> updateable numerical fields only.
> >>
> >
> > Hi Martin,
> > I'm working on implementing exactly this, and I have a working prototype
> > right now. I'm going to write to lucene-dev about the details and ask for
> > advice there. I'll contribute the code, so anyone interested can follow up on
> > dev.
> >
> > Simone
> >
> >
>

Re: SolrCloud and external file fields

Posted by Simone Gianni <si...@apache.org>.

Posted,
see it here
http://lucene.472066.n3.nabble.com/Possible-sharded-and-replicated-replacement-for-ExternalFileFields-in-SolrCloud-td4022108.html

Simone


2012/11/23 Simone Gianni <si...@apache.org>

> 2012/11/22 Martin Koch <ma...@issuu.com>
>
>> IMO it would be ideal if the lucene/solr community could come up with a
>> good way of updating fields in a document without reindexing. This could be
>> by linking to some external data store, or in the lucene/solr internals. If
>> it would make things easier, a good first step would be to have dynamically
>> updateable numerical fields only.
>>
>
> Hi Martin,
> I'm working on implementing exactly this, and I have a working prototype
> right now. I'm going to write to lucene-dev about the details and ask for
> advice there. I'll contribute the code, so anyone interested can follow up on
> dev.
>
> Simone
>
>

Re: SolrCloud and external file fields

Posted by Simone Gianni <si...@apache.org>.

2012/11/22 Martin Koch <ma...@issuu.com>

> IMO it would be ideal if the lucene/solr community could come up with a
> good way of updating fields in a document without reindexing. This could be
> by linking to some external data store, or in the lucene/solr internals. If
> it would make things easier, a good first step would be to have dynamically
> updateable numerical fields only.
>

Hi Martin,
I'm working on implementing exactly this, and I have a working prototype
right now. I'm going to write to lucene-dev about the details and ask for
advice there. I'll contribute the code, so anyone interested can follow up on
dev.

Simone

Re: SolrCloud and external file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Mark,

Your comment is quite valuable. Let me mention the keyword so that it can be
found later: NoOpDistributingUpdateProcessorFactory.
Thanks!
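
For anyone arriving via that keyword, here is a minimal sketch of how a
commit request aimed at a non-distributing chain could be built. It rests on
assumptions not confirmed in this thread: that solrconfig.xml defines an
update chain (hypothetically named local-commit) listing
solr.NoOpDistributingUpdateProcessorFactory ahead of
solr.RunUpdateProcessorFactory, and that the chain is selected with the
update.chain request parameter. The host, port, and core name are
placeholders.

```python
from urllib.parse import urlencode


def local_commit_url(base_url: str, core: str, chain: str = "local-commit") -> str:
    """Build a commit URL for one core that selects a named update chain.

    The chain name is hypothetical; it has to match a chain defined in
    solrconfig.xml that does not distribute updates to other cores.
    """
    # commit=true requests a commit; update.chain picks the processor chain.
    query = urlencode({"commit": "true", "update.chain": chain})
    return f"{base_url.rstrip('/')}/{core}/update?{query}"


if __name__ == "__main__":
    # Placeholder host and core name, for illustration only.
    print(local_commit_url("http://localhost:8080/solr", "core1"))
```

The resulting URL can be requested with curl in the same way as the commit
commands quoted below; whether the commit then really stays on that one core
should be checked in the log for "Distrib commit to" lines.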


On Wed, Nov 28, 2012 at 5:56 PM, Mark Miller <ma...@gmail.com> wrote:

> Keep in mind that the distrib update proc will be auto inserted into
> chains! You have to include a proc that disables it - see the FAQ:
> http://wiki.apache.org/solr/SolrCloud#FAQ
>
> - Mark
>
> On Nov 28, 2012, at 7:25 AM, Mikhail Khludnev <mk...@griddynamics.com>
> wrote:
>
> > Martin,
> > Right: as long as the node is registered in Zookeeper, DistributedUpdateProcessor
> > will broadcast commits to all peers. To work around this, you can introduce a
> > dedicated UpdateProcessorChain without DistributedUpdateProcessor and send the
> > commit to that chain.
> > On 28.11.2012 at 13:16, "Martin Koch" <ma...@issuu.com> wrote:
> >
> >> Mikhail
> >>
> >> I haven't experimented further yet. I think that the previous experiment
> >> of issuing a commit to a specific core proved that all cores get the
> >> commit, so I don't think that this approach will work.
> >>
> >> Thanks,
> >> /Martin
> >>
> >>
> >> On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <
> >> mkhludnev@griddynamics.com> wrote:
> >>
> >>> Martin,
> >>>
> >>> It's still not clear to me whether you solved the problem completely or
> >>> partially:
> >>> Does reducing the number of cores free some resources for searching during
> >>> commit?
> >>> Does committing the cores one by one prevent the "freeze"?
> >>>
> >>> Thanks
> >>>
> >>>
> >>> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <ma...@issuu.com> wrote:
> >>>
> >>>> Mikhail
> >>>>
> >>>> To avoid freezes we deployed the patches that are now on the 4.1 trunk
> >>>> (bug 3985). But this wasn't good enough, because SOLR would still take very
> >>>> long to restart when that was necessary.
> >>>>
> >>>> I don't see how we could throw more hardware at the problem without
> >>>> making it worse, really - the only solution here would be *fewer* shards,
> >>>> not more.
> >>>>
> >>>> IMO it would be ideal if the lucene/solr community could come up with a
> >>>> good way of updating fields in a document without reindexing. This could be
> >>>> by linking to some external data store, or in the lucene/solr internals. If
> >>>> it would make things easier, a good first step would be to have dynamically
> >>>> updateable numerical fields only.
> >>>>
> >>>> /Martin
> >>>>
> >>>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
> >>>> mkhludnev@griddynamics.com> wrote:
> >>>>
> >>>>> Martin,
> >>>>>
> >>>>> I don't think solrconfig.xml would shed any light on this. I've just found
> >>>>> what I was missing about your setup: how to explicitly assign a core to a
> >>>>> collection. Now I understand most of the details after all!
> >>>>> The ball is in your court: let us know whether you have managed to get your
> >>>>> cores to commit one by one to avoid the freeze, or whether you could
> >>>>> eliminate the pauses by allocating more hardware.
> >>>>> Thanks in advance!
> >>>>>
> >>>>>
> >>>>> On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <ma...@issuu.com> wrote:
> >>>>>
> >>>>>> Mikhail,
> >>>>>>
> >>>>>> PSB
> >>>>>>
> >>>>>> On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
> >>>>>> mkhludnev@griddynamics.com> wrote:
> >>>>>>
> >>>>>>> On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I wasn't aware until now that it is possible to send a commit to one core
> >>>>>>>> only. What we observed was the effect of curl
> >>>>>>>> localhost:8080/solr/update?commit=true but perhaps we should experiment
> >>>>>>>> with solr/coreN/update?commit=true. A quick trial run seems to indicate
> >>>>>>>> that a commit to a single core causes commits on all cores.
> >>>>>>>>
> >>>>>>> You should see something like this in the log:
> >>>>>>> ... SolrCmdDistributor .... Distrib commit to: ...
> >>>>>>>
> >>>>>> Yup, a commit towards a single core results in a commit on all cores.
> >>>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Perhaps I should clarify that we are using SOLR as a black box; we do not
> >>>>>>>> touch the code at all - we only install the distribution WAR file and
> >>>>>>>> proceed from there.
> >>>>>>>>
> >>>>>>> I still don't understand how you deploy/launch Solr. How many Jetty
> >>>>>>> instances do you start? Do you pass -DzkRun -DzkHost -DnumShards=2, or do
> >>>>>>> you specify the shards= param for every request and distribute updates
> >>>>>>> yourself? What collections do you create, and with which settings?
> >>>>>>>
> >>>>>> We let SOLR do the sharding using one collection with 16 SOLR cores
> >>>>>> holding one shard each. We launch only one instance of jetty with the
> >>>>>> following arguments:
> >>>>>>
> >>>>>> -DnumShards=16
> >>>>>> -DzkHost=<zookeeperhost:port>
> >>>>>> -Xmx10G
> >>>>>> -Xms10G
> >>>>>> -Xmn2G
> >>>>>> -server
> >>>>>>
> >>>>>> Would you like to see the solrconfig.xml?
> >>>>>>
> >>>>>> /Martin
> >>>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Also from my POV such deployments should start at least from *16* 4-way
> >>>>>>>>> vboxes, it's more expensive, but should be much better available during
> >>>>>>>>> cpu-consuming operations.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with
> >>>>>>>> 16 cores? Or am I misunderstanding something :) ?
> >>>>>>>>
> >>>>>>> I prefer to start from 16 hosts with 4 cores each.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Other details, if you use single jetty for all of them, are you sure that
> >>>>>>>>> jetty's threadpool doesn't limit requests? is it large enough?
> >>>>>>>>> You have 60G and set -Xmx=10G. are you sure that total size of cores index
> >>>>>>>>> directories is less than 45G?
> >>>>>>>>>
> >>>>>>>> The total index size is 230 GB, so it won't fit in ram, but we're using an
> >>>>>>>> SSD disk to minimize disk access time. We have tried putting the EFF onto a
> >>>>>>>> ram disk, but this didn't have a measurable effect.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> /Martin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Mikhail
> >>>>>>>>>>
> >>>>>>>>>> PSB
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> >>>>>>>>>> mkhludnev@griddynamics.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Martin,
> >>>>>>>>>>>
> >>>>>>>>>>> Please find additional question from me below.
> >>>>>>>>>>>
> >>>>>>>>>>> Simone,
> >>>>>>>>>>>
> >>>>>>>>>>> I'm sorry for hijacking your thread. The only thing I've heard about it at
> >>>>>>>>>>> recent ApacheCon sessions is that Zookeeper is supposed to replicate those
> >>>>>>>>>>> files as configs under solr home. And I'm really looking forward to
> >>>>>>>>>>> learning how it works with huge files in production.
> >>>>>>>>>>>
> >>>>>>>>>>> Thank You, Guys!
> >>>>>>>>>>>
> >>>>>>>>>>> On 20.11.2012 at 18:06, "Martin Koch" <mak@issuu.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Mikhail
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please see answers below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> >>>>>>>>>>>> mkhludnev@griddynamics.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Martin,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you for telling your own "war-story". It's really useful for the
> >>>>>>>>>>>>> community.
> >>>>>>>>>>>>> The first question might seem naive, but would you tell me
> >>>>>>>>>>>>> what blocks searching during EFF reload, when it's triggered by a
> >>>>>>>>>>>>> handler or by a listener?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> We continuously index new documents using CommitWithin to get regular
> >>>>>>>>>>>> commits. However, we observed that the EFFs were not re-read, so we had to
> >>>>>>>>>>>> do external commits (curl '.../solr/update?commit=true') to force reload.
> >>>>>>>>>>>> When this is done, solr blocks. I can't tell you exactly why it's doing
> >>>>>>>>>>>> that (it was related to SOLR-3985).
> >>>>>>>>>>>
> >>>>>>>>>>> Is there a chance to get a thread dump when they are
> >>>> blocked?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> Well I could try to recreate the situation. But the setup is fairly simple:
> >>>>>>>>>> Create a large EFF in a largeish index with many shards. Issue a commit,
> >>>>>>>>>> and then try to do a search. Solr will not respond to the search before the
> >>>>>>>>>> commit has completed, and this will take a long time.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I don't really get the sentence about sequential commits and number of
> >>>>>>>>>>>>> cores. Do I get right that file is replicated via Zookeeper? Doesn't it
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Again, this is observed behavior. When we issue a commit on a system with
> >>>>>>>>>>>> many solr cores using EFFs, the system blocks for a long time
> >>>>>>>>>>>> (15 minutes).  We do NOT use zookeeper for anything. The EFF is a symlink
> >>>>>>>>>>>> from each core's index dir to the actual file, which is updated by an
> >>>>>>>>>>>> external process.
> >>>>>>>>>>>
> >>>>>>>>>>> Hold on, I asked about Zookeeper because the subj mentions SolrCloud.
> >>>>>>>>>>>
> >>>>>>>>>>> Do you use SolrCloud, SolrShards, or are these cores just replicas of the
> >>>>>>>>>>> same index?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit
> >>>>>>>>>> unsure about the terminology here, but we've got a single index divided
> >>>>>>>>>> into 16 shards. Each shard is hosted in a solr core.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Also, about the symlink - don't you share that file via some NFS?
> >>>>>>>>>>>
> >>>>>>>>>> No, we generate the EFF on the local solr host (there is only one physical
> >>>>>>>>>> host that holds all shards), so there is no need for NFS or copying files
> >>>>>>>>>> around. No need for Zookeeper either.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> How many cores do you run per box?
> >>>>>>>>>>>
> >>>>>>>>>> This box is a 16-virtual core (8 hyperthreaded cores) with 60GB of RAM. We
> >>>>>>>>>> run 16 solr cores on this box in Jetty.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Do the boxes have plenty of RAM to cache the filesystem besides the JVM
> >>>>>>>>>>> heaps?
> >>>>>>>>>>>
> >>>>>>>>>> Yes. We've allocated 10GB for jetty, and left the rest for the OS.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> I assume you use 64-bit linux and mmap directory. Please confirm that.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> We use 64-bit linux. I'm not sure about the mmap directory or where that
> >>>>>>>>>> would be configured in solr - can you explain that?
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> causes scalability problem or long time to reload? Will it help if we'll
> >>>>>>>>>>>>> have, let's say ExternalDatabaseField which will pull values from jdbc. ie.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think the possibility of having some fields being retrieved from an
> >>>>>>>>>>>> external, dynamically updatable store would be really interesting. This
> >>>>>>>>>>>> could be JDBC, something in-memory like redis, or a NoSql product (e.g.
> >>>>>>>>>>>> Cassandra).
> >>>>>>>>>>>
> >>>>>>>>>>> Ok. Let's have it in mind as a possible direction.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Alternatively, an API that would allow updating a single field for a
> >>>>>>>>>> document might be an option.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> why all cores can't read these values simultaneously?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Again, this is a solr implementation detail that I can't answer :)
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Can you confirm that the IDs in the file are ordered by the index term
> >>>>>>>>>>>>> order?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, we sorted the files (standard UNIX sort).
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> AFAIK it can impact load time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Yes, it does
> >>>>>>>>>>>
> >>>>>>>>>>> Ok, I see that you are aware of it, and that your IDs are just strings,
> >>>>>>>>>>> not integers.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> Yes, ids are strings.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Regarding your post-query solution: if a query finds 10000 docs, but I
> >>>>>>>>>>>>> need to display only the first page with 100 rows, do I need to pull
> >>>>>>>>>>>>> all 10K results to the frontend to order them by rank?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>> In our architecture, the clients query an API that generates the SOLR
> >>>>>>>>>>>> query, retrieves the relevant additional fields that we need, and returns
> >>>>>>>>>>>> the relevant JSON to the front-end.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In our use case, results are returned from SOLR by the 10's, not by the
> >>>>>>>>>>>> 1000's, so it is a manageable job. Even so, if solr returned thousands of
> >>>>>>>>>>>> results, it would be up to the implementation of the api to augment only
> >>>>>>>>>>>> the results that needed to be returned to the front-end.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Even so, patching up a JSON structure with 10000 results should be
> >>>>>>>>>>>> possible.
> >>>>>>>>>>>
> >>>>>>>>>>> You are right. I'm concerned anyway because retrieving the whole result is
> >>>>>>>>>>> expensive, and not always possible.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> In our case, getting the whole result is almost impossible, because that
> >>>>>>>>>> would be millions of documents, and returning the Nth result seems to be a
> >>>>>>>>>> quadratic (or worse) operation in SOLR.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I'd really appreciate it if you commented on the questions above.
> >>>>>>>>>>>>> PS: It's time for a pitch: how much could
> >>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> >>>>>>>>>>>>> ExternalFileField" help you?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>> It looks very interesting :) Does it make it possible to avoid re-reading
> >>>>>>>>>>>> the EFF on every commit, and only re-read the values that have actually
> >>>>>>>>>>>> changed?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> You don't need a commit (in SOLR-4085) to reload the file content, but
> >>>>>>>>>>> after a commit you need to read the whole file and scan all key terms and
> >>>>>>>>>>> postings. That's because the EFF sits on top of the top-level searcher;
> >>>>>>>>>>> it's the Solr-like way. At some point in the future we might have a
> >>>>>>>>>>> per-segment EFF; in that case adding a segment will still trigger a full
> >>>>>>>>>>> file scan, but in the index only the new segment will be scanned. It
> >>>>>>>>>>> should be faster. You know, straightforward sharing of internal data
> >>>>>>>>>>> structures between different index views/generations is not possible.
> >>>>>>>>>>> If you are asking about applying delta changes to the external file,
> >>>>>>>>>>> that's something we did ourselves: http://goo.gl/P8GFq . This feature is
> >>>>>>>>>>> much more doubtful and vague, although it might be the next contribution
> >>>>>>>>>>> after SOLR-4085.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> /Martin
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <mak@issuu.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Solr 4.0 does support using EFFs, but it might not give you what you're
> >>>>>>>>>>>>>> hoping for.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We tried using Solr Cloud, and have given up again.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The EFF is placed in the parent of the index directory in each core; each
> >>>>>>>>>>>>>> core reads the entire EFF and picks out the IDs that it is responsible for.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the current 4.0.0 release of solr, solr blocks (doesn't answer queries)
> >>>>>>>>>>>>>> while re-reading the EFF. Even worse, it seems that the time to re-read the
> >>>>>>>>>>>>>> EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by
> >>>>>>>>>>>>>> each core sequentially). The contents of the EFF become active after the
> >>>>>>>>>>>>>> first EXTERNAL commit (commitWithin does NOT work here) after the file has
> >>>>>>>>>>>>>> been updated.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In our case, the EFF was quite large - around 450MB - and we use 16 shards,
> >>>>>>>>>>>>>> so when we triggered an external commit to force re-reading, the whole
> >>>>>>>>>>>>>> system would block for several (10-15) minutes. This won't work in a
> >>>>>>>>>>>>>> production environment. The reason for the size of the EFF is that we have
> >>>>>>>>>>>>>> around 7M documents in the index; each document has a 45 character ID.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We got some help to try to fix the problem so that the re-read of the EFF
> >>>>>>>>>>>>>> proceeds in the background (see
> >>>>>>>>>>>>>> here <https://issues.apache.org/jira/browse/SOLR-3985> for
> >>>>>>>>>>>>>> a fix on the 4.1 branch). However, even though the re-read proceeds in the
> >>>>>>>>>>>>>> background, the time required to launch solr now takes at least as long as
> >>>>>>>>>>>>>> re-reading the EFFs. Again, this is not good enough for our needs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The next issue is that you cannot sort on EFF fields (though you can return
> >>>>>>>>>>>>>> them as values using &fl=field(my_eff_field)). This is also fixed in the 4.1
> >>>>>>>>>>>>>> branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So: Even after these fixes, EFF performance is not that great. Our solution
> >>>>>>>>>>>>>> is as follows: The actual value of the popularity measure (say, reads) that
> >>>>>>>>>>>>>> we want to report to the user is inserted into the search response
> >>>>>>>>>>>>>> post-query by our query front-end. This value will then be the
> >>>>>>>>>>>>>> authoritative value at the time of the query. The value of the popularity
> >>>>>>>>>>>>>> measure that we use for boosting in the ranking of the search results is
> >>>>>>>>>>>>>> only updated when the value has changed enough so that the impact on the
> >>>>>>>>>>>>>> boost will be significant (say, more than 2%). This does require frequent
> >>>>>>>>>>>>>> re-indexing of the documents that have significant changes in the number of
> >>>>>>>>>>>>>> reads, but at least we won't have to update a document if it moves from,
> >>>>>>>>>>>>>> say, 1000000 to 1000001 reads.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> /Martin Koch - ISSUU - senior systems architect.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simoneg@apache.org> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>> I'm planning to move a quite big Solr index to SolrCloud. However, in this
> >>>>>>>>>>>>>>> index, an external file field is used for popularity ranking.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Does SolrCloud support external file fields? How does it cope with
> >>>>>>>>>>>>>>> sharding and replication? Where should the external file be placed now that
> >>>>>>>>>>>>>>> the index folder is not local but in the cloud?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Otherwise, are there other best practices for the use cases that
> >>>>>>>>>>>>>>> external file fields served, such as popularity ranking, in SolrCloud?
> >>>>>>>>>>>>>>> Custom ValueSources backed by something external?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks in advance,
> >>>>>>>>>>>>>>> Simone
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Sincerely yours
> >>>>>>>>>>>>> Mikhail Khludnev
> >>>>>>>>>>>>> Principal Engineer,
> >>>>>>>>>>>>> Grid Dynamics
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <http://www.griddynamics.com>
> >>>>>>>>>>>>> <mk...@griddynamics.com>
> >>>>>>>>>>>>>
> >>>>>>>>>>> On 20.11.2012 at 18:06, "Martin Koch" <mak@issuu.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Mikhail
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please see answers below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> >>>>>>>>>>>> mkhludnev@griddynamics.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Martin,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you for telling your own "war-story". It's really
> >>>>>> useful
> >>>>>>>> for
> >>>>>>>>>>>>> community.
> >>>>>>>>>>>>> The first question might seems not really conscious,
> >>>> but
> >>>>>> would
> >>>>>>>> you
> >>>>>>>>>> tell
> >>>>>>>>>>>> me
> >>>>>>>>>>>>> what blocks searching during EFF reload, when it's
> >>>>> triggered
> >>>>>> by
> >>>>>>>>>> handler
> >>>>>>>>>>>> or
> >>>>>>>>>>>>> by listener?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> We continuously index new documents using CommitWithin
> >>>> to get
> >>>>>>>> regular
> >>>>>>>>>>>> commits. However, we observed that the EFFs were not
> >>>> re-read,
> >>>>>> so
> >>>>>>> we
> >>>>>>>>> had
> >>>>>>>>>>> to
> >>>>>>>>>>>> do external commits (curl '.../solr/update?commit=true')
> >>>> to
> >>>>>> force
> >>>>>>>>>> reload.
> >>>>>>>>>>>> When this is done, solr blocks. I can't tell you exactly
> >>>> why
> >>>>>> it's
> >>>>>>>>> doing
> >>>>>>>>>>>> that (it was related to SOLR-3985).
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I don't really get the sentence about sequential
> >>>> commits
> >>>>> and
> >>>>>>>> number
> >>>>>>>>>> of
> >>>>>>>>>>>>> cores. Do I get right that file is replicated via
> >>>>> Zookeeper?
> >>>>>>>>> Doesn't
> >>>>>>>>>> it
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Again, this is observed behavior. When we issue a commit
> >>>> on a
> >>>>>>>> system
> >>>>>>>>>>> with a
> >>>>>>>>>>>> system with many solr cores using EFFs, the system blocks
> >>>>> for a
> >>>>>>>> long
> >>>>>>>>>> time
> >>>>>>>>>>>> (15 minutes).  We do NOT use zookeeper for anything. The
> >>>> EFF
> >>>>>> is a
> >>>>>>>>>> symlink
> >>>>>>>>>>>> from each cores index dir to the actual file, which is
> >>>>> updated
> >>>>>> by
> >>>>>>>> an
> >>>>>>>>>>>> external process.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> causes scalability problem or long time to reload?
> >>>> Will it
> >>>>>> help
> >>>>>>>> if
> >>>>>>>>>>> we'll
> >>>>>>>>>>>>> have, let's say ExternalDatabaseField which will pull
> >>>>> values
> >>>>>>> from
> >>>>>>>>>> jdbc.
> >>>>>>>>>>>> ie.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think the possibility of having some fields being
> >>>> retrieved
> >>>>>>> from
> >>>>>>>> an
> >>>>>>>>>>>> external, dynamically updatable store would be really
> >>>>>>> interesting.
> >>>>>>>>> This
> >>>>>>>>>>>> could be JDBC, something in-memory like redis, or a NoSql
> >>>>>> product
> >>>>>>>>> (e.g.
> >>>>>>>>>>>> Cassandra).
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> why all cores can't read these values simultaneously?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Again, this is a solr implementation detail that I can't
> >>>>> answer
> >>>>>>> :)
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Can you confirm that IDs in the file is ordered by the
> >>>>> index
> >>>>>>> term
> >>>>>>>>>>> order?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, we sorted the files (standard UNIX sort).
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> AFAIK it can impact load time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Yes, it does.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Regarding your post-query solution can you tell me if
> >>>> query
> >>>>>>> found
> >>>>>>>>>> 10000
> >>>>>>>>>>>>> docs, but I need to display only first page with 100
> >>>> rows,
> >>>>>>>> whether
> >>>>>>>>> I
> >>>>>>>>>>> need
> >>>>>>>>>>>>> to pull all 10K results to frontend to order them by
> >>>> the
> >>>>>> rank?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>> In our architecture, the clients query an API that
> >>>> generates
> >>>>>> the
> >>>>>>>> SOLR
> >>>>>>>>>>>> query, retrieves the relevant additional fields that we
> >>>>> needs,
> >>>>>>> and
> >>>>>>>>>>> returns
> >>>>>>>>>>>> the relevant JSON to the front-end.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In our use case, results are returned from SOLR by the
> >>>> 10's,
> >>>>>> not
> >>>>>>> by
> >>>>>>>>> the
> >>>>>>>>>>>> 1000's, so it is a manageable job. Even so, if solr
> >>>> returned
> >>>>>>>>> thousands
> >>>>>>>>>> of
> >>>>>>>>>>>> results, it would be up to the implementation of the api
> >>>> to
> >>>>>>> augment
> >>>>>>>>>> only
> >>>>>>>>>>>> the results that needed to be returned to the front-end.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Even so, patching up a JSON structure with 10000 results
> >>>>> should
> >>>>>>> be
> >>>>>>>>>>>> possible.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I'm really appreciate if you comment on the questions
> >>>>> above.
> >>>>>>>>>>>>> PS: It's time to pitch, how much
> >>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-4085
> >>>> "Commit-free
> >>>>>>>>>>>>> ExternalFileField" can help you?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It looks very interesting :) Does it make it possible
> >>>> to
> >>>>>> avoid
> >>>>>>>>>>> re-reading
> >>>>>>>>>>>> the EFF on every commit, and only re-read the values that
> >>>>> have
> >>>>>>>>> actually
> >>>>>>>>>>>> changed?
> >>>>>>>>>>>>
> >>>>>>>>>>>> /Martin
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
> >>>>> mak@issuu.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Solr 4.0 does support using EFFs, but it might not
> >>>> give
> >>>>> you
> >>>>>>>> what
> >>>>>>>>>>> you're
> >>>>>>>>>>>>>> hoping fore.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We tried using Solr Cloud, and have given up again.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The EFF is placed in the parent of the index
> >>>> directory in
> >>>>>>> each
> >>>>>>>>>> core;
> >>>>>>>>>>>> each
> >>>>>>>>>>>>>> core reads the entire EFF and picks out the IDs that
> >>>> it
> >>>>> is
> >>>>>>>>>>> responsible
> >>>>>>>>>>>>> for.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the current 4.0.0 release of solr, solr blocks
> >>>>> (doesn't
> >>>>>>>> answer
> >>>>>>>>>>>>> queries)
> >>>>>>>>>>>>>> while re-reading the EFF. Even worse, it seems that
> >>>> the
> >>>>>> time
> >>>>>>> to
> >>>>>>>>>>> re-read
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> EFF is multiplied by the number of cores in use
> >>>> (i.e. the
> >>>>>> EFF
> >>>>>>>> is
> >>>>>>>>>>>> re-read
> >>>>>>>>>>>>> by
> >>>>>>>>>>>>>> each core sequentially). The contents of the EFF
> >>>> become
> >>>>>>> active
> >>>>>>>>>> after
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> first EXTERNAL commit (commitWithin does NOT work
> >>>> here)
> >>>>>> after
> >>>>>>>> the
> >>>>>>>>>>> file
> >>>>>>>>>>>>> has
> >>>>>>>>>>>>>> been updated.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In our case, the EFF was quite large - around 450MB
> >>>> - and
> >>>>>> we
> >>>>>>>> use
> >>>>>>>>> 16
> >>>>>>>>>>>>> shards,
> >>>>>>>>>>>>>> so when we triggered an external commit to force
> >>>>>> re-reading,
> >>>>>>>> the
> >>>>>>>>>>> whole
> >>>>>>>>>>>>>> system would block for several (10-15) minutes. This
> >>>>> won't
> >>>>>>> work
> >>>>>>>>> in
> >>>>>>>>>> a
> >>>>>>>>>>>>>> production environment. The reason for the size of
> >>>> the
> >>>>> EFF
> >>>>>> is
> >>>>>>>>> that
> >>>>>>>>>> we
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>> around 7M documents in the index; each document has
> >>>> a 45
> >>>>>>>>> character
> >>>>>>>>>>> ID.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We got some help to try to fix the problem so that
> >>>> the
> >>>>>>> re-read
> >>>>>>>> of
> >>>>>>>>>> the
> >>>>>>>>>>>> EFF
> >>>>>>>>>>>>>> proceeds in the background (see
> >>>>>>>>>>>>>> here<https://issues.apache.org/jira/browse/SOLR-3985
> >>>>>
> >>>>> for
> >>>>>>>>>>>>>> a fix on the 4.1 branch). However, even though the
> >>>>> re-read
> >>>>>>>>> proceeds
> >>>>>>>>>>> in
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> background, the time required to launch solr now
> >>>> takes at
> >>>>>>> least
> >>>>>>>>> as
> >>>>>>>>>>> long
> >>>>>>>>>>>>> as
> >>>>>>>>>>>>>> re-reading the EFFs. Again, this is not good enough
> >>>> for
> >>>>> our
> >>>>>>>>> needs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The next issue is that you cannot sort on EFF fields
> >>>>>> (though
> >>>>>>>> you
> >>>>>>>>>> can
> >>>>>>>>>>>>> return
> >>>>>>>>>>>>>> them as values using &fl=field(my_eff_field). This is
> >>>>> also
> >>>>>>>> fixed
> >>>>>>>>> in
> >>>>>>>>>>> the
> >>>>>>>>>>>>> 4.1
> >>>>>>>>>>>>>> branch here <
> >>>>>> https://issues.apache.org/jira/browse/SOLR-4022
> >>>>>>>> .
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So: Even after these fixes, EFF performance is not
> >>>> that
> >>>>>>> great.
> >>>>>>>>> Our
> >>>>>>>>>>>>> solution
> >>>>>>>>>>>>>> is as follows: The actual value of the popularity
> >>>> measure
> >>>>>>> (say,
> >>>>>>>>>>> reads)
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>> we want to report to the user is inserted into the
> >>>> search
> >>>>>>>>> response
> >>>>>>>>>>>>>> post-query by our query front-end. This value will
> >>>> then
> >>>>> be
> >>>>>>> the
> >>>>>>>>>>>>>> authoritative value at the time of the query. The
> >>>> value
> >>>>> of
> >>>>>>> the
> >>>>>>>>>>>> popularity
> >>>>>>>>>>>>>> measure that we use for boosting in the ranking of
> >>>> the
> >>>>>> search
> >>>>>>>>>> results
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>> only updated when the value has changed enough so
> >>>> that
> >>>>> the
> >>>>>>>> impact
> >>>>>>>>>> on
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> boost will be significant (say, more than 2%). This
> >>>> does
> >>>>>>>> require
> >>>>>>>>>>>> frequent
> >>>>>>>>>>>>>> re-indexing of the documents that have significant
> >>>>> changes
> >>>>>> in
> >>>>>>>> the
> >>>>>>>>>>>> number
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>> reads, but at least we won't have to update a
> >>>> document if
> >>>>>> it
> >>>>>>>>> moves
> >>>>>>>>>>>> from,
> >>>>>>>>>>>>>> say, 1000000 to 1000001 reads.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> /Martin Koch - ISSUU - senior systems architect.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> >>>>>>>>> simoneg@apache.org
> >>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>> I'm planning to move a quite big Solr index to
> >>>>> SolrCloud.
> >>>>>>>>>> However,
> >>>>>>>>>>> in
> >>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>> index, an external file field is used for
> >>>> popularity
> >>>>>>> ranking.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Does SolrCloud supports external file fields? How
> >>>> does
> >>>>> it
> >>>>>>>> cope
> >>>>>>>>>> with
> >>>>>>>>>>>>>>> sharding and replication? Where should the external
> >>>>> file
> >>>>>> be
> >>>>>>>>>> placed
> >>>>>>>>>>>> now
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>> the index folder is not local but in the cloud?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Are there otherwise other best practices to deal
> >>>> with
> >>>>> the
> >>>>>>> use
> >>>>>>>>>> cases
> >>>>>>>>>>>>>>> external file fields were used for, like
> >>>>>>> popularity/ranking,
> >>>>>>>> in
> >>>>>>>>>>>>>> SolrCloud?
> >>>>>>>>>>>>>>> Custom ValueSources going to something external?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks in advance,
> >>>>>>>>>>>>>>> Simone
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Sincerely yours
> >>>>>>>>>>>>> Mikhail Khludnev
> >>>>>>>>>>>>> Principal Engineer,
> >>>>>>>>>>>>> Grid Dynamics
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <http://www.griddynamics.com>
> >>>>>>>>>>>>> <mk...@griddynamics.com>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Sincerely yours
> >>>>>>>>> Mikhail Khludnev
> >>>>>>>>> Principal Engineer,
> >>>>>>>>> Grid Dynamics
> >>>>>>>>>
> >>>>>>>>> <http://www.griddynamics.com>
> >>>>>>>>> <mk...@griddynamics.com>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Sincerely yours
> >>>>>>> Mikhail Khludnev
> >>>>>>> Principal Engineer,
> >>>>>>> Grid Dynamics
> >>>>>>>
> >>>>>>> <http://www.griddynamics.com>
> >>>>>>> <mk...@griddynamics.com>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Sincerely yours
> >>>>> Mikhail Khludnev
> >>>>> Principal Engineer,
> >>>>> Grid Dynamics
> >>>>>
> >>>>> <http://www.griddynamics.com>
> >>>>> <mk...@griddynamics.com>
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Sincerely yours
> >>> Mikhail Khludnev
> >>> Principal Engineer,
> >>> Grid Dynamics
> >>>
> >>> <http://www.griddynamics.com>
> >>> <mk...@griddynamics.com>
> >>>
> >>>
> >>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: SolrCloud and exernal file fields

Posted by Mark Miller <ma...@gmail.com>.

Keep in mind that the distributed update processor will be automatically inserted into update chains! You have to include a processor that disables it - see the FAQ: http://wiki.apache.org/solr/SolrCloud#FAQ

- Mark
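
A minimal sketch of the workaround discussed in this thread: sending an explicit commit to one core through a dedicated update chain. The host, the core name and the chain name `nodistrib` are hypothetical; the chain itself would have to be defined in that core's solrconfig.xml with a processor that disables distribution, as the FAQ describes. The sketch only builds the request URL, using Solr's `update.chain` request parameter.

```python
from urllib.parse import urlencode


def commit_url(base_url, core, chain=None):
    """Build an explicit-commit URL for a single Solr core.

    `chain` is the name of an update processor chain defined in that
    core's solrconfig.xml; passing it adds the `update.chain` parameter.
    """
    params = {"commit": "true"}
    if chain is not None:
        params["update.chain"] = chain
    return f"{base_url.rstrip('/')}/{core}/update?{urlencode(params)}"


# Hypothetical host, core and chain names, for illustration only.
print(commit_url("http://localhost:8080/solr", "core3", chain="nodistrib"))
```

Whether such a commit really stays local depends on how the chain is configured, so it is worth checking the log for "Distrib commit to" lines after trying it.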

On Nov 28, 2012, at 7:25 AM, Mikhail Khludnev <mk...@griddynamics.com> wrote:

> Martin,
> Right: as long as the node is registered in Zookeeper,
> DistributedUpdateProcessor will broadcast commits to all peers. To work
> around this you can introduce a dedicated UpdateProcessorChain without
> DistributedUpdateProcessor and send the commit to that chain.
> On 28.11.2012 13:16, "Martin Koch" <ma...@issuu.com> wrote:
> 
>> Mikhail
>> 
>> I haven't experimented further yet. I think that the previous experiment
>> of issuing a commit to a specific core proved that all cores get the
>> commit, so I don't think that this approach will work.
>> 
>> Thanks,
>> /Martin
>> 
>> 
>> On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <
>> mkhludnev@griddynamics.com> wrote:
>> 
>>> Martin,
>>> 
>>> It's still not clear to me whether you solved the problem completely or
>>> only partially:
>>> Does reducing the number of cores free some resources for searching
>>> during a commit?
>>> Does committing the cores one by one prevent the "freeze"?
>>> 
>>> Thanks
>>> 
>>> 
>>> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <ma...@issuu.com> wrote:
>>> 
>>>> Mikhail
>>>> 
>>>> To avoid freezes we deployed the patches that are now on the 4.1 trunk
>>>> (bug 3985). But this wasn't good enough, because SOLR would still take
>>>> very long to restart when that was necessary.
>>>> 
>>>> I don't see how we could throw more hardware at the problem without
>>>> making it worse, really - the only solution here would be *fewer*
>>>> shards, not more.
>>>> 
>>>> IMO it would be ideal if the lucene/solr community could come up with a
>>>> good way of updating fields in a document without reindexing. This could
>>>> be by linking to some external data store, or in the lucene/solr
>>>> internals. If it would make things easier, a good first step would be to
>>>> have dynamically updateable numerical fields only.
>>>> 
>>>> /Martin
>>>> 
>>>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
>>>> mkhludnev@griddynamics.com> wrote:
>>>> 
>>>>> Martin,
>>>>> 
>>>>> I don't think solrconfig.xml would shed any more light on this. I've
>>>>> just found the part of your setup that I didn't get - how cores are
>>>>> explicitly assigned to a collection. Now I understand most of the
>>>>> details after all!
>>>>> The ball is in your court: let us know whether you have managed to make
>>>>> your cores commit one by one to avoid the freeze, or whether you could
>>>>> eliminate the pauses by allocating more hardware.
>>>>> Thanks in advance!
>>>>> 
>>>>> 
>>>>> On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <ma...@issuu.com> wrote:
>>>>> 
>>>>>> Mikhail,
>>>>>> 
>>>>>> PSB
>>>>>> 
>>>>>> On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
>>>>>> mkhludnev@griddynamics.com> wrote:
>>>>>> 
>>>>>>> On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> I wasn't aware until now that it is possible to send a commit to one
>>>>>>>> core only. What we observed was the effect of curl
>>>>>>>> localhost:8080/solr/update?commit=true but perhaps we should
>>>>>>>> experiment with solr/coreN/update?commit=true. A quick trial run
>>>>>>>> seems to indicate that a commit to a single core causes commits on
>>>>>>>> all cores.
>>>>>>>> 
>>>>>>> You should see something like this in the log:
>>>>>>> ... SolrCmdDistributor .... Distrib commit to: ...
>>>>>>> 
>>>>>> Yup, a commit towards a single core results in a commit on all cores.
>>>>>> 
>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Perhaps I should clarify that we are using SOLR as a black box; we do
>>>>>>>> not touch the code at all - we only install the distribution WAR file
>>>>>>>> and proceed from there.
>>>>>>>> 
>>>>>>> I still don't understand how you deploy/launch Solr. How many Jetty
>>>>>>> instances do you start? Do you pass -DzkRun -DzkHost -DnumShards=2, or
>>>>>>> do you specify the shards= param for every request and distribute the
>>>>>>> updates yourself? What collections do you create, and with which
>>>>>>> settings?
>>>>>>> 
>>>>>> We let SOLR do the sharding using one collection with 16 SOLR cores
>>>>>> holding one shard each. We launch only one instance of jetty with the
>>>>>> following arguments:
>>>>>> 
>>>>>> -DnumShards=16
>>>>>> -DzkHost=<zookeeperhost:port>
>>>>>> -Xmx10G
>>>>>> -Xms10G
>>>>>> -Xmn2G
>>>>>> -server
>>>>>> 
>>>>>> Would you like to see the solrconfig.xml?
>>>>>> 
>>>>>> /Martin
>>>>>> 
>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Also, from my POV such deployments should start from at least *16*
>>>>>>>>> 4-way vboxes. It's more expensive, but the system should stay much
>>>>>>>>> more available during CPU-consuming operations.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
>>>>>>>> with 16 cores? Or am I misunderstanding something :) ?
>>>>>>>> 
>>>>>>> I prefer to start from 16 hosts with 4 cores each.
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Other details: if you use a single jetty for all of them, are you
>>>>>>>>> sure that jetty's threadpool doesn't limit requests? Is it large
>>>>>>>>> enough?
>>>>>>>>> You have 60G and set -Xmx=10G. Are you sure that the total size of
>>>>>>>>> the cores' index directories is less than 45G?
>>>>>>>>> 
>>>>>>>> The total index size is 230 GB, so it won't fit in ram, but we're
>>>>>>>> using an SSD disk to minimize disk access time. We have tried putting
>>>>>>>> the EFF onto a ram disk, but this didn't have a measurable effect.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> /Martin
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Mikhail
>>>>>>>>>> 
>>>>>>>>>> PSB
>>>>>>>>>> 
>>>>>>>>>> On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
>>>>>>>>>> mkhludnev@griddynamics.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Martin,
>>>>>>>>>>> 
>>>>>>>>>>> Please find additional question from me below.
>>>>>>>>>>> 
>>>>>>>>>>> Simone,
>>>>>>>>>>> 
>>>>>>>>>>> I'm sorry for hijacking your thread. The only thing I've heard
>>>>>>>>>>> about it, at recent ApacheCon sessions, is that Zookeeper is
>>>>>>>>>>> supposed to replicate those files as configs under solr home. And
>>>>>>>>>>> I'm really looking forward to learning how it works with huge
>>>>>>>>>>> files in production.
>>>>>>>>>>> 
>>>>>>>>>>> Thank You, Guys!
>>>>>>>>>>> 
>>>>>>>>>>> On 20.11.2012 18:06, "Martin Koch" <mak@issuu.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Mikhail
>>>>>>>>>>>> 
>>>>>>>>>>>> Please see answers below.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>>>>>>>>>>>> mkhludnev@griddynamics.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Martin,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you for telling your own "war-story". It's really useful
>>>>>>>>>>>>> for the community.
>>>>>>>>>>>>> The first question might not seem well thought out, but would
>>>>>>>>>>>>> you tell me what blocks searching during the EFF reload, when
>>>>>>>>>>>>> it's triggered by a handler or by a listener?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> We continuously index new documents using CommitWithin to get
>>>>>>>>>>>> regular commits. However, we observed that the EFFs were not
>>>>>>>>>>>> re-read, so we had to do external commits (curl
>>>>>>>>>>>> '.../solr/update?commit=true') to force reload. When this is
>>>>>>>>>>>> done, solr blocks. I can't tell you exactly why it's doing that
>>>>>>>>>>>> (it was related to SOLR-3985).
>>>>>>>>>>> 
>>>>>>>>>>> Is there a chance to get a thread dump when they are blocked?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> Well I could try to recreate the situation. But the setup is fairly
>>>>>>>>>> simple: Create a large EFF in a largeish index with many shards.
>>>>>>>>>> Issue a commit, and then try to do a search. Solr will not respond
>>>>>>>>>> to the search before the commit has completed, and this will take a
>>>>>>>>>> long time.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> I don't really get the sentence about sequential commits and the
>>>>>>>>>>>>> number of cores. Do I get it right that the file is replicated
>>>>>>>>>>>>> via Zookeeper? Doesn't it
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Again, this is observed behavior. When we issue a commit on a
>>>>>>>>>>>> system with many solr cores using EFFs, the system blocks for a
>>>>>>>>>>>> long time (15 minutes).  We do NOT use zookeeper for anything.
>>>>>>>>>>>> The EFF is a symlink from each core's index dir to the actual
>>>>>>>>>>>> file, which is updated by an external process.
>>>>>>>>>>> 
>>>>>>>>>>> Hold on, I asked about Zookeeper because the subject mentions
>>>>>>>>>>> SolrCloud.
>>>>>>>>>>> 
>>>>>>>>>>> Do you use SolrCloud, SolrShards, or are these cores just replicas
>>>>>>>>>>> of the same index?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm
>>>>>>>>>> a bit unsure about the terminology here, but we've got a single
>>>>>>>>>> index divided into 16 shards. Each shard is hosted in a solr core.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Also, about the symlink - don't you share that file via some NFS?
>>>>>>>>>>> 
>>>>>>>>>> No, we generate the EFF on the local solr host (there is only one
>>>>>>>>>> physical host that holds all shards), so there is no need for NFS or
>>>>>>>>>> copying files around. No need for Zookeeper either.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> How many cores do you run per box?
>>>>>>>>>>> 
>>>>>>>>>> This box is a 16-virtual core (8 hyperthreaded cores) machine with
>>>>>>>>>> 60GB of RAM. We run 16 solr cores on this box in Jetty.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Do the boxes have plenty of RAM to cache the filesystem besides the
>>>>>>>>>>> JVM heaps?
>>>>>>>>>>> 
>>>>>>>>>> Yes. We've allocated 10GB for jetty, and left the rest for the OS.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> I assume you use 64 bit linux and mmap directory. Please confirm
>>>>>>>>>>> that.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> We use 64-bit linux. I'm not sure about the mmap directory or where
>>>>>>>>>> that would be configured in solr - can you explain that?
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> cause a scalability problem or a long time to reload? Would it
>>>>>>>>>>>>> help if we had, let's say, an ExternalDatabaseField which pulls
>>>>>>>>>>>>> values from JDBC? I.e.,
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the possibility of having some fields being retrieved
>>>>>>>>>>>> from an external, dynamically updatable store would be really
>>>>>>>>>>>> interesting. This could be JDBC, something in-memory like redis,
>>>>>>>>>>>> or a NoSql product (e.g. Cassandra).
>>>>>>>>>>> 
>>>>>>>>>>> Ok. Let's have it in mind as a possible direction.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Alternatively, an API that would allow updating a single field for a
>>>>>>>>>> document might be an option.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> why can't all cores read these values simultaneously?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Again, this is a solr implementation detail that I can't answer :)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> Can you confirm that the IDs in the file are ordered by the index
>>>>>>>>>>>>> term order?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, we sorted the files (standard UNIX sort).
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> AFAIK it can impact load time.
>>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, it does
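
The external file is a plain list of `key=value` lines, one per document. A small sketch of producing it in sorted key order, assuming the keys are the string IDs mentioned above; the function name and the sample IDs are made up for illustration. Python compares strings by code point, which for plain ASCII IDs gives the same order as a byte-wise UNIX sort.

```python
def render_eff(values):
    """Return external-file-field content: one `key=value` line per
    document, sorted by key (Python compares str by code point)."""
    return "".join(f"{key}={values[key]}\n" for key in sorted(values))


# Hypothetical document IDs and read counts.
reads = {"doc-b": 120.0, "doc-a": 1000001.0, "doc-c": 7.0}
print(render_eff(reads), end="")
```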
>>>>>>>>>>> 
>>>>>>>>>>> Ok, I've got that you are aware of it, and that your IDs are just
>>>>>>>>>>> strings, not integers.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> Yes, ids are strings.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> Regarding your post-query solution: if a query found 10000 docs
>>>>>>>>>>>>> but I need to display only the first page with 100 rows, do I
>>>>>>>>>>>>> need to pull all 10K results to the frontend to order them by
>>>>>>>>>>>>> the rank?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> In our architecture, the clients query an API that generates the
>>>>>>>>>>>> SOLR query, retrieves the relevant additional fields that we
>>>>>>>>>>>> need, and returns the relevant JSON to the front-end.
>>>>>>>>>>>> 
>>>>>>>>>>>> In our use case, results are returned from SOLR by the 10's, not
>>>>>>>>>>>> by the 1000's, so it is a manageable job. Even so, if solr
>>>>>>>>>>>> returned thousands of results, it would be up to the
>>>>>>>>>>>> implementation of the api to augment only the results that needed
>>>>>>>>>>>> to be returned to the front-end.
>>>>>>>>>>>> 
>>>>>>>>>>>> Even so, patching up a JSON structure with 10000 results should be
>>>>>>>>>>>> possible.
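
A minimal sketch of the post-query augmentation described above, assuming the API layer already holds the page of documents returned by Solr and a lookup of live popularity values. The field names `id` and `reads` and the sample data are hypothetical; the thread does not say how the front-end stores or fetches these values.

```python
def augment(docs, live_values, id_field="id", target_field="reads"):
    """Return copies of the result docs with the live popularity value
    patched in; docs without a live value are returned unchanged."""
    patched = []
    for doc in docs:
        doc = dict(doc)  # copy so the caller's result list is not mutated
        if doc[id_field] in live_values:
            doc[target_field] = live_values[doc[id_field]]
        patched.append(doc)
    return patched


# Hypothetical page of results and live counters.
page = [{"id": "doc-a", "title": "A"}, {"id": "doc-b", "title": "B"}]
print(augment(page, {"doc-a": 1000001}))
```

Only the documents on the returned page are touched, so the cost grows with the page size rather than with the total number of hits.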
>>>>>>>>>>> 
>>>>>>>>>>> You are right. I'm concerned anyway, because retrieving the whole
>>>>>>>>>>> result is expensive, and not always possible.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> In our case, getting the whole result is almost impossible, because
>>>>>>>>>> that would be millions of documents, and returning the Nth result
>>>>>>>>>> seems to be a quadratic (or worse) operation in SOLR.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd really appreciate it if you commented on the questions above.
>>>>>>>>>>>>> PS: It's time to pitch: how much can
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
>>>>>>>>>>>>> ExternalFileField" help you?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> It looks very interesting :) Does it make it possible to avoid
>>>>>>>>>>>> re-reading the EFF on every commit, and only re-read the values
>>>>>>>>>>>> that have actually changed?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> You don't need a commit (with SOLR-4085) to reload the file
>>>>>>>>>>> content, but after a commit you need to read the whole file and
>>>>>>>>>>> scan all key terms and postings. That's because the EFF sits on
>>>>>>>>>>> top of the top-level searcher; it's the Solr-like way.
>>>>>>>>>>> In the future we might have a per-segment EFF; in that case adding
>>>>>>>>>>> a segment will still trigger a full file scan, but in the index
>>>>>>>>>>> only that new segment will be scanned. It should be faster. You
>>>>>>>>>>> know, straightforward sharing of internal data structures between
>>>>>>>>>>> different index views/generations is not possible.
>>>>>>>>>>> If you are asking about applying delta changes to the external
>>>>>>>>>>> file, that's something we did ourselves: http://goo.gl/P8GFq .
>>>>>>>>>>> This feature is much more doubtful and vague, although it might be
>>>>>>>>>>> the next contribution after SOLR-4085.
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> /Martin
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <mak@issuu.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Solr 4.0 does support using EFFs, but it might not give you
>>>>>>>>>>>>>> what you're hoping for.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We tried using Solr Cloud, and have given up again.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The EFF is placed in the parent of the index directory in each
>>>>>>>>>>>>>> core; each core reads the entire EFF and picks out the IDs that
>>>>>>>>>>>>>> it is responsible for.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In the current 4.0.0 release of solr, solr blocks (doesn't
>>>>>>>>>>>>>> answer queries) while re-reading the EFF. Even worse, it seems
>>>>>>>>>>>>>> that the time to re-read the EFF is multiplied by the number of
>>>>>>>>>>>>>> cores in use (i.e. the EFF is re-read by each core
>>>>>>>>>>>>>> sequentially). The contents of the EFF become active after the
>>>>>>>>>>>>>> first EXTERNAL commit (commitWithin does NOT work here) after
>>>>>>>>>>>>>> the file has been updated.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In our case, the EFF was quite large - around 450MB - and we
>>>>>>>>>>>>>> use 16 shards, so when we triggered an external commit to force
>>>>>>>>>>>>>> re-reading, the whole system would block for several (10-15)
>>>>>>>>>>>>>> minutes. This won't work in a production environment. The
>>>>>>>>>>>>>> reason for the size of the EFF is that we have around 7M
>>>>>>>>>>>>>> documents in the index; each document has a 45 character ID.
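
A rough back-of-the-envelope check of that file size, assuming one `id=value` line per document and a value about ten characters wide. The value width is a guess (the thread does not state it), so the result is only an order-of-magnitude estimate.

```python
def eff_size_bytes(num_docs, id_chars, value_chars):
    """Approximate size of a `key=value` file with one line per document:
    key + '=' + value + newline, assuming one byte per character."""
    return num_docs * (id_chars + 1 + value_chars + 1)


# 7M documents with 45-character IDs, as in the thread; value width assumed.
size = eff_size_bytes(num_docs=7_000_000, id_chars=45, value_chars=10)
print(f"{size / 1e6:.0f} MB")
```

With these assumptions the estimate lands at roughly 400 MB, the same ballpark as the reported file, and it shows that the ID length dominates the size.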
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We got some help to try to fix the problem so that the re-read
>>>>>>>>>>>>>> of the EFF proceeds in the background (see here
>>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/SOLR-3985> for a fix on
>>>>>>>>>>>>>> the 4.1 branch). However, even though the re-read proceeds in
>>>>>>>>>>>>>> the background, the time required to launch solr now takes at
>>>>>>>>>>>>>> least as long as re-reading the EFFs. Again, this is not good
>>>>>>>>>>>>>> enough for our needs.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The next issue is that you cannot sort on EFF fields (though
>>>>>>>>>>>>>> you can return them as values using &fl=field(my_eff_field)).
>>>>>>>>>>>>>> This is also fixed in the 4.1 branch here
>>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/SOLR-4022>.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So: Even after these fixes, EFF performance is not that great.
>>>>>>>>>>>>>> Our solution is as follows: The actual value of the popularity
>>>>>>>>>>>>>> measure (say, reads) that we want to report to the user is
>>>>>>>>>>>>>> inserted into the search response post-query by our query
>>>>>>>>>>>>>> front-end. This value will then be the authoritative value at
>>>>>>>>>>>>>> the time of the query. The value of the popularity measure that
>>>>>>>>>>>>>> we use for boosting in the ranking of the search results is
>>>>>>>>>>>>>> only updated when the value has changed enough so that the
>>>>>>>>>>>>>> impact on the boost will be significant (say, more than 2%).
>>>>>>>>>>>>>> This does require frequent re-indexing of the documents that
>>>>>>>>>>>>>> have significant changes in the number of reads, but at least
>>>>>>>>>>>>>> we won't have to update a document if it moves from, say,
>>>>>>>>>>>>>> 1000000 to 1000001 reads.
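
The re-indexing rule above can be sketched as a simple threshold test. This is a simplification under stated assumptions: it compares the relative change in the raw read count, whereas the message talks about the impact on the boost, and the function name and the 2% default are illustrative only.

```python
def needs_reindex(indexed_value, current_value, threshold=0.02):
    """Return True when the live value has drifted from the value stored
    in the index by more than `threshold` (relative change)."""
    if indexed_value == 0:
        return current_value != 0
    return abs(current_value - indexed_value) / abs(indexed_value) > threshold


print(needs_reindex(1_000_000, 1_000_001))  # tiny relative change
print(needs_reindex(1_000, 1_100))          # 10% change
```

A document is queued for re-indexing only when the check returns True, so popular documents with slowly drifting counters are left alone.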
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> /Martin Koch - ISSUU - senior systems architect.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni
>>>>>>>>>>>>>> <simoneg@apache.org> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> I'm planning to move a quite big Solr index to
>>>>> SolrCloud.
>>>>>>>>>> However,
>>>>>>>>>>> in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>> index, an external file field is used for
>>>> popularity
>>>>>>> ranking.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Does SolrCloud supports external file fields? How
>>>> does
>>>>> it
>>>>>>>> cope
>>>>>>>>>> with
>>>>>>>>>>>>>>> sharding and replication? Where should the external
>>>>> file
>>>>>> be
>>>>>>>>>> placed
>>>>>>>>>>> now
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> the index folder is not local but in the cloud?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Are there otherwise other best practices to deal
>>>> with
>>>>> the
>>>>>>> use
>>>>>>>>>> cases
>>>>>>>>>>>>>>> external file fields were used for, like
>>>>>>> popularity/ranking,
>>>>>>>> in
>>>>>>>>>>>>>> SolrCloud?
>>>>>>>>>>>>>>> Custom ValueSources going to something external?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>> Simone
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Sincerely yours
>>>>>>>>>>>>> Mikhail Khludnev
>>>>>>>>>>>>> Principal Engineer,
>>>>>>>>>>>>> Grid Dynamics
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <http://www.griddynamics.com>
>>>>>>>>>>>>> <mk...@griddynamics.com>
>>>>>>>>>>>>> 
>>>>>>>>>>> On 20.11.2012 18:06, "Martin Koch" <mak@issuu.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Mikhail
>>>>>>>>>>>> 
>>>>>>>>>>>> Please see answers below.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>>>>>>>>>>>> mkhludnev@griddynamics.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Martin,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you for telling your own "war-story". It's really
>>>>>> useful
>>>>>>>> for
>>>>>>>>>>>>> community.
>>>>>>>>>>>>> The first question might seems not really conscious,
>>>> but
>>>>>> would
>>>>>>>> you
>>>>>>>>>> tell
>>>>>>>>>>>> me
>>>>>>>>>>>>> what blocks searching during EFF reload, when it's
>>>>> triggered
>>>>>> by
>>>>>>>>>> handler
>>>>>>>>>>>> or
>>>>>>>>>>>>> by listener?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> We continuously index new documents using CommitWithin
>>>> to get
>>>>>>>> regular
>>>>>>>>>>>> commits. However, we observed that the EFFs were not
>>>> re-read,
>>>>>> so
>>>>>>> we
>>>>>>>>> had
>>>>>>>>>>> to
>>>>>>>>>>>> do external commits (curl '.../solr/update?commit=true')
>>>> to
>>>>>> force
>>>>>>>>>> reload.
>>>>>>>>>>>> When this is done, solr blocks. I can't tell you exactly
>>>> why
>>>>>> it's
>>>>>>>>> doing
>>>>>>>>>>>> that (it was related to SOLR-3985).
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> I don't really get the sentence about sequential
>>>> commits
>>>>> and
>>>>>>>> number
>>>>>>>>>> of
>>>>>>>>>>>>> cores. Do I get right that file is replicated via
>>>>> Zookeeper?
>>>>>>>>> Doesn't
>>>>>>>>>> it
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Again, this is observed behavior. When we issue a commit
>>>> on a
>>>>>>>> system
>>>>>>>>>>> with a
>>>>>>>>>>>> system with many solr cores using EFFs, the system blocks
>>>>> for a
>>>>>>>> long
>>>>>>>>>> time
>>>>>>>>>>>> (15 minutes).  We do NOT use zookeeper for anything. The
>>>> EFF
>>>>>> is a
>>>>>>>>>> symlink
>>>>>>>>>>>> from each cores index dir to the actual file, which is
>>>>> updated
>>>>>> by
>>>>>>>> an
>>>>>>>>>>>> external process.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> causes scalability problem or long time to reload?
>>>> Will it
>>>>>> help
>>>>>>>> if
>>>>>>>>>>> we'll
>>>>>>>>>>>>> have, let's say ExternalDatabaseField which will pull
>>>>> values
>>>>>>> from
>>>>>>>>>> jdbc.
>>>>>>>>>>>> ie.
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the possibility of having some fields being
>>>> retrieved
>>>>>>> from
>>>>>>>> an
>>>>>>>>>>>> external, dynamically updatable store would be really
>>>>>>> interesting.
>>>>>>>>> This
>>>>>>>>>>>> could be JDBC, something in-memory like redis, or a NoSql
>>>>>> product
>>>>>>>>> (e.g.
>>>>>>>>>>>> Cassandra).
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> why all cores can't read these values simultaneously?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Again, this is a solr implementation detail that I can't
>>>>> answer
>>>>>>> :)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> Can you confirm that IDs in the file is ordered by the
>>>>> index
>>>>>>> term
>>>>>>>>>>> order?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, we sorted the files (standard UNIX sort).
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> AFAIK it can impact load time.
>>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, it does.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> Regarding your post-query solution can you tell me if
>>>> query
>>>>>>> found
>>>>>>>>>> 10000
>>>>>>>>>>>>> docs, but I need to display only first page with 100
>>>> rows,
>>>>>>>> whether
>>>>>>>>> I
>>>>>>>>>>> need
>>>>>>>>>>>>> to pull all 10K results to frontend to order them by
>>>> the
>>>>>> rank?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> In our architecture, the clients query an API that
>>>> generates
>>>>>> the
>>>>>>>> SOLR
>>>>>>>>>>>> query, retrieves the relevant additional fields that we
>>>>> needs,
>>>>>>> and
>>>>>>>>>>> returns
>>>>>>>>>>>> the relevant JSON to the front-end.
>>>>>>>>>>>> 
>>>>>>>>>>>> In our use case, results are returned from SOLR by the
>>>> 10's,
>>>>>> not
>>>>>>> by
>>>>>>>>> the
>>>>>>>>>>>> 1000's, so it is a manageable job. Even so, if solr
>>>> returned
>>>>>>>>> thousands
>>>>>>>>>> of
>>>>>>>>>>>> results, it would be up to the implementation of the api
>>>> to
>>>>>>> augment
>>>>>>>>>> only
>>>>>>>>>>>> the results that needed to be returned to the front-end.
>>>>>>>>>>>> 
>>>>>>>>>>>> Even so, patching up a JSON structure with 10000 results
>>>>> should
>>>>>>> be
>>>>>>>>>>>> possible.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> I'm really appreciate if you comment on the questions
>>>>> above.
>>>>>>>>>>>>> PS: It's time to pitch, how much
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-4085
>>>> "Commit-free
>>>>>>>>>>>>> ExternalFileField" can help you?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It looks very interesting :) Does it make it possible
>>>> to
>>>>>> avoid
>>>>>>>>>>> re-reading
>>>>>>>>>>>> the EFF on every commit, and only re-read the values that
>>>>> have
>>>>>>>>> actually
>>>>>>>>>>>> changed?
>>>>>>>>>>>> 
>>>>>>>>>>>> /Martin
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
>>>>> mak@issuu.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Solr 4.0 does support using EFFs, but it might not
>>>> give
>>>>> you
>>>>>>>> what
>>>>>>>>>>> you're
>>>>>>>>>>>>>> hoping fore.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We tried using Solr Cloud, and have given up again.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The EFF is placed in the parent of the index
>>>> directory in
>>>>>>> each
>>>>>>>>>> core;
>>>>>>>>>>>> each
>>>>>>>>>>>>>> core reads the entire EFF and picks out the IDs that
>>>> it
>>>>> is
>>>>>>>>>>> responsible
>>>>>>>>>>>>> for.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In the current 4.0.0 release of solr, solr blocks
>>>>> (doesn't
>>>>>>>> answer
>>>>>>>>>>>>> queries)
>>>>>>>>>>>>>> while re-reading the EFF. Even worse, it seems that
>>>> the
>>>>>> time
>>>>>>> to
>>>>>>>>>>> re-read
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> EFF is multiplied by the number of cores in use
>>>> (i.e. the
>>>>>> EFF
>>>>>>>> is
>>>>>>>>>>>> re-read
>>>>>>>>>>>>> by
>>>>>>>>>>>>>> each core sequentially). The contents of the EFF
>>>> become
>>>>>>> active
>>>>>>>>>> after
>>>>>>>>>>>> the
>>>>>>>>>>>>>> first EXTERNAL commit (commitWithin does NOT work
>>>> here)
>>>>>> after
>>>>>>>> the
>>>>>>>>>>> file
>>>>>>>>>>>>> has
>>>>>>>>>>>>>> been updated.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In our case, the EFF was quite large - around 450MB
>>>> - and
>>>>>> we
>>>>>>>> use
>>>>>>>>> 16
>>>>>>>>>>>>> shards,
>>>>>>>>>>>>>> so when we triggered an external commit to force
>>>>>> re-reading,
>>>>>>>> the
>>>>>>>>>>> whole
>>>>>>>>>>>>>> system would block for several (10-15) minutes. This
>>>>> won't
>>>>>>> work
>>>>>>>>> in
>>>>>>>>>> a
>>>>>>>>>>>>>> production environment. The reason for the size of
>>>> the
>>>>> EFF
>>>>>> is
>>>>>>>>> that
>>>>>>>>>> we
>>>>>>>>>>>>> have
>>>>>>>>>>>>>> around 7M documents in the index; each document has
>>>> a 45
>>>>>>>>> character
>>>>>>>>>>> ID.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We got some help to try to fix the problem so that
>>>> the
>>>>>>> re-read
>>>>>>>> of
>>>>>>>>>> the
>>>>>>>>>>>> EFF
>>>>>>>>>>>>>> proceeds in the background (see
>>>>>>>>>>>>>> here<https://issues.apache.org/jira/browse/SOLR-3985
>>>>> 
>>>>> for
>>>>>>>>>>>>>> a fix on the 4.1 branch). However, even though the
>>>>> re-read
>>>>>>>>> proceeds
>>>>>>>>>>> in
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> background, the time required to launch solr now
>>>> takes at
>>>>>>> least
>>>>>>>>> as
>>>>>>>>>>> long
>>>>>>>>>>>>> as
>>>>>>>>>>>>>> re-reading the EFFs. Again, this is not good enough
>>>> for
>>>>> our
>>>>>>>>> needs.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The next issue is that you cannot sort on EFF fields
>>>>>> (though
>>>>>>>> you
>>>>>>>>>> can
>>>>>>>>>>>>> return
>>>>>>>>>>>>>> them as values using &fl=field(my_eff_field). This is
>>>>> also
>>>>>>>> fixed
>>>>>>>>> in
>>>>>>>>>>> the
>>>>>>>>>>>>> 4.1
>>>>>>>>>>>>>> branch here <
>>>>>> https://issues.apache.org/jira/browse/SOLR-4022
>>>>>>>> .
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So: Even after these fixes, EFF performance is not
>>>> that
>>>>>>> great.
>>>>>>>>> Our
>>>>>>>>>>>>> solution
>>>>>>>>>>>>>> is as follows: The actual value of the popularity
>>>> measure
>>>>>>> (say,
>>>>>>>>>>> reads)
>>>>>>>>>>>>> that
>>>>>>>>>>>>>> we want to report to the user is inserted into the
>>>> search
>>>>>>>>> response
>>>>>>>>>>>>>> post-query by our query front-end. This value will
>>>> then
>>>>> be
>>>>>>> the
>>>>>>>>>>>>>> authoritative value at the time of the query. The
>>>> value
>>>>> of
>>>>>>> the
>>>>>>>>>>>> popularity
>>>>>>>>>>>>>> measure that we use for boosting in the ranking of
>>>> the
>>>>>> search
>>>>>>>>>> results
>>>>>>>>>>>> is
>>>>>>>>>>>>>> only updated when the value has changed enough so
>>>> that
>>>>> the
>>>>>>>> impact
>>>>>>>>>> on
>>>>>>>>>>>> the
>>>>>>>>>>>>>> boost will be significant (say, more than 2%). This
>>>> does
>>>>>>>> require
>>>>>>>>>>>> frequent
>>>>>>>>>>>>>> re-indexing of the documents that have significant
>>>>> changes
>>>>>> in
>>>>>>>> the
>>>>>>>>>>>> number
>>>>>>>>>>>>> of
>>>>>>>>>>>>>> reads, but at least we won't have to update a
>>>> document if
>>>>>> it
>>>>>>>>> moves
>>>>>>>>>>>> from,
>>>>>>>>>>>>>> say, 1000000 to 1000001 reads.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> /Martin Koch - ISSUU - senior systems architect.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
>>>>>>>>> simoneg@apache.org
>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> I'm planning to move a quite big Solr index to
>>>>> SolrCloud.
>>>>>>>>>> However,
>>>>>>>>>>> in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>> index, an external file field is used for
>>>> popularity
>>>>>>> ranking.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Does SolrCloud supports external file fields? How
>>>> does
>>>>> it
>>>>>>>> cope
>>>>>>>>>> with
>>>>>>>>>>>>>>> sharding and replication? Where should the external
>>>>> file
>>>>>> be
>>>>>>>>>> placed
>>>>>>>>>>>> now
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> the index folder is not local but in the cloud?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Are there otherwise other best practices to deal
>>>> with
>>>>> the
>>>>>>> use
>>>>>>>>>> cases
>>>>>>>>>>>>>>> external file fields were used for, like
>>>>>>> popularity/ranking,
>>>>>>>> in
>>>>>>>>>>>>>> SolrCloud?
>>>>>>>>>>>>>>> Custom ValueSources going to something external?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>> Simone
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Sincerely yours
>>>>>>>>>>>>> Mikhail Khludnev
>>>>>>>>>>>>> Principal Engineer,
>>>>>>>>>>>>> Grid Dynamics
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <http://www.griddynamics.com>
>>>>>>>>>>>>> <mk...@griddynamics.com>
>>>>>>>>>>>>> 
>>

Re: SolrCloud and exernal file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Martin,
Right: as long as the node is registered in Zookeeper,
DistributedUpdateProcessor will broadcast commits to all peers. To work
around this, you can introduce a dedicated UpdateProcessorChain without
DistributedUpdateProcessor and send the commit to that chain.
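
A minimal sketch of what such a chain could look like in solrconfig.xml. It
assumes a Solr version that ships NoOpDistributingUpdateProcessorFactory; the
chain name "local-commit" is made up for this example, and whether this keeps
the commit from reaching the other cores in this particular setup is untested.

```xml
<!-- solrconfig.xml fragment (sketch): a chain that applies updates locally only.
     The chain name "local-commit" is hypothetical. -->
<updateRequestProcessorChain name="local-commit">
  <!-- Assumption: declaring this no-op factory stops Solr from automatically
       inserting DistributedUpdateProcessorFactory ahead of
       RunUpdateProcessorFactory. -->
  <processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A request would then select the chain with the update.chain parameter, for
example .../solr/coreN/update?commit=true&update.chain=local-commit, sent to
each core in turn.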
On 28.11.2012 13:16, "Martin Koch" <ma...@issuu.com> wrote:

> Mikhail
>
> I haven't experimented further yet. I think that the previous experiment
> of issuing a commit to a specific core proved that all cores get the
> commit, so I don't think that this approach will work.
>
> Thanks,
> /Martin
>
>
> On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Martin,
>>
>> It's still not clear to me whether you solved the problem completely or
>> partially:
>> Does reducing the number of cores free up resources for searching during
>> a commit?
>> Does committing the cores one by one prevent the "freeze"?
>>
>> Thanks
>>
>>
>> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <ma...@issuu.com> wrote:
>>
>>> Mikhail
>>>
>>> To avoid freezes we deployed the patches that are now on the 4.1 trunk
>>> (bug 3985). But this wasn't good enough, because SOLR would still take
>>> very long to restart when that was necessary.
>>>
>>> I don't see how we could throw more hardware at the problem without
>>> making it worse, really - the only solution here would be *fewer* shards,
>>> not more.
>>>
>>> IMO it would be ideal if the lucene/solr community could come up with a
>>> good way of updating fields in a document without reindexing. This could
>>> be by linking to some external data store, or in the lucene/solr
>>> internals. If it would make things easier, a good first step would be to
>>> have dynamically updatable numerical fields only.
>>>
>>> /Martin
>>>
>>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
>>> mkhludnev@griddynamics.com> wrote:
>>>
>>> > Martin,
>>> >
>>> > I don't think solrconfig.xml would shed any light on this. I've just
>>> > found what I didn't get in your setup: how to explicitly assign a core
>>> > to a collection. Now I understand most of the details after all!
>>> > The ball is in your court: let us know whether you managed to make your
>>> > cores commit one by one to avoid the freeze, or whether you could
>>> > eliminate the pauses by allocating more hardware.
>>> > Thanks in advance!
>>> >
>>> >
>>> > On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <ma...@issuu.com> wrote:
>>> >
>>> > > Mikhail,
>>> > >
>>> > > PSB
>>> > >
>>> > > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
>>> > > mkhludnev@griddynamics.com> wrote:
>>> > >
>>> > > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com>
>>> wrote:
>>> > > >
>>> > > > >
>>> > > > > I wasn't aware until now that it is possible to send a commit to
>>> > > > > one core only. What we observed was the effect of curl
>>> > > > > localhost:8080/solr/update?commit=true but perhaps we should
>>> > > > > experiment with solr/coreN/update?commit=true. A quick trial run
>>> > > > > seems to indicate that a commit to a single core causes commits on
>>> > > > > all cores.
>>> > > > >
>>> > > > You should see something like this in the log:
>>> > > > ... SolrCmdDistributor .... Distrib commit to: ...
>>> > > >
>>> > > > Yup, a commit towards a single core results in a commit on all cores.
>>> > >
>>> > >
>>> > > > >
>>> > > > >
>>> > > > > Perhaps I should clarify that we are using SOLR as a black box;
>>> we do
>>> > > not
>>> > > > > touch the code at all - we only install the distribution WAR
>>> file and
>>> > > > > proceed from there.
>>> > > > >
>>> > > > > I still don't understand how you deploy/launch Solr. How many Jetty
>>> > > > > instances do you start? Do you run with -DzkRun -DzkHost
>>> > > > > -DnumShards=2, or do you specify the shards= param for every
>>> > > > > request and distribute updates yourself? What collections do you
>>> > > > > create, and with which settings?
>>> > > >
>>> > > > We let SOLR do the sharding using one collection with 16 SOLR cores
>>> > > holding one shard each. We launch only one instance of jetty with the
>>> > > following arguments:
>>> > >
>>> > > -DnumShards=16
>>> > > -DzkHost=<zookeeperhost:port>
>>> > > -Xmx10G
>>> > > -Xms10G
>>> > > -Xmn2G
>>> > > -server
>>> > >
>>> > > Would you like to see the solrconfig.xml?
>>> > >
>>> > > /Martin
>>> > >
>>> > >
>>> > > > >
>>> > > > >
>>> > > > > > Also, from my POV such deployments should start from at least
>>> > > > > > *16* 4-way vboxes; it's more expensive, but availability should
>>> > > > > > be much better during CPU-consuming operations.
>>> > > > > >
>>> > > > >
>>> > > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4
>>> > > > > hosts with 16 cores? Or am I misunderstanding something :) ?
>>> > > > >
>>> > > > I prefer to start from 16 hosts with 4 cores each.
>>> > > >
>>> > > >
>>> > > > >
>>> > > > >
>>> > > > > > Other details: if you use a single jetty for all of them, are you
>>> > > > > > sure that jetty's threadpool doesn't limit requests? Is it large
>>> > > > > > enough?
>>> > > > > > You have 60G and set -Xmx10G. Are you sure that the total size of
>>> > > > > > the cores' index directories is less than 45G?
>>> > > > > >
>>> > > > > > The total index size is 230 GB, so it won't fit in RAM, but we're
>>> > > > > using an SSD disk to minimize disk access time. We have tried
>>> > > > > putting the EFF onto a RAM disk, but this didn't have a measurable
>>> > > > > effect.
>>> > > > >
>>> > > > > Thanks,
>>> > > > > /Martin
>>> > > > >
>>> > > > >
>>> > > > > > Thanks
>>> > > > > >
>>> > > > > >
>>> > > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com>
>>> > wrote:
>>> > > > > >
>>> > > > > > > Mikhail
>>> > > > > > >
>>> > > > > > > PSB
>>> > > > > > >
>>> > > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
>>> > > > > > > mkhludnev@griddynamics.com> wrote:
>>> > > > > > >
>>> > > > > > > > Martin,
>>> > > > > > > >
>>> > > > > > > > Please find additional question from me below.
>>> > > > > > > >
>>> > > > > > > > Simone,
>>> > > > > > > >
>>> > > > > > > > I'm sorry for hijacking your thread. The only thing I've heard
>>> > > > > > > > about it at recent ApacheCon sessions is that Zookeeper is
>>> > > > > > > > supposed to replicate those files as configs under solr home.
>>> > > > > > > > And I'm really looking forward to learning how it works with
>>> > > > > > > > huge files in production.
>>> > > > > > > >
>>> > > > > > > > Thank You, Guys!
>>> > > > > > > >
>>> > > > > > > > On 20.11.2012 18:06, "Martin Koch" <mak@issuu.com> wrote:
>>> > > > > > > > >
>>> > > > > > > > > Hi Mikhail
>>> > > > > > > > >
>>> > > > > > > > > Please see answers below.
>>> > > > > > > > >
>>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>>> > > > > > > > > mkhludnev@griddynamics.com> wrote:
>>> > > > > > > > >
>>> > > > > > > > > > Martin,
>>> > > > > > > > > >
>>> > > > > > > > > > Thank you for telling your own "war-story". It's really
>>> > > useful
>>> > > > > for
>>> > > > > > > > > > community.
>>> > > > > > > > > > The first question might seems not really conscious,
>>> but
>>> > > would
>>> > > > > you
>>> > > > > > > tell
>>> > > > > > > > me
>>> > > > > > > > > > what blocks searching during EFF reload, when it's
>>> > triggered
>>> > > by
>>> > > > > > > handler
>>> > > > > > > > or
>>> > > > > > > > > > by listener?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > We continuously index new documents using CommitWithin
>>> to get
>>> > > > > regular
>>> > > > > > > > > commits. However, we observed that the EFFs were not
>>> re-read,
>>> > > so
>>> > > > we
>>> > > > > > had
>>> > > > > > > > to
>>> > > > > > > > > do external commits (curl '.../solr/update?commit=true')
>>> to
>>> > > force
>>> > > > > > > reload.
>>> > > > > > > > > When this is done, solr blocks. I can't tell you exactly
>>> why
>>> > > it's
>>> > > > > > doing
>>> > > > > > > > > that (it was related to SOLR-3985).
>>> > > > > > > >
>>> > > > > > > > Is there a chance to get a thread dump when they are blocked?
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > Well I could try to recreate the situation. But the setup is
>>> > > > > > > fairly simple: Create a large EFF in a largeish index with many
>>> > > > > > > shards. Issue a commit, and then try to do a search. Solr will
>>> > > > > > > not respond to the search before the commit has completed, and
>>> > > > > > > this will take a long time.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > I don't really get the sentence about sequential
>>> commits
>>> > and
>>> > > > > number
>>> > > > > > > of
>>> > > > > > > > > > cores. Do I get right that file is replicated via
>>> > Zookeeper?
>>> > > > > > Doesn't
>>> > > > > > > it
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Again, this is observed behavior. When we issue a commit
>>> on a
>>> > > > > system
>>> > > > > > > with
>>> > > > > > > > a
>>> > > > > > > > > system with many solr cores using EFFs, the system blocks
>>> > for a
>>> > > > > long
>>> > > > > > > time
>>> > > > > > > > > (15 minutes).  We do NOT use zookeeper for anything. The
>>> EFF
>>> > > is a
>>> > > > > > > symlink
>>> > > > > > > > > from each cores index dir to the actual file, which is
>>> > updated
>>> > > by
>>> > > > > an
>>> > > > > > > > > external process.
>>> > > > > > > >
>>> > > > > > > > Hold on, I asked about Zookeeper because the subject mentions
>>> > > > > > > > SolrCloud.
>>> > > > > > > >
>>> > > > > > > > Do you use SolrCloud, SolrShards, or are these cores just
>>> > > > > > > > replicas of the same index?
>>> > > > > > > >
>>> > > > > > >
>>> > > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
>>> > > > > > > I'm a bit unsure about the terminology here, but we've got a
>>> > > > > > > single index divided into 16 shards. Each shard is hosted in a
>>> > > > > > > solr core.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > > Also, about the symlink - don't you share that file via some
>>> > > > > > > > NFS?
>>> > > > > > > >
>>> > > > > > > > No, we generate the EFF on the local solr host (there is only
>>> > > > > > > one physical host that holds all shards), so there is no need
>>> > > > > > > for NFS or copying files around. No need for Zookeeper either.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > > How many cores do you run per box?
>>> > > > > > > >
>>> > > > > > > This box is a 16-virtual core (8 hyperthreaded cores) machine
>>> > > > > > > with 60GB of RAM. We run 16 solr cores on this box in Jetty.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > > Do the boxes have plenty of RAM to cache the filesystem
>>> > > > > > > > besides the JVM heaps?
>>> > > > > > > >
>>> > > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the
>>> > > > > > > OS.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > > I assume you use 64 bit linux and mmap directory. Please
>>> > > > > > > > confirm that.
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > We use 64-bit linux. I'm not sure about the mmap directory or
>>> > > > > > > where that would be configured in solr - can you explain that?
>>> > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > causes scalability problem or long time to reload?
>>> Will it
>>> > > help
>>> > > > > if
>>> > > > > > > > we'll
>>> > > > > > > > > > have, let's say ExternalDatabaseField which will pull
>>> > values
>>> > > > from
>>> > > > > > > jdbc.
>>> > > > > > > > ie.
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > I think the possibility of having some fields being
>>> retrieved
>>> > > > from
>>> > > > > an
>>> > > > > > > > > external, dynamically updatable store would be really
>>> > > > interesting.
>>> > > > > > This
>>> > > > > > > > > could be JDBC, something in-memory like redis, or a NoSql
>>> > > product
>>> > > > > > (e.g.
>>> > > > > > > > > Cassandra).
>>> > > > > > > >
>>> > > > > > > > Ok. Let's have it in mind as a possible direction.
>>> > > > > > > >
>>> > > > > > >
>>> > > > > > > Alternatively, an API that would allow updating a single
>>> field
>>> > for
>>> > > a
>>> > > > > > > document might be an option.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > why all cores can't read these values simultaneously?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Again, this is a solr implementation detail that I can't
>>> > answer
>>> > > > :)
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > Can you confirm that IDs in the file is ordered by the
>>> > index
>>> > > > term
>>> > > > > > > > order?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > AFAIK it can impact load time.
>>> > > > > > > > > >
>>> > > > > > > > > Yes, it does
>>> > > > > > > >
>>> > > > > > > > Ok, I've got that you aware of it, and your IDs are just
>>> > strings,
>>> > > > not
>>> > > > > > > > integers.
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > Yes, ids are strings.
>>> > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > Regarding your post-query solution can you tell me if
>>> query
>>> > > > found
>>> > > > > > > 10000
>>> > > > > > > > > > docs, but I need to display only first page with 100
>>> rows,
>>> > > > > whether
>>> > > > > > I
>>> > > > > > > > need
>>> > > > > > > > > > to pull all 10K results to frontend to order them by
>>> the
>>> > > rank?
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > In our architecture, the clients query an API that
>>> generates
>>> > > the
>>> > > > > SOLR
>>> > > > > > > > > query, retrieves the relevant additional fields that we
>>> > needs,
>>> > > > and
>>> > > > > > > > returns
>>> > > > > > > > > the relevant JSON to the front-end.
>>> > > > > > > > >
>>> > > > > > > > > In our use case, results are returned from SOLR by the
>>> 10's,
>>> > > not
>>> > > > by
>>> > > > > > the
>>> > > > > > > > > 1000's, so it is a manageable job. Even so, if solr
>>> returned
>>> > > > > > thousands
>>> > > > > > > of
>>> > > > > > > > > results, it would be up to the implementation of the api
>>> to
>>> > > > augment
>>> > > > > > > only
>>> > > > > > > > > the results that needed to be returned to the front-end.
>>> > > > > > > > >
>>> > > > > > > > > Even so, patching up a JSON structure with 10000 results
>>> > should
>>> > > > be
>>> > > > > > > > > possible.
>>> > > > > > > >
>>> > > > > > > > You are right. I'm concerned anyway because retrieving the
>>> > > > > > > > whole result is expensive, and not always possible.
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > In our case, getting the whole result is almost impossible,
>>> > > > > > > because that would be millions of documents, and returning the
>>> > > > > > > Nth result seems to be a quadratic (or worse) operation in SOLR.
>>> > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > I'm really appreciate if you comment on the questions
>>> > above.
>>> > > > > > > > > > PS: It's time to pitch, how much
>>> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085
>>> "Commit-free
>>> > > > > > > > > > ExternalFileField" can help you?
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > It looks very interesting :) Does it make it possible
>>> to
>>> > > avoid
>>> > > > > > > > re-reading
>>> > > > > > > > > the EFF on every commit, and only re-read the values that
>>> > have
>>> > > > > > actually
>>> > > > > > > > > changed?
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > You don't need a commit (in SOLR-4085) to reload the file
>>> > > > > > > > content, but after a commit you need to read the whole file
>>> > > > > > > > and scan all key terms and postings. That's because the EFF
>>> > > > > > > > sits on top of the top-level searcher; it's the Solr-like way.
>>> > > > > > > > In the future we might have a per-segment EFF; in that case
>>> > > > > > > > adding a segment will trigger a full file scan, but in the
>>> > > > > > > > index only that new segment will be scanned. It should be
>>> > > > > > > > faster. You know, straightforward sharing of internal data
>>> > > > > > > > structures between different index views/generations is not
>>> > > > > > > > possible.
>>> > > > > > > > If you are asking about applying delta changes to the external
>>> > > > > > > > file, that's something we did ourselves: http://goo.gl/P8GFq .
>>> > > > > > > > This feature is much more doubtful and vague, although it
>>> > > > > > > > might be the next contribution after SOLR-4085.
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > /Martin
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
>>> > mak@issuu.com>
>>> > > > > > wrote:
>>> > > > > > > > > >
>>> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not
>>> give
>>> > you
>>> > > > > what
>>> > > > > > > > you're
>>> > > > > > > > > > > hoping fore.
>>> > > > > > > > > > >
>>> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
>>> > > > > > > > > > >
>>> > > > > > > > > > > The EFF is placed in the parent of the index
>>> directory in
>>> > > > each
>>> > > > > > > core;
>>> > > > > > > > each
>>> > > > > > > > > > > core reads the entire EFF and picks out the IDs that
>>> it
>>> > is
>>> > > > > > > > responsible
>>> > > > > > > > > > for.
>>> > > > > > > > > > >
>>> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't answer
>>> > > > > > > > > > > queries) while re-reading the EFF. Even worse, it seems that the
>>> > > > > > > > > > > time to re-read the EFF is multiplied by the number of cores in
>>> > > > > > > > > > > use (i.e. the EFF is re-read by each core sequentially). The
>>> > > > > > > > > > > contents of the EFF become active after the first EXTERNAL commit
>>> > > > > > > > > > > (commitWithin does NOT work here) after the file has been
>>> > > > > > > > > > > updated.
>>> > > > > > > > > > >
>>> > > > > > > > > > > In our case, the EFF was quite large - around 450MB - and we use
>>> > > > > > > > > > > 16 shards, so when we triggered an external commit to force
>>> > > > > > > > > > > re-reading, the whole system would block for several (10-15)
>>> > > > > > > > > > > minutes. This won't work in a production environment. The reason
>>> > > > > > > > > > > for the size of the EFF is that we have around 7M documents in
>>> > > > > > > > > > > the index; each document has a 45 character ID.
>>> > > > > > > > > > >
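The file size quoted above can be sanity-checked with a little arithmetic. This is only a rough Python sketch: it assumes the usual one-line-per-document `id=value` layout with single-byte characters, and the value width of 17 characters is a guess (not stated in the thread) chosen to show that a figure near 450MB is plausible for 7M documents with 45-character IDs.

```python
def estimate_eff_bytes(num_docs: int, id_chars: int, value_chars: int) -> int:
    """Estimate the size of an external file with one `id=value` line per
    document, assuming single-byte characters."""
    # Each line holds: the id, '=', the value, and a newline.
    bytes_per_line = id_chars + 1 + value_chars + 1
    return num_docs * bytes_per_line


if __name__ == "__main__":
    # value_chars=17 is an assumption; the thread only gives the ID length.
    size = estimate_eff_bytes(num_docs=7_000_000, id_chars=45, value_chars=17)
    print(f"{size / 1e6:.0f} MB")
```

Since every core reads the whole file, the bytes scanned per reload scale with the number of cores as well as with the file size.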
>>> > > > > > > > > > > We got some help to try to fix the problem so that the re-read of
>>> > > > > > > > > > > the EFF proceeds in the background (see
>>> > > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-3985 for a fix on the
>>> > > > > > > > > > > 4.1 branch). However, even though the re-read proceeds in the
>>> > > > > > > > > > > background, launching solr now takes at least as long as
>>> > > > > > > > > > > re-reading the EFFs. Again, this is not good enough for our
>>> > > > > > > > > > > needs.
>>> > > > > > > > > > >
>>> > > > > > > > > > > The next issue is that you cannot sort on EFF fields (though you
>>> > > > > > > > > > > can return them as values using &fl=field(my_eff_field)). This is
>>> > > > > > > > > > > also fixed in the 4.1 branch, see
>>> > > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4022 .
>>> > > > > > > > > > >
>>> > > > > > > > > > > So: Even after these fixes, EFF performance is not that great.
>>> > > > > > > > > > > Our solution is as follows: The actual value of the popularity
>>> > > > > > > > > > > measure (say, reads) that we want to report to the user is
>>> > > > > > > > > > > inserted into the search response post-query by our query
>>> > > > > > > > > > > front-end. This value will then be the authoritative value at the
>>> > > > > > > > > > > time of the query. The value of the popularity measure that we
>>> > > > > > > > > > > use for boosting in the ranking of the search results is only
>>> > > > > > > > > > > updated when the value has changed enough so that the impact on
>>> > > > > > > > > > > the boost will be significant (say, more than 2%). This does
>>> > > > > > > > > > > require frequent re-indexing of the documents that have
>>> > > > > > > > > > > significant changes in the number of reads, but at least we won't
>>> > > > > > > > > > > have to update a document if it moves from, say, 1000000 to
>>> > > > > > > > > > > 1000001 reads.
>>> > > > > > > > > > >
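The threshold rule described above might look roughly like this in Python. This is a sketch under stated assumptions: the logarithmic `boost` function, the function names and the 2% default are illustrative only, since the thread does not say which boost function is actually used.

```python
import math


def boost(reads: int) -> float:
    """Hypothetical boost: grows with the logarithm of the read count."""
    return math.log10(reads + 10)


def needs_reindex(indexed_reads: int, current_reads: int,
                  threshold: float = 0.02) -> bool:
    """Return True when the boost has drifted by more than `threshold`
    (relative) from the boost of the value currently in the index."""
    old = boost(indexed_reads)
    new = boost(current_reads)
    return abs(new - old) / old > threshold


if __name__ == "__main__":
    print(needs_reindex(1_000_000, 1_000_001))  # tiny change in the boost
    print(needs_reindex(100, 1_000))            # large change in the boost
```

Comparing boosts rather than raw counts means a popular document absorbs many new reads before it is re-indexed, while a rarely read one is refreshed after relatively few.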
>>> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
>>> > > > > > > > > > >
>>> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
>>> > > > > > simoneg@apache.org
>>> > > > > > > >
>>> > > > > > > > > > wrote:
>>> > > > > > > > > > >
>>> > > > > > > > > > > > Hi all,
>>> > > > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
>>> > > > > > > > > > > > However, in this index, an external file field is used for
>>> > > > > > > > > > > > popularity ranking.
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Does SolrCloud support external file fields? How does it cope
>>> > > > > > > > > > > > with sharding and replication? Where should the external file
>>> > > > > > > > > > > > be placed now that the index folder is not local but in the
>>> > > > > > > > > > > > cloud?
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Are there otherwise other best practices to deal with the use
>>> > > > > > > > > > > > cases external file fields were used for, like
>>> > > > > > > > > > > > popularity/ranking, in SolrCloud?
>>> > > > > > > > > > > > Custom ValueSources going to something external?
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Thanks in advance,
>>> > > > > > > > > > > > Simone
>>> > > > > > > > > > > >
>>> > > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > --
>>> > > > > > > > > > Sincerely yours
>>> > > > > > > > > > Mikhail Khludnev
>>> > > > > > > > > > Principal Engineer,
>>> > > > > > > > > > Grid Dynamics
>>> > > > > > > > > >
>>> > > > > > > > > > <http://www.griddynamics.com>
>>> > > > > > > > > >  <mk...@griddynamics.com>
>>> > > > > > > > > >
>>> > > > > > > >  20.11.2012 18:06 пользователь "Martin Koch" <
>>> mak@issuu.com>
>>> > > > > написал:
>>> > > > > > > >
>>> > > > > > > > > Hi Mikhail
>>> > > > > > > > >
>>> > > > > > > > > Please see answers below.
>>> > > > > > > > >
>>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>>> > > > > > > > > mkhludnev@griddynamics.com> wrote:
>>> > > > > > > > >
>>> > > > > > > > > > Martin,
>>> > > > > > > > > >
>>> > > > > > > > > > Thank you for telling your own "war-story". It's really useful
>>> > > > > > > > > > for the community.
>>> > > > > > > > > > The first question might seem naive, but would you tell me what
>>> > > > > > > > > > blocks searching during EFF reload, when it's triggered by a
>>> > > > > > > > > > handler or by a listener?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > We continuously index new documents using CommitWithin to get
>>> > > > > > > > > regular commits. However, we observed that the EFFs were not
>>> > > > > > > > > re-read, so we had to do external commits
>>> > > > > > > > > (curl '.../solr/update?commit=true') to force reload. When this
>>> > > > > > > > > is done, solr blocks. I can't tell you exactly why it's doing
>>> > > > > > > > > that (it was related to SOLR-3985).
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > I don't really get the sentence about sequential commits and
>>> > > > > > > > > > number of cores. Do I get right that file is replicated via
>>> > > > > > > > > > Zookeeper? Doesn't it
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Again, this is observed behavior. When we issue a commit on a
>>> > > > > > > > > system with many solr cores using EFFs, the system blocks for a
>>> > > > > > > > > long time (15 minutes).  We do NOT use zookeeper for anything.
>>> > > > > > > > > The EFF is a symlink from each core's index dir to the actual
>>> > > > > > > > > file, which is updated by an external process.
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > causes scalability problem or long time to reload? Will it help
>>> > > > > > > > > > if we'll have, let's say ExternalDatabaseField which will pull
>>> > > > > > > > > > values from jdbc. ie.
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > I think the possibility of having some fields being retrieved
>>> > > > > > > > > from an external, dynamically updatable store would be really
>>> > > > > > > > > interesting. This could be JDBC, something in-memory like redis,
>>> > > > > > > > > or a NoSql product (e.g. Cassandra).
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > why all cores can't read these values simultaneously?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Again, this is a solr implementation detail that I can't answer :)
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > Can you confirm that IDs in the file are ordered by the index
>>> > > > > > > > > > term order?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > AFAIK it can impact load time.
>>> > > > > > > > > >
>>> > > > > > > > > Yes, it does.
>>> > > > > > > > >
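For illustration, writing the external file in sorted order could be sketched in Python as below. The assumptions here are the usual one-line-per-document `id=value` layout and that index term order corresponds to the byte order of the UTF-8 encoded IDs; `write_eff` and the sample IDs are hypothetical names. Note that a plain UNIX `sort` may follow locale collation rather than byte order unless run with `LC_ALL=C`.

```python
import io
from typing import Dict, TextIO


def write_eff(values: Dict[str, float], out: TextIO) -> None:
    """Write one `id=value` line per document, ordered by the UTF-8 bytes
    of the id."""
    for doc_id in sorted(values, key=lambda k: k.encode("utf-8")):
        out.write(f"{doc_id}={values[doc_id]}\n")


if __name__ == "__main__":
    buf = io.StringIO()
    write_eff({"doc-b": 2.5, "doc-a": 1.0, "Doc-c": 0.5}, buf)
    print(buf.getvalue(), end="")
```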
>>> > > > > > > > >
>>> > > > > > > > > > Regarding your post-query solution, can you tell me: if a query
>>> > > > > > > > > > found 10000 docs, but I need to display only the first page
>>> > > > > > > > > > with 100 rows, do I need to pull all 10K results to the
>>> > > > > > > > > > frontend to order them by rank?
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > In our architecture, the clients query an API that generates the
>>> > > > > > > > > SOLR query, retrieves the relevant additional fields that we
>>> > > > > > > > > need, and returns the relevant JSON to the front-end.
>>> > > > > > > > >
>>> > > > > > > > > In our use case, results are returned from SOLR by the 10's, not
>>> > > > > > > > > by the 1000's, so it is a manageable job. Even so, if solr
>>> > > > > > > > > returned thousands of results, it would be up to the
>>> > > > > > > > > implementation of the api to augment only the results that
>>> > > > > > > > > needed to be returned to the front-end.
>>> > > > > > > > >
>>> > > > > > > > > Even so, patching up a JSON structure with 10000 results should
>>> > > > > > > > > be possible.
>>> > > > > > > > >
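The post-query augmentation described above can be sketched as follows, in Python. The names `augment_page` and `reads_by_id`, the `reads` field and the default of 0 for unknown IDs are assumptions for illustration; the external store is modelled as a plain dictionary. Ranking is left to the search engine, and only the page that is actually returned gets patched.

```python
from typing import Dict, List


def augment_page(docs: List[dict], reads_by_id: Dict[str, int],
                 start: int = 0, rows: int = 10) -> List[dict]:
    """Return only the requested page of `docs`, each copied and extended
    with the current read count from the external store."""
    page = docs[start:start + rows]
    return [{**doc, "reads": reads_by_id.get(doc["id"], 0)} for doc in page]


if __name__ == "__main__":
    results = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    print(augment_page(results, {"a": 5, "c": 7}, start=1, rows=2))
```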
>>> > > > > > > > >
>>> > > > > > > > > > I'd really appreciate it if you comment on the questions above.
>>> > > > > > > > > > PS: It's time to pitch: how much can
>>> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
>>> > > > > > > > > > ExternalFileField" help you?
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > It looks very interesting :) Does it make it possible to avoid
>>> > > > > > > > > re-reading the EFF on every commit, and only re-read the values
>>> > > > > > > > > that have actually changed?
>>> > > > > > > > >
>>> > > > > > > > > /Martin
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>

Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

Mikhail

I haven't experimented further yet. I think that the previous experiment of
issuing a commit to a specific core proved that all cores get the commit,
so I don't think that this approach will work.

Thanks,
/Martin

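For reference, the two explicit commit requests compared in this thread differ only in the URL path. A small Python sketch, assuming the host and port from the curl example quoted below and using `coreN` as a placeholder core name; `commit_url` is a hypothetical helper, and in the experiment reported here both forms ended up committing on every core.

```python
from typing import Optional
from urllib.parse import urlencode


def commit_url(base: str, core: Optional[str] = None) -> str:
    """Build the update URL for an explicit commit.

    With `core=None` the request goes to the default update handler;
    otherwise it is addressed to the named core.
    """
    root = base.rstrip("/")
    path = f"{root}/{core}/update" if core else f"{root}/update"
    return f"{path}?{urlencode({'commit': 'true'})}"


if __name__ == "__main__":
    print(commit_url("http://localhost:8080/solr"))
    print(commit_url("http://localhost:8080/solr", "coreN"))
```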

On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Martin,
>
> It's still not clear to me whether you solved the problem completely or
> partially:
> Does reducing the number of cores free some resources for searching during
> commit?
> Does committing the cores one by one prevent the "freeze"?
>
> Thanks
>
>
> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <ma...@issuu.com> wrote:
>
>> Mikhail
>>
>> To avoid freezes we deployed the patches that are now on the 4.1 trunk
>> (bug 3985). But this wasn't good enough, because SOLR would still take
>> very long to restart when that was necessary.
>>
>> I don't see how we could throw more hardware at the problem without making
>> it worse, really - the only solution here would be *fewer* shards, not
>> more.
>>
>> IMO it would be ideal if the lucene/solr community could come up with a
>> good way of updating fields in a document without reindexing. This could
>> be by linking to some external data store, or in the lucene/solr
>> internals. If it would make things easier, a good first step would be to
>> have dynamically updateable numerical fields only.
>>
>> /Martin
>>
>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
>> mkhludnev@griddynamics.com> wrote:
>>
>> > Martin,
>> >
>> > I don't think solrconfig.xml will shed any light on it. I've just found
>> > what I didn't get in your setup - how to explicitly assign a core to a
>> > collection. Now I understand most of the details after all!
>> > The ball is on your side: let us know whether you have managed to get
>> > your cores to commit one by one to avoid the freeze, or whether you
>> > could eliminate pauses by allocating more hardware.
>> > Thanks in advance!
>> >
>> >
>> > On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <ma...@issuu.com> wrote:
>> >
>> > > Mikhail,
>> > >
>> > > PSB
>> > >
>> > > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
>> > > mkhludnev@griddynamics.com> wrote:
>> > >
>> > > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com>
>> wrote:
>> > > >
>> > > > >
>> > > > > I wasn't aware until now that it is possible to send a commit to
>> > > > > one core only. What we observed was the effect of curl
>> > > > > localhost:8080/solr/update?commit=true but perhaps we should
>> > > > > experiment with solr/coreN/update?commit=true. A quick trial run
>> > > > > seems to indicate that a commit to a single core causes commits on
>> > > > > all cores.
>> > > > >
>> > > > You should see something like this in the log:
>> > > > ... SolrCmdDistributor .... Distrib commit to: ...
>> > > >
>> > > Yup, a commit towards a single core results in a commit on all cores.
>> > >
>> > >
>> > > > >
>> > > > >
>> > > > > Perhaps I should clarify that we are using SOLR as a black box; we
>> > > > > do not touch the code at all - we only install the distribution WAR
>> > > > > file and proceed from there.
>> > > > >
>> > > > I still don't understand how you deploy/launch Solr. How many jettys
>> > > > do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you
>> > > > specify the shards= param for every request and distribute updates
>> > > > yourself? What collections do you create, and with which settings?
>> > > >
>> > > We let SOLR do the sharding using one collection with 16 SOLR cores
>> > > holding one shard each. We launch only one instance of jetty with the
>> > > following arguments:
>> > >
>> > > -DnumShards=16
>> > > -DzkHost=<zookeeperhost:port>
>> > > -Xmx10G
>> > > -Xms10G
>> > > -Xmn2G
>> > > -server
>> > >
>> > > Would you like to see the solrconfig.xml?
>> > >
>> > > /Martin
>> > >
>> > >
>> > > > >
>> > > > >
>> > > > > > Also from my POV such deployments should start at least from *16*
>> > > > > > 4-way vboxes, it's more expensive, but should be much better
>> > > > > > available during cpu-consuming operations.
>> > > > > >
>> > > > >
>> > > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4
>> > > > > hosts with 16 cores? Or am I misunderstanding something :) ?
>> > > > >
>> > > > I prefer to start from 16 hosts with 4 cores each.
>> > > >
>> > > >
>> > > > >
>> > > > >
>> > > > > > Other details: if you use a single jetty for all of them, are you
>> > > > > > sure that jetty's threadpool doesn't limit requests? Is it large
>> > > > > > enough?
>> > > > > > You have 60G and set -Xmx10G. Are you sure that the total size of
>> > > > > > the cores' index directories is less than 45G?
>> > > > > >
>> > > > > The total index size is 230 GB, so it won't fit in ram, but we're
>> > > > > using an SSD disk to minimize disk access time. We have tried
>> > > > > putting the EFF onto a ram disk, but this didn't have a measurable
>> > > > > effect.
>> > > > > Thanks,
>> > > > > /Martin
>> > > > >
>> > > > >
>> > > > > > Thanks
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com>
>> > wrote:
>> > > > > >
>> > > > > > > Mikhail
>> > > > > > >
>> > > > > > > PSB
>> > > > > > >
>> > > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
>> > > > > > > mkhludnev@griddynamics.com> wrote:
>> > > > > > >
>> > > > > > > > Martin,
>> > > > > > > >
>> > > > > > > > Please find additional question from me below.
>> > > > > > > >
>> > > > > > > > Simone,
>> > > > > > > >
>> > > > > > > > I'm sorry for hijacking your thread. The only thing I've heard
>> > > > > > > > about it at recent ApacheCon sessions is that Zookeeper is
>> > > > > > > > supposed to replicate those files as configs under solr home.
>> > > > > > > > And I'm really looking forward to knowing how it works with
>> > > > > > > > huge files in production.
>> > > > > > > >
>> > > > > > > > Thank You, Guys!
>> > > > > > > >
>> > > > > > > > 20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com>
>> > > > написал:
>> > > > > > > > >
>> > > > > > > > > Hi Mikhail
>> > > > > > > > >
>> > > > > > > > > Please see answers below.
>> > > > > > > > >
>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>> > > > > > > > > mkhludnev@griddynamics.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Martin,
>> > > > > > > > > >
>> > > > > > > > > > Thank you for telling your own "war-story". It's really
>> > > > > > > > > > useful for the community.
>> > > > > > > > > > The first question might seem naive, but would you tell me
>> > > > > > > > > > what blocks searching during EFF reload, when it's triggered
>> > > > > > > > > > by a handler or by a listener?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > We continuously index new documents using CommitWithin to get
>> > > > > > > > > regular commits. However, we observed that the EFFs were not
>> > > > > > > > > re-read, so we had to do external commits
>> > > > > > > > > (curl '.../solr/update?commit=true') to force reload. When
>> > > > > > > > > this is done, solr blocks. I can't tell you exactly why it's
>> > > > > > > > > doing that (it was related to SOLR-3985).
>> > > > > > > >
>> > > > > > > > Is there a chance to get a thread dump when they are blocked?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > Well I could try to recreate the situation. But the setup is
>> > > > > > > fairly simple: Create a large EFF in a largeish index with many
>> > > > > > > shards. Issue a commit, and then try to do a search. Solr will
>> > > > > > > not respond to the search before the commit has completed, and
>> > > > > > > this will take a long time.
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > I don't really get the sentence about sequential commits and
>> > > > > > > > > > number of cores. Do I get right that file is replicated via
>> > > > > > > > > > Zookeeper? Doesn't it
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Again, this is observed behavior. When we issue a commit on a
>> > > > > > > > > system with many solr cores using EFFs, the system blocks for
>> > > > > > > > > a long time (15 minutes).  We do NOT use zookeeper for
>> > > > > > > > > anything. The EFF is a symlink from each core's index dir to
>> > > > > > > > > the actual file, which is updated by an external process.
>> > > > > > > >
>> > > > > > > > Hold on, I asked about Zookeeper because the subj mentions
>> > > > > > > > SolrCloud.
>> > > > > > > >
>> > > > > > > > Do you use SolrCloud, SolrShards, or are these cores just
>> > > > > > > > replicas of the same index?
>> > > > > > > >
>> > > > > > >
>> > > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
>> > > > > > > I'm a bit unsure about the terminology here, but we've got a
>> > > > > > > single index divided into 16 shards. Each shard is hosted in a
>> > > > > > > solr core.
>> > > > > > >
>> > > > > > >
>> > > > > > > > Also, about the symlink - Don't you share that file via some NFS?
>> > > > > > > >
>> > > > > > > No, we generate the EFF on the local solr host (there is only
>> > > > > > > one physical host that holds all shards), so there is no need
>> > > > > > > for NFS or copying files around. No need for Zookeeper either.
>> > > > > > >
>> > > > > > >
>> > > > > > > > How many cores do you run per box?
>> > > > > > > >
>> > > > > > > This box is a 16-virtual core (8 hyperthreaded cores)  with 60GB
>> > > > > > > of RAM. We run 16 solr cores on this box in Jetty.
>> > > > > > >
>> > > > > > >
>> > > > > > > > Do the boxes have plenty of RAM to cache the filesystem besides
>> > > > > > > > the JVM heaps?
>> > > > > > > >
>> > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the
>> > > > > > > OS.
>> > > > > > >
>> > > > > > >
>> > > > > > > > I assume you use 64 bit linux and mmap directory. Please
>> > > > > > > > confirm that.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > We use 64-bit linux. I'm not sure about the mmap directory or
>> > > > > > > where that would be configured in solr - can you explain that?
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > causes scalability problem or long time to reload? Will it
>> > > > > > > > > > help if we'll have, let's say ExternalDatabaseField which
>> > > > > > > > > > will pull values from jdbc. ie.
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I think the possibility of having some fields being retrieved
>> > > > > > > > > from an external, dynamically updatable store would be really
>> > > > > > > > > interesting. This could be JDBC, something in-memory like
>> > > > > > > > > redis, or a NoSql product (e.g. Cassandra).
>> > > > > > > >
>> > > > > > > > Ok. Let's have it in mind as a possible direction.
>> > > > > > > >
>> > > > > > >
>> > > > > > > Alternatively, an API that would allow updating a single field
>> > > > > > > for a document might be an option.
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > why all cores can't read these values simultaneously?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Again, this is a solr implementation detail that I can't
>> > answer
>> > > > :)
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Can you confirm that IDs in the file is ordered by the
>> > index
>> > > > term
>> > > > > > > > order?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > AFAIK it can impact load time.
>> > > > > > > > > >
>> > > > > > > > > Yes, it does
>> > > > > > > >
>> > > > > > > > OK, I understand that you are aware of it, and that your IDs are just
>> > > > > > > > strings, not integers.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > Yes, ids are strings.
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Regarding your post-query solution can you tell me if
>> query
>> > > > found
>> > > > > > > 10000
>> > > > > > > > > > docs, but I need to display only first page with 100
>> rows,
>> > > > > whether
>> > > > > > I
>> > > > > > > > need
>> > > > > > > > > > to pull all 10K results to frontend to order them by the
>> > > rank?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > In our architecture, the clients query an API that
>> generates
>> > > the
>> > > > > SOLR
>> > > > > > > > > query, retrieves the relevant additional fields that we
>> > needs,
>> > > > and
>> > > > > > > > returns
>> > > > > > > > > the relevant JSON to the front-end.
>> > > > > > > > >
>> > > > > > > > > In our use case, results are returned from SOLR by the
>> 10's,
>> > > not
>> > > > by
>> > > > > > the
>> > > > > > > > > 1000's, so it is a manageable job. Even so, if solr
>> returned
>> > > > > > thousands
>> > > > > > > of
>> > > > > > > > > results, it would be up to the implementation of the api
>> to
>> > > > augment
>> > > > > > > only
>> > > > > > > > > the results that needed to be returned to the front-end.
>> > > > > > > > >
>> > > > > > > > > Even so, patching up a JSON structure with 10000 results
>> > should
>> > > > be
>> > > > > > > > > possible.
>> > > > > > > >
>> > > > > > > > You are right. I'm concerned anyway, because retrieving the whole
>> > > > > > > > result is expensive and not always possible.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > In our case, getting the whole result is almost impossible,
>> > because
>> > > > > that
>> > > > > > > would be millions of documents, and returning the Nth result
>> > seems
>> > > to
>> > > > > be
>> > > > > > a
>> > > > > > > quadratic (or worse) operation in SOLR.
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > I'd really appreciate it if you could comment on the questions above.
>> > > > > > > > > > PS: It's time for a pitch: how much can
>> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
>> > > > > > > > > > ExternalFileField" help you?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > It looks very interesting :) Does it make it possible to
>> > > avoid
>> > > > > > > > re-reading
>> > > > > > > > > the EFF on every commit, and only re-read the values that
>> > have
>> > > > > > actually
>> > > > > > > > > changed?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > With SOLR-4085 you don't need a commit to reload the file content, but
>> > > > > > > > after a commit you still need to read the whole file and scan all key
>> > > > > > > > terms and postings. That's because the EFF sits on top of the top-level
>> > > > > > > > searcher; it's the Solr-like way. At some point in the future we might
>> > > > > > > > have a per-segment EFF; in that case adding a segment will still
>> > > > > > > > trigger a full file scan, but only the new segment will be scanned in
>> > > > > > > > the index, which should be faster. As you know, straightforward sharing
>> > > > > > > > of internal data structures between different index views/generations
>> > > > > > > > is not possible. If you are asking about applying delta changes to the
>> > > > > > > > external file, that's something we did ourselves: http://goo.gl/P8GFq .
>> > > > > > > > That feature is much more doubtful and vague, although it might be the
>> > > > > > > > next contribution after SOLR-4085.
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > /Martin
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
>> > mak@issuu.com>
>> > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give you what
>> > > > > > > > > > > you're hoping for.
>> > > > > > > > > > >
>> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
>> > > > > > > > > > >
>> > > > > > > > > > > The EFF is placed in the parent of the index
>> directory in
>> > > > each
>> > > > > > > core;
>> > > > > > > > each
>> > > > > > > > > > > core reads the entire EFF and picks out the IDs that
>> it
>> > is
>> > > > > > > > responsible
>> > > > > > > > > > for.
>> > > > > > > > > > >
>> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks
>> > (doesn't
>> > > > > answer
>> > > > > > > > > > queries)
>> > > > > > > > > > > while re-reading the EFF. Even worse, it seems that
>> the
>> > > time
>> > > > to
>> > > > > > > > re-read
>> > > > > > > > > > the
>> > > > > > > > > > > EFF is multiplied by the number of cores in use (i.e.
>> the
>> > > EFF
>> > > > > is
>> > > > > > > > re-read
>> > > > > > > > > > by
>> > > > > > > > > > > each core sequentially). The contents of the EFF
>> become
>> > > > active
>> > > > > > > after
>> > > > > > > > the
>> > > > > > > > > > > first EXTERNAL commit (commitWithin does NOT work
>> here)
>> > > after
>> > > > > the
>> > > > > > > > file
>> > > > > > > > > > has
>> > > > > > > > > > > been updated.
>> > > > > > > > > > >
>> > > > > > > > > > > In our case, the EFF was quite large - around 450MB -
>> and
>> > > we
>> > > > > use
>> > > > > > 16
>> > > > > > > > > > shards,
>> > > > > > > > > > > so when we triggered an external commit to force
>> > > re-reading,
>> > > > > the
>> > > > > > > > whole
>> > > > > > > > > > > system would block for several (10-15) minutes. This
>> > won't
>> > > > work
>> > > > > > in
>> > > > > > > a
>> > > > > > > > > > > production environment. The reason for the size of the
>> > EFF
>> > > is
>> > > > > > that
>> > > > > > > we
>> > > > > > > > > > have
>> > > > > > > > > > > around 7M documents in the index; each document has a
>> 45
>> > > > > > character
>> > > > > > > > ID.
>> > > > > > > > > > >
>> > > > > > > > > > > We got some help to try to fix the problem so that the
>> > > > re-read
>> > > > > of
>> > > > > > > the
>> > > > > > > > EFF
>> > > > > > > > > > > proceeds in the background (see
>> > > > > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985>
>> > for
>> > > > > > > > > > > a fix on the 4.1 branch). However, even though the
>> > re-read
>> > > > > > proceeds
>> > > > > > > > in
>> > > > > > > > > > the
>> > > > > > > > > > > background, the time required to launch solr now
>> takes at
>> > > > least
>> > > > > > as
>> > > > > > > > long
>> > > > > > > > > > as
>> > > > > > > > > > > re-reading the EFFs. Again, this is not good enough
>> for
>> > our
>> > > > > > needs.
>> > > > > > > > > > >
>> > > > > > > > > > > The next issue is that you cannot sort on EFF fields
>> > > (though
>> > > > > you
>> > > > > > > can
>> > > > > > > > > > return
>> > > > > > > > > > > them as values using &fl=field(my_eff_field). This is
>> > also
>> > > > > fixed
>> > > > > > in
>> > > > > > > > the
>> > > > > > > > > > 4.1
>> > > > > > > > > > > branch here <
>> > > https://issues.apache.org/jira/browse/SOLR-4022
>> > > > >.
>> > > > > > > > > > >
>> > > > > > > > > > > So: Even after these fixes, EFF performance is not
>> that
>> > > > great.
>> > > > > > Our
>> > > > > > > > > > solution
>> > > > > > > > > > > is as follows: The actual value of the popularity
>> measure
>> > > > (say,
>> > > > > > > > reads)
>> > > > > > > > > > that
>> > > > > > > > > > > we want to report to the user is inserted into the
>> search
>> > > > > > response
>> > > > > > > > > > > post-query by our query front-end. This value will
>> then
>> > be
>> > > > the
>> > > > > > > > > > > authoritative value at the time of the query. The
>> value
>> > of
>> > > > the
>> > > > > > > > popularity
>> > > > > > > > > > > measure that we use for boosting in the ranking of the
>> > > search
>> > > > > > > results
>> > > > > > > > is
>> > > > > > > > > > > only updated when the value has changed enough so that
>> > the
>> > > > > impact
>> > > > > > > on
>> > > > > > > > the
>> > > > > > > > > > > boost will be significant (say, more than 2%). This
>> does
>> > > > > require
>> > > > > > > > frequent
>> > > > > > > > > > > re-indexing of the documents that have significant
>> > changes
>> > > in
>> > > > > the
>> > > > > > > > number
>> > > > > > > > > > of
>> > > > > > > > > > > reads, but at least we won't have to update a
>> document if
>> > > it
>> > > > > > moves
>> > > > > > > > from,
>> > > > > > > > > > > say, 1000000 to 1000001 reads.
>> > > > > > > > > > >
>> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
>> > > > > > simoneg@apache.org
>> > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi all,
>> > > > > > > > > > > > I'm planning to move a quite big Solr index to
>> > SolrCloud.
>> > > > > > > However,
>> > > > > > > > in
>> > > > > > > > > > > this
>> > > > > > > > > > > > index, an external file field is used for popularity
>> > > > ranking.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Does SolrCloud supports external file fields? How
>> does
>> > it
>> > > > > cope
>> > > > > > > with
>> > > > > > > > > > > > sharding and replication? Where should the external
>> > file
>> > > be
>> > > > > > > placed
>> > > > > > > > now
>> > > > > > > > > > > that
>> > > > > > > > > > > > the index folder is not local but in the cloud?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Are there otherwise other best practices to deal
>> with
>> > the
>> > > > use
>> > > > > > > cases
>> > > > > > > > > > > > external file fields were used for, like
>> > > > popularity/ranking,
>> > > > > in
>> > > > > > > > > > > SolrCloud?
>> > > > > > > > > > > > Custom ValueSources going to something external?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks in advance,
>> > > > > > > > > > > > Simone
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > --
>> > > > > > > > > > Sincerely yours
>> > > > > > > > > > Mikhail Khludnev
>> > > > > > > > > > Principal Engineer,
>> > > > > > > > > > Grid Dynamics
>> > > > > > > > > >
>> > > > > > > > > > <http://www.griddynamics.com>
>> > > > > > > > > >  <mk...@griddynamics.com>
>> > > > > > > > > >
>> > > > > > > >  20.11.2012 18:06 пользователь "Martin Koch" <mak@issuu.com
>> >
>> > > > > написал:
>> > > > > > > >
>> > > > > > > > > Hi Mikhail
>> > > > > > > > >
>> > > > > > > > > Please see answers below.
>> > > > > > > > >
>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>> > > > > > > > > mkhludnev@griddynamics.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Martin,
>> > > > > > > > > >
>> > > > > > > > > > Thank you for telling your own "war-story". It's really
>> > > useful
>> > > > > for
>> > > > > > > > > > community.
>> > > > > > > > > > The first question might seems not really conscious, but
>> > > would
>> > > > > you
>> > > > > > > tell
>> > > > > > > > > me
>> > > > > > > > > > what blocks searching during EFF reload, when it's
>> > triggered
>> > > by
>> > > > > > > handler
>> > > > > > > > > or
>> > > > > > > > > > by listener?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > We continuously index new documents using CommitWithin to
>> get
>> > > > > regular
>> > > > > > > > > commits. However, we observed that the EFFs were not
>> re-read,
>> > > so
>> > > > we
>> > > > > > had
>> > > > > > > > to
>> > > > > > > > > do external commits (curl '.../solr/update?commit=true')
>> to
>> > > force
>> > > > > > > reload.
>> > > > > > > > > When this is done, solr blocks. I can't tell you exactly
>> why
>> > > it's
>> > > > > > doing
>> > > > > > > > > that (it was related to SOLR-3985).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > I don't really get the sentence about sequential commits
>> > and
>> > > > > number
>> > > > > > > of
>> > > > > > > > > > cores. Do I get right that file is replicated via
>> > Zookeeper?
>> > > > > > Doesn't
>> > > > > > > it
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Again, this is observed behavior. When we issue a commit
>> on a
>> > > > > system
>> > > > > > > > with a
>> > > > > > > > > system with many solr cores using EFFs, the system blocks
>> > for a
>> > > > > long
>> > > > > > > time
>> > > > > > > > > (15 minutes).  We do NOT use zookeeper for anything. The
>> EFF
>> > > is a
>> > > > > > > symlink
>> > > > > > > > > from each cores index dir to the actual file, which is
>> > updated
>> > > by
>> > > > > an
>> > > > > > > > > external process.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > causes scalability problem or long time to reload? Will
>> it
>> > > help
>> > > > > if
>> > > > > > > > we'll
>> > > > > > > > > > have, let's say ExternalDatabaseField which will pull
>> > values
>> > > > from
>> > > > > > > jdbc.
>> > > > > > > > > ie.
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I think the possibility of having some fields being
>> retrieved
>> > > > from
>> > > > > an
>> > > > > > > > > external, dynamically updatable store would be really
>> > > > interesting.
>> > > > > > This
>> > > > > > > > > could be JDBC, something in-memory like redis, or a NoSql
>> > > product
>> > > > > > (e.g.
>> > > > > > > > > Cassandra).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > why all cores can't read these values simultaneously?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Again, this is a solr implementation detail that I can't
>> > answer
>> > > > :)
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Can you confirm that IDs in the file is ordered by the
>> > index
>> > > > term
>> > > > > > > > order?
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > AFAIK it can impact load time.
>> > > > > > > > > >
>> > > > > > > > > Yes, it does.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Regarding your post-query solution can you tell me if
>> query
>> > > > found
>> > > > > > > 10000
>> > > > > > > > > > docs, but I need to display only first page with 100
>> rows,
>> > > > > whether
>> > > > > > I
>> > > > > > > > need
>> > > > > > > > > > to pull all 10K results to frontend to order them by the
>> > > rank?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > In our architecture, the clients query an API that
>> generates
>> > > the
>> > > > > SOLR
>> > > > > > > > > query, retrieves the relevant additional fields that we
>> > needs,
>> > > > and
>> > > > > > > > returns
>> > > > > > > > > the relevant JSON to the front-end.
>> > > > > > > > >
>> > > > > > > > > In our use case, results are returned from SOLR by the
>> 10's,
>> > > not
>> > > > by
>> > > > > > the
>> > > > > > > > > 1000's, so it is a manageable job. Even so, if solr
>> returned
>> > > > > > thousands
>> > > > > > > of
>> > > > > > > > > results, it would be up to the implementation of the api
>> to
>> > > > augment
>> > > > > > > only
>> > > > > > > > > the results that needed to be returned to the front-end.
>> > > > > > > > >
>> > > > > > > > > Even so, patching up a JSON structure with 10000 results
>> > should
>> > > > be
>> > > > > > > > > possible.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > I'm really appreciate if you comment on the questions
>> > above.
>> > > > > > > > > > PS: It's time to pitch, how much
>> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085
>> "Commit-free
>> > > > > > > > > > ExternalFileField" can help you?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > It looks very interesting :) Does it make it possible to
>> > > avoid
>> > > > > > > > re-reading
>> > > > > > > > > the EFF on every commit, and only re-read the values that
>> > have
>> > > > > > actually
>> > > > > > > > > changed?
>> > > > > > > > >
>> > > > > > > > > /Martin
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>
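The update rule described in Martin's quoted message above (re-index a document only once its read count has moved far enough to change the boost noticeably, "say, more than 2%") can be sketched as follows. This is a minimal illustration, not the code used at ISSUU: it assumes the relative change of the raw read count is an acceptable proxy for the change in boost, and the function names, the dict-based inputs and the 0.02 default are all hypothetical.

```python
def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """Return True when the counter drifted by more than `threshold` (relative)."""
    if indexed_reads == 0:
        # Assumption: any reads on a document indexed with zero reads are significant.
        return current_reads > 0
    change = abs(current_reads - indexed_reads) / indexed_reads
    return change > threshold


def select_for_reindex(indexed, current, threshold=0.02):
    """Return the sorted IDs whose live counter justifies re-indexing.

    `indexed` maps document ID -> value stored in the index,
    `current` maps document ID -> live value from the external counter store.
    """
    return sorted(
        doc_id
        for doc_id, reads in current.items()
        if needs_reindex(indexed.get(doc_id, 0), reads, threshold)
    )
```

With this rule a document moving from 1000000 to 1000001 reads is left alone, while one moving from 100 to 110 reads is queued for re-indexing; the authoritative value shown to the user would still be patched into the response after the query.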

Re: SolrCloud and exernal file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Martin,

It's still not clear to me whether you solved the problem completely or only
partially:
Does reducing the number of cores free up some resources for searching during
a commit?
Does committing the cores one by one prevent the "freeze"?

Thanks
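
One way to check the second question empirically would be to time an explicit commit against each core in turn. The sketch below is only an illustration under stated assumptions: the base URL follows the localhost:8080/solr address quoted further down, the core names are hypothetical, and the per-core URL form is the solr/coreN/update?commit=true one mentioned in the quoted discussion. Since a commit sent to a single core was observed there to be distributed to all cores, each timing may simply reflect a cluster-wide commit.

```python
import time
import urllib.request


def commit_url(base_url, core):
    """Build the explicit-commit URL for one core (URL layout assumed from the thread)."""
    return f"{base_url.rstrip('/')}/{core}/update?commit=true"


def commit_one_by_one(base_url, cores, send=None, clock=time.monotonic):
    """Commit each core in turn and return a list of (core, seconds) pairs.

    `send` and `clock` are injectable so the helper can be tried without a live Solr.
    """
    if send is None:
        def send(url):
            # Blocks until Solr has answered the commit request.
            with urllib.request.urlopen(url) as response:
                response.read()
    timings = []
    for core in cores:
        started = clock()
        send(commit_url(base_url, core))
        timings.append((core, clock() - started))
    return timings
```

Calling commit_one_by_one("http://localhost:8080/solr", ["core0", "core1"]) against a live instance issues real commits, so it is best tried on a test system first.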


On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <ma...@issuu.com> wrote:

> Mikhail
>
> To avoid freezes we deployed the patches that are now on the 4.1 trunk
> (SOLR-3985). But this wasn't good enough, because SOLR would still take a very
> long time to restart when that was necessary.
>
> I don't see how we could throw more hardware at the problem without making
> it worse, really - the only solution here would be *fewer* shards, not
> more.
>
> IMO it would be ideal if the lucene/solr community could come up with a
> good way of updating fields in a document without reindexing. This could be
> by linking to some external data store, or in the lucene/solr internals. If
> it would make things easier, a good first step would be to have dynamically
> updateable numerical fields only.
>
> /Martin
>
> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
> > Martin,
> >
> > I don't think solrconfig.xml would shed any more light on this. I've just
> > found what I didn't get in your setup: how cores are explicitly assigned to
> > the collection. Now I understand most of the details after all!
> > The ball is in your court: let us know whether you have managed to make your
> > cores commit one by one to avoid the freeze, or whether you could eliminate
> > the pauses by allocating more hardware.
> > Thanks in advance!
> >
> >
> > On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <ma...@issuu.com> wrote:
> >
> > > Mikhail,
> > >
> > > PSB
> > >
> > > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
> > > mkhludnev@griddynamics.com> wrote:
> > >
> > > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com> wrote:
> > > >
> > > > >
> > > > > I wasn't aware until now that it is possible to send a commit to
> one
> > > core
> > > > > only. What we observed was the effect of curl
> > > > > localhost:8080/solr/update?commit=true but perhaps we should
> > experiment
> > > > > with solr/coreN/update?commit=true. A quick trial run seems to
> > indicate
> > > > > that a commit to a single core causes commits on all cores.
> > > > >
> > > > You should see something like this in the log:
> > > > ... SolrCmdDistributor .... Distrib commit to: ...
> > > >
> > > Yup, a commit towards a single core results in a commit on all cores.
> > >
> > >
> > > > >
> > > > >
> > > > > Perhaps I should clarify that we are using SOLR as a black box; we
> do
> > > not
> > > > > touch the code at all - we only install the distribution WAR file
> and
> > > > > proceed from there.
> > > > >
> > > > I still don't understand how you deploy/launch Solr. How many jettys do
> > > > you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify
> > > > the shards= param for every request and distribute updates yourself? What
> > > > collections do you create and with which settings?
> > > >
> > > We let SOLR do the sharding using one collection with 16 SOLR cores
> > > holding one shard each. We launch only one instance of jetty with the
> > > following arguments:
> > >
> > > -DnumShards=16
> > > -DzkHost=<zookeeperhost:port>
> > > -Xmx10G
> > > -Xms10G
> > > -Xmn2G
> > > -server
> > >
> > > Would you like to see the solrconfig.xml?
> > >
> > > /Martin
> > >
> > >
> > > > >
> > > > >
> > > > > > Also, from my POV such deployments should start from at least *16*
> > > > > > 4-way vboxes. It's more expensive, but availability should be much
> > > > > > better during CPU-consuming operations.
> > > > > >
> > > > >
> > > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4
> hosts
> > > > with
> > > > > 16 cores? Or am I misunderstanding something :) ?
> > > > >
> > > > I prefer to start from 16 hosts with 4 cores each.
> > > >
> > > >
> > > > >
> > > > >
> > > > > > Other details, if you use single jetty for all of them, are you
> > sure
> > > > that
> > > > > > jetty's threadpool doesn't limit requests? is it large enough?
> > > > > > You have 60G and set -Xmx=10G. are you sure that total size of
> > cores
> > > > > index
> > > > > > directories is less than 45G?
> > > > > >
> > > > > > The total index size is 230 GB, so it won't fit in ram, but we're
> > > using
> > > > > an
> > > > > SSD disk to minimize disk access time. We have tried putting the
> EFF
> > > > onto a
> > > > > ram disk, but this didn't have a measurable effect.
> > > > >
> > > > > Thanks,
> > > > > /Martin
> > > > >
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com>
> > wrote:
> > > > > >
> > > > > > > Mikhail
> > > > > > >
> > > > > > > PSB
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > > >
> > > > > > > > Martin,
> > > > > > > >
> > > > > > > > Please find additional question from me below.
> > > > > > > >
> > > > > > > > Simone,
> > > > > > > >
> > > > > > > > I'm sorry for hijacking your thread. The only thing I've heard about
> > > > > > > > it at recent ApacheCon sessions is that Zookeeper is supposed to
> > > > > > > > replicate those files as configs under solr home. I'm really looking
> > > > > > > > forward to learning how that works with huge files in production.
> > > > > > > >
> > > > > > > > Thank You, Guys!
> > > > > > > >
> > > > > > > > 20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com>
> > > > написал:
> > > > > > > > >
> > > > > > > > > Hi Mikhail
> > > > > > > > >
> > > > > > > > > Please see answers below.
> > > > > > > > >
> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > > > > >
> > > > > > > > > > Martin,
> > > > > > > > > >
> > > > > > > > > > Thank you for telling your own "war-story". It's really
> > > useful
> > > > > for
> > > > > > > > > > community.
> > > > > > > > > > The first question might seems not really conscious, but
> > > would
> > > > > you
> > > > > > > tell
> > > > > > > > me
> > > > > > > > > > what blocks searching during EFF reload, when it's
> > triggered
> > > by
> > > > > > > handler
> > > > > > > > or
> > > > > > > > > > by listener?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > We continuously index new documents using CommitWithin to
> get
> > > > > regular
> > > > > > > > > commits. However, we observed that the EFFs were not
> re-read,
> > > so
> > > > we
> > > > > > had
> > > > > > > > to
> > > > > > > > > do external commits (curl '.../solr/update?commit=true') to
> > > force
> > > > > > > reload.
> > > > > > > > > When this is done, solr blocks. I can't tell you exactly
> why
> > > it's
> > > > > > doing
> > > > > > > > > that (it was related to SOLR-3985).
> > > > > > > >
> > > > > > > > Is there a chance to get a thread dump when they are blocked?
> > > > > > > >
> > > > > > > >
> > > > > > > Well I could try to recreate the situation. But the setup is
> > fairly
> > > > > > simple:
> > > > > > > Create a large EFF in a largeish index with many shards. Issue
> a
> > > > > commit,
> > > > > > > and then try to do a search. Solr will not respond to the
> search
> > > > before
> > > > > > the
> > > > > > > commit has completed, and this will take a long time.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > I don't really get the sentence about sequential commits and
> > > > > > > > > > the number of cores. Do I understand correctly that the file
> > > > > > > > > > is replicated via Zookeeper? Doesn't it
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Again, this is observed behavior. When we issue a commit on a
> > > > > > > > > system with many solr cores using EFFs, the system blocks for a
> > > > > > > > > long time (15 minutes). We do NOT use zookeeper for anything.
> > > > > > > > > The EFF is a symlink from each core's index dir to the actual
> > > > > > > > > file, which is updated by an external process.
> > > > > > > >
> > > > > > > > Hold on, I asked about Zookeeper because the subject mentions
> > > > > > > > SolrCloud.
> > > > > > > >
> > > > > > > > Do you use SolrCloud or SolrShards, or are these cores just
> > > > > > > > replicas of the same index?
> > > > > > > >
> > > > > > >
> > > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
> > > > > > > I'm a bit unsure about the terminology here, but we've got a single
> > > > > > > index divided into 16 shards. Each shard is hosted in a solr core.
> > > > > > >
> > > > > > >
> > > > > > > > Also, about the symlink - don't you share that file via some NFS?
> > > > > > > >
> > > > > > > No, we generate the EFF on the local solr host (there is only one
> > > > > > > physical host that holds all shards), so there is no need for NFS
> > > > > > > or copying files around. No need for Zookeeper either.
> > > > > > >
> > > > > > >
> > > > > > > > How many cores do you run per box?
> > > > > > > >
> > > > > > > This box has 16 virtual cores (8 hyperthreaded cores) and 60GB of
> > > > > > > RAM. We run 16 solr cores on this box in Jetty.
> > > > > > >
> > > > > > >
> > > > > > > > Do the boxes have plenty of RAM to cache the filesystem besides
> > > > > > > > the JVM heaps?
> > > > > > > >
> > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the OS.
> > > > > > >
> > > > > > >
> > > > > > > > I assume you use 64 bit linux and mmap directory. Please confirm
> > > > > > > > that.
> > > > > > > >
> > > > > > > >
> > > > > > > We use 64-bit linux. I'm not sure about the mmap directory or where
> > > > > > > that would be configured in solr - can you explain that?
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > cause scalability problems or a long reload time? Would it
> > > > > > > > > > help if we had, let's say, an ExternalDatabaseField which
> > > > > > > > > > pulls values from JDBC?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think the possibility of having some fields being retrieved
> > > > > > > > > from an external, dynamically updatable store would be really
> > > > > > > > > interesting. This could be JDBC, something in-memory like
> > > > > > > > > redis, or a NoSql product (e.g. Cassandra).
> > > > > > > >
> > > > > > > > Ok. Let's have it in mind as a possible direction.
> > > > > > > >
> > > > > > >
> > > > > > > Alternatively, an API that would allow updating a single field for
> > > > > > > a document might be an option.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > why all cores can't read these values simultaneously?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Again, this is a solr implementation detail that I can't
> > answer
> > > > :)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Can you confirm that the IDs in the file are ordered by the
> > > > > > > > > > index term order?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > AFAIK it can impact load time.
> > > > > > > > > >
> > > > > > > > > Yes, it does
> > > > > > > >
> > > > > > > > OK, I see that you are aware of it, and that your IDs are just
> > > > > > > > strings, not integers.
> > > > > > > >
> > > > > > > >
> > > > > > > Yes, ids are strings.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Regarding your post-query solution: if a query finds 10000
> > > > > > > > > > docs but I need to display only the first page with 100 rows,
> > > > > > > > > > do I need to pull all 10K results to the frontend to order
> > > > > > > > > > them by rank?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > In our architecture, the clients query an API that generates
> > > > > > > > > the SOLR query, retrieves the relevant additional fields that
> > > > > > > > > we need, and returns the relevant JSON to the front-end.
> > > > > > > > >
> > > > > > > > > In our use case, results are returned from SOLR by the 10's,
> > > > > > > > > not by the 1000's, so it is a manageable job. Even so, if solr
> > > > > > > > > returned thousands of results, it would be up to the
> > > > > > > > > implementation of the api to augment only the results that
> > > > > > > > > needed to be returned to the front-end.
> > > > > > > > >
> > > > > > > > > Even so, patching up a JSON structure with 10000 results should
> > > > > > > > > be possible.
> > > > > > > >
> > > > > > > > You are right. I'm concerned anyway, because retrieving the whole
> > > > > > > > result is expensive, and not always possible.
> > > > > > > >
> > > > > > > >
> > > > > > > In our case, getting the whole result is almost impossible, because
> > > > > > > that would be millions of documents, and returning the Nth result
> > > > > > > seems to be a quadratic (or worse) operation in SOLR.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > I'd really appreciate it if you could comment on the
> > > > > > > > > > questions above.
> > > > > > > > > > PS: It's time to pitch: how much can
> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > > > > > ExternalFileField" help you?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > It looks very interesting :) Does it make it possible to avoid
> > > > > > > > > re-reading the EFF on every commit, and only re-read the values
> > > > > > > > > that have actually changed?
> > > > > > > >
> > > > > > > >
> > > > > > > > You don't need a commit (in SOLR-4085) to reload the file
> > > > > > > > content, but after a commit you need to read the whole file and
> > > > > > > > scan all key terms and postings. That's because the EFF sits on
> > > > > > > > top of the top-level searcher; it's the Solr-like way.
> > > > > > > > In the future we might have a per-segment EFF. In that case
> > > > > > > > adding a segment will still trigger a full file scan, but in the
> > > > > > > > index only the new segment will be scanned. It should be faster.
> > > > > > > > You know, straightforward sharing of internal data structures
> > > > > > > > between different index views/generations is not possible.
> > > > > > > > If you are asking about applying delta changes to the external
> > > > > > > > file, that's something we did ourselves: http://goo.gl/P8GFq .
> > > > > > > > This feature is much more doubtful and vague, although it might
> > > > > > > > be the next contribution after SOLR-4085.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > /Martin
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <mak@issuu.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give you
> > > > > > > > > > > what you're hoping for.
> > > > > > > > > > >
> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > > > > >
> > > > > > > > > > > The EFF is placed in the parent of the index directory in
> > > > > > > > > > > each core; each core reads the entire EFF and picks out the
> > > > > > > > > > > IDs that it is responsible for.
> > > > > > > > > > >
> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
> > > > > > > > > > > answer queries) while re-reading the EFF. Even worse, it
> > > > > > > > > > > seems that the time to re-read the EFF is multiplied by the
> > > > > > > > > > > number of cores in use (i.e. the EFF is re-read by each
> > > > > > > > > > > core sequentially). The contents of the EFF become active
> > > > > > > > > > > after the first EXTERNAL commit (commitWithin does NOT work
> > > > > > > > > > > here) after the file has been updated.
> > > > > > > > > > >
> > > > > > > > > > > In our case, the EFF was quite large - around 450MB - and
> > > > > > > > > > > we use 16 shards, so when we triggered an external commit
> > > > > > > > > > > to force re-reading, the whole system would block for
> > > > > > > > > > > several (10-15) minutes. This won't work in a production
> > > > > > > > > > > environment. The reason for the size of the EFF is that we
> > > > > > > > > > > have around 7M documents in the index; each document has a
> > > > > > > > > > > 45 character ID.
> > > > > > > > > > >
> > > > > > > > > > > We got some help to try to fix the problem so that the
> > > > > > > > > > > re-read of the EFF proceeds in the background (see
> > > > > > > > > > > here <https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > > > > > > a fix on the 4.1 branch). However, even though the re-read
> > > > > > > > > > > proceeds in the background, launching solr now takes at
> > > > > > > > > > > least as long as re-reading the EFFs. Again, this is not
> > > > > > > > > > > good enough for our needs.
> > > > > > > > > > >
> > > > > > > > > > > The next issue is that you cannot sort on EFF fields
> > > > > > > > > > > (though you can return them as values using
> > > > > > > > > > > &fl=field(my_eff_field)). This is also fixed in the 4.1
> > > > > > > > > > > branch here
> > > > > > > > > > > <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > > > > > > >
> > > > > > > > > > > So: Even after these fixes, EFF performance is not that
> > > > > > > > > > > great. Our solution is as follows: The actual value of the
> > > > > > > > > > > popularity measure (say, reads) that we want to report to
> > > > > > > > > > > the user is inserted into the search response post-query by
> > > > > > > > > > > our query front-end. This value will then be the
> > > > > > > > > > > authoritative value at the time of the query. The value of
> > > > > > > > > > > the popularity measure that we use for boosting in the
> > > > > > > > > > > ranking of the search results is only updated when the
> > > > > > > > > > > value has changed enough so that the impact on the boost
> > > > > > > > > > > will be significant (say, more than 2%). This does require
> > > > > > > > > > > frequent re-indexing of the documents that have significant
> > > > > > > > > > > changes in the number of reads, but at least we won't have
> > > > > > > > > > > to update a document if it moves from, say, 1000000 to
> > > > > > > > > > > 1000001 reads.
> > > > > > > > > > >
> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni
> > > > > > > > > > > <simoneg@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > > > > > > > > > > However, in this index, an external file field is used
> > > > > > > > > > > > for popularity ranking.
> > > > > > > > > > > >
> > > > > > > > > > > > Does SolrCloud support external file fields? How does it
> > > > > > > > > > > > cope with sharding and replication? Where should the
> > > > > > > > > > > > external file be placed now that the index folder is not
> > > > > > > > > > > > local but in the cloud?
> > > > > > > > > > > >
> > > > > > > > > > > > Are there otherwise other best practices to deal with the
> > > > > > > > > > > > use cases external file fields were used for, like
> > > > > > > > > > > > popularity/ranking, in SolrCloud?
> > > > > > > > > > > > Custom ValueSources going to something external?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks in advance,
> > > > > > > > > > > > Simone
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Sincerely yours
> > > > > > > > > > Mikhail Khludnev
> > > > > > > > > > Principal Engineer,
> > > > > > > > > > Grid Dynamics
> > > > > > > > > >
> > > > > > > > > > <http://www.griddynamics.com>
> > > > > > > > > >  <mk...@griddynamics.com>
> > > > > > > > > >



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

Mikhail

To avoid freezes we deployed the patches that are now on the 4.1 trunk
(SOLR-3985). But this wasn't good enough, because SOLR would still take a
very long time to restart when that was necessary.

I don't see how we could throw more hardware at the problem without making
it worse, really - the only solution here would be *fewer* shards, not
more.

IMO it would be ideal if the lucene/solr community could come up with a
good way of updating fields in a document without reindexing. This could be
by linking to some external data store, or in the lucene/solr internals. If
it would make things easier, a good first step would be to have dynamically
updateable numerical fields only.
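
For reference, the workaround described earlier in this thread (re-index a
document only when its popularity has moved by more than about 2%, and patch
the authoritative value into the response after the query) could be sketched
roughly as below. This is only an illustration under assumptions: the function
names, the "id" and "reads" field names, and the in-memory dict standing in
for the external store are all hypothetical and are not part of Solr or of our
actual code.

```python
# Hypothetical sketch of the threshold-based re-index decision and the
# post-query augmentation step. Nothing here calls Solr; all names are
# illustrative assumptions.

def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """Return True when the relative change since the last indexed value
    exceeds the threshold (2% by default)."""
    if indexed_reads == 0:
        # No meaningful relative change from zero: re-index on any change.
        return current_reads != 0
    change = abs(current_reads - indexed_reads) / indexed_reads
    return change > threshold


def augment_results(docs, live_reads):
    """Return copies of the result docs with the authoritative read count
    patched in; docs missing from the store keep their indexed value."""
    return [
        {**doc, "reads": live_reads.get(doc["id"], doc.get("reads"))}
        for doc in docs
    ]
```

With this sketch, a document moving from 1000000 to 1000001 reads is not
re-indexed, while the exact count can still be shown to the user because it
is filled in only for the page of results that is actually returned.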

/Martin

On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Martin,
>
> I don't think solrconfig.xml will shed any light on this. I've just found
> what I didn't get in your setup - how to explicitly assign a core to a
> collection. Now I understand most of the details after all!
> The ball is in your court: let us know whether you have managed to get your
> cores to commit one by one to avoid the freeze, or whether you could
> eliminate the pauses by allocating more hardware.
> Thanks in advance!
>
>
> On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <ma...@issuu.com> wrote:
>
> > Mikhail,
> >
> > PSB
> >
> > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
> > mkhludnev@griddynamics.com> wrote:
> >
> > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com> wrote:
> > >
> > > >
> > > > I wasn't aware until now that it is possible to send a commit to one
> > > > core only. What we observed was the effect of curl
> > > > localhost:8080/solr/update?commit=true but perhaps we should
> > > > experiment with solr/coreN/update?commit=true. A quick trial run seems
> > > > to indicate that a commit to a single core causes commits on all cores.
> > > >
> > > You should see something like this in the log:
> > > ... SolrCmdDistributor .... Distrib commit to: ...
> > >
> > Yup, a commit towards a single core results in a commit on all cores.
> >
> >
> > > >
> > > >
> > > > Perhaps I should clarify that we are using SOLR as a black box; we do
> > > > not touch the code at all - we only install the distribution WAR file
> > > > and proceed from there.
> > > >
> > > I still don't understand how you deploy/launch Solr. How many Jetty
> > > instances do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or
> > > do you specify the shards= param for every request and distribute
> > > updates yourself? What collections do you create and with which
> > > settings?
> > >
> > We let SOLR do the sharding using one collection with 16 SOLR cores
> > holding one shard each. We launch only one instance of jetty with the
> > following arguments:
> >
> > -DnumShards=16
> > -DzkHost=<zookeeperhost:port>
> > -Xmx10G
> > -Xms10G
> > -Xmn2G
> > -server
> >
> > Would you like to see the solrconfig.xml?
> >
> > /Martin
> >
> >
> > > >
> > > >
> > > > > Also, from my POV such deployments should start with at least *16*
> > > > > 4-way vboxes. It's more expensive, but availability should be much
> > > > > better during cpu-consuming operations.
> > > > >
> > > >
> > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
> > > > with 16 cores? Or am I misunderstanding something :) ?
> > > >
> > > I prefer to start from 16 hosts with 4 cores each.
> > >
> > >
> > > >
> > > >
> > > > > Other details: if you use a single jetty for all of them, are you
> > > > > sure that jetty's threadpool doesn't limit requests? Is it large
> > > > > enough?
> > > > > You have 60G and set -Xmx=10G. Are you sure that the total size of
> > > > > the cores' index directories is less than 45G?
> > > > >
> > > > The total index size is 230 GB, so it won't fit in ram, but we're
> > > > using an SSD disk to minimize disk access time. We have tried putting
> > > > the EFF onto a ram disk, but this didn't have a measurable effect.
> > > >
> > > > Thanks,
> > > > /Martin
> > > >
> > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com>
> wrote:
> > > > >
> > > > > > Mikhail
> > > > > >
> > > > > > PSB
> > > > > >
> > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > >
> > > > > > > Martin,
> > > > > > >
> > > > > > > Please find additional question from me below.
> > > > > > >
> > > > > > > Simone,
> > > > > > >
> > > > > > > I'm sorry for hijacking your thread. The only what I've heard
> > about
> > > > it
> > > > > at
> > > > > > > recent ApacheCon sessions is that Zookeeper is supposed to
> > > replicate
> > > > > > those
> > > > > > > files as configs under solr home. And I'm really looking
> forward
> > to
> > > > > know
> > > > > > > how it works with huge files in production.
> > > > > > >
> > > > > > > Thank You, Guys!
> > > > > > >
> > > > > > > 20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com>
> > > написал:
> > > > > > > >
> > > > > > > > Hi Mikhail
> > > > > > > >
> > > > > > > > Please see answers below.
> > > > > > > >
> > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > > > >
> > > > > > > > > Martin,
> > > > > > > > >
> > > > > > > > > Thank you for telling your own "war-story". It's really
> > useful
> > > > for
> > > > > > > > > community.
> > > > > > > > > The first question might seems not really conscious, but
> > would
> > > > you
> > > > > > tell
> > > > > > > me
> > > > > > > > > what blocks searching during EFF reload, when it's
> triggered
> > by
> > > > > > handler
> > > > > > > or
> > > > > > > > > by listener?
> > > > > > > > >
> > > > > > > >
> > > > > > > > We continuously index new documents using CommitWithin to get
> > > > regular
> > > > > > > > commits. However, we observed that the EFFs were not re-read,
> > so
> > > we
> > > > > had
> > > > > > > to
> > > > > > > > do external commits (curl '.../solr/update?commit=true') to
> > force
> > > > > > reload.
> > > > > > > > When this is done, solr blocks. I can't tell you exactly why
> > it's
> > > > > doing
> > > > > > > > that (it was related to SOLR-3985).
> > > > > > >
> > > > > > > Is there a chance to get a thread dump when they are blocked?
> > > > > > >
> > > > > > >
> > > > > > Well I could try to recreate the situation. But the setup is
> fairly
> > > > > simple:
> > > > > > Create a large EFF in a largeish index with many shards. Issue a
> > > > commit,
> > > > > > and then try to do a search. Solr will not respond to the search
> > > before
> > > > > the
> > > > > > commit has completed, and this will take a long time.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > I don't really get the sentence about sequential commits
> and
> > > > number
> > > > > > of
> > > > > > > > > cores. Do I get right that file is replicated via
> Zookeeper?
> > > > > Doesn't
> > > > > > it
> > > > > > > > >
> > > > > > > >
> > > > > > > > Again, this is observed behavior. When we issue a commit on a
> > > > system
> > > > > > with
> > > > > > > a
> > > > > > > > system with many solr cores using EFFs, the system blocks
> for a
> > > > long
> > > > > > time
> > > > > > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF
> > is a
> > > > > > symlink
> > > > > > > > from each cores index dir to the actual file, which is
> updated
> > by
> > > > an
> > > > > > > > external process.
> > > > > > >
> > > > > > > Hold on, I asked about Zookeeper because the subj mentions
> > > SolrCloud.
> > > > > > >
> > > > > > > Do you use SolrCloud, SolrShards, or these cores are just
> > replicas
> > > of
> > > > > the
> > > > > > > same index?
> > > > > > >
> > > > > >
> > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
> > I'm
> > > a
> > > > > bit
> > > > > > unsure about the terminology here, but we've got a single index
> > > divided
> > > > > > into 16 shard. Each shard is hosted in a solr core.
> > > > > >
> > > > > >
> > > > > > > Also, about simlink - Don't you share that file via some NFS?
> > > > > > >
> > > > > > > No, we generate the EFF on the local solr host (there is only
> one
> > > > > > physical
> > > > > > host that holds all shards), so there is no need for NFS or
> copying
> > > > files
> > > > > > around. No need for Zookeeper either.
> > > > > >
> > > > > >
> > > > > > > how many cores you run per box?
> > > > > > >
> > > > > > This box is a 16-virtual core (8 hyperthreaded cores)  with 60GB
> of
> > > > RAM.
> > > > > We
> > > > > > run 16 solr cores on this box in Jetty.
> > > > > >
> > > > > >
> > > > > > > Do boxes has plenty of ram to cache filesystem beside of jvm
> > heaps?
> > > > > > >
> > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the
> > OS.
> > > > > >
> > > > > >
> > > > > > > I assume you use 64 bit linux and mmap directory. Please
> confirm
> > > > that.
> > > > > > >
> > > > > > >
> > > > > > We use 64-bit linux. I'm not sure about the mmap directory or
> where
> > > > that
> > > > > > would be configured in solr - can you explain that?
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > causes scalability problem or long time to reload? Will it
> > help
> > > > if
> > > > > > > we'll
> > > > > > > > > have, let's say ExternalDatabaseField which will pull
> values
> > > from
> > > > > > jdbc.
> > > > > > > ie.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I think the possibility of having some fields being retrieved
> > > from
> > > > an
> > > > > > > > external, dynamically updatable store would be really
> > > interesting.
> > > > > This
> > > > > > > > could be JDBC, something in-memory like redis, or a NoSql
> > product
> > > > > (e.g.
> > > > > > > > Cassandra).
> > > > > > >
> > > > > > > Ok. Let's have it in mind as a possible direction.
> > > > > > >
> > > > > >
> > > > > > Alternatively, an API that would allow updating a single field
> for
> > a
> > > > > > document might be an option.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > why all cores can't read these values simultaneously?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Again, this is a solr implementation detail that I can't
> answer
> > > :)
> > > > > > > >
> > > > > > > >
> > > > > > > > > Can you confirm that IDs in the file is ordered by the
> index
> > > term
> > > > > > > order?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > > > >
> > > > > > > >
> > > > > > > > > AFAIK it can impact load time.
> > > > > > > > >
> > > > > > > > Yes, it does
> > > > > > >
> > > > > > > Ok, I've got that you aware of it, and your IDs are just
> strings,
> > > not
> > > > > > > integers.
> > > > > > >
> > > > > > >
> > > > > > Yes, ids are strings.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Regarding your post-query solution can you tell me if query
> > > found
> > > > > > 10000
> > > > > > > > > docs, but I need to display only first page with 100 rows,
> > > > whether
> > > > > I
> > > > > > > need
> > > > > > > > > to pull all 10K results to frontend to order them by the
> > rank?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > In our architecture, the clients query an API that generates
> > the
> > > > SOLR
> > > > > > > > query, retrieves the relevant additional fields that we
> needs,
> > > and
> > > > > > > returns
> > > > > > > > the relevant JSON to the front-end.
> > > > > > > >
> > > > > > > > In our use case, results are returned from SOLR by the 10's,
> > not
> > > by
> > > > > the
> > > > > > > > 1000's, so it is a manageable job. Even so, if solr returned
> > > > > thousands
> > > > > > of
> > > > > > > > results, it would be up to the implementation of the api to
> > > augment
> > > > > > only
> > > > > > > > the results that needed to be returned to the front-end.
> > > > > > > >
> > > > > > > > Even so, patching up a JSON structure with 10000 results
> should
> > > be
> > > > > > > > possible.
> > > > > > >
> > > > > > > You are right. I'm concerned anyway because retrieving whole
> > result
> > > > is
> > > > > > > expensive, and not always possible.
> > > > > > >
> > > > > > >
> > > > > > In our case, getting the whole result is almost impossible,
> because
> > > > that
> > > > > > would be millions of documents, and returning the Nth result
> seems
> > to
> > > > be
> > > > > a
> > > > > > quadratic (or worse) operation in SOLR.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > I'm really appreciate if you comment on the questions
> above.
> > > > > > > > > PS: It's time to pitch, how much
> > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > > > > ExternalFileField" can help you?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > It looks very interesting :) Does it make it possible to
> > avoid
> > > > > > > re-reading
> > > > > > > > the EFF on every commit, and only re-read the values that
> have
> > > > > actually
> > > > > > > > changed?
> > > > > > >
> > > > > > >
> > > > > > > You don't need commit (in SOLR-4085) to reload file content,
> but
> > > > after
> > > > > > > commit you need to read whole file and scan all key terms and
> > > > postings.
> > > > > > > That's because EFF sits on top of top level searcher. it's a
> > > > Solr-like
> > > > > > way.
> > > > > > > In some future we might have per-segment EFF, in this case
> > adding a
> > > > > > segment
> > > > > > > will trigger full file scan, but in the index only that new
> > segment
> > > > > will
> > > > > > be
> > > > > > > scanned. It should be faster. You know, straightforward sharing
> > > > > internal
> > > > > > > data structures between different index views/generations is
> not
> > > > > > possible.
> > > > > > > If you are asking about applying delta changes on external file
> > > > that's
> > > > > > > something what we did ourselves http://goo.gl/P8GFq . This
> > feature
> > > > is
> > > > > > much
> > > > > > > more doubtful and vague, although it might be the next
> > contribution
> > > > > after
> > > > > > > SOLR-4085.
> > > > > > >
> > > > > > > >
> > > > > > > > /Martin
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
> mak@issuu.com>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give
> you
> > > > what
> > > > > > > you're
> > > > > > > > > > hoping for.
> > > > > > > > > >
> > > > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > > > >
> > > > > > > > > > The EFF is placed in the parent of the index directory in
> > > each
> > > > > > core;
> > > > > > > each
> > > > > > > > > > core reads the entire EFF and picks out the IDs that it
> is
> > > > > > > responsible
> > > > > > > > > for.
> > > > > > > > > >
> > > > > > > > > > In the current 4.0.0 release of solr, solr blocks
> (doesn't
> > > > answer
> > > > > > > > > queries)
> > > > > > > > > > while re-reading the EFF. Even worse, it seems that the
> > time
> > > to
> > > > > > > re-read
> > > > > > > > > the
> > > > > > > > > > EFF is multiplied by the number of cores in use (i.e. the
> > EFF
> > > > is
> > > > > > > re-read
> > > > > > > > > by
> > > > > > > > > > each core sequentially). The contents of the EFF become
> > > active
> > > > > > after
> > > > > > > the
> > > > > > > > > > first EXTERNAL commit (commitWithin does NOT work here)
> > after
> > > > the
> > > > > > > file
> > > > > > > > > has
> > > > > > > > > > been updated.
> > > > > > > > > >
> > > > > > > > > > In our case, the EFF was quite large - around 450MB - and
> > we
> > > > use
> > > > > 16
> > > > > > > > > shards,
> > > > > > > > > > so when we triggered an external commit to force
> > re-reading,
> > > > the
> > > > > > > whole
> > > > > > > > > > system would block for several (10-15) minutes. This
> won't
> > > work
> > > > > in
> > > > > > a
> > > > > > > > > > production environment. The reason for the size of the
> EFF
> > is
> > > > > that
> > > > > > we
> > > > > > > > > have
> > > > > > > > > > around 7M documents in the index; each document has a 45
> > > > > character
> > > > > > > ID.
> > > > > > > > > >
> > > > > > > > > > We got some help to try to fix the problem so that the
> > > re-read
> > > > of
> > > > > > the
> > > > > > > EFF
> > > > > > > > > > proceeds in the background (see
> > > > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985>
> for
> > > > > > > > > > a fix on the 4.1 branch). However, even though the
> re-read
> > > > > proceeds
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > background, the time required to launch solr now takes at
> > > least
> > > > > as
> > > > > > > long
> > > > > > > > > as
> > > > > > > > > > re-reading the EFFs. Again, this is not good enough for
> our
> > > > > needs.
> > > > > > > > > >
> > > > > > > > > > The next issue is that you cannot sort on EFF fields
> > (though
> > > > you
> > > > > > can
> > > > > > > > > return
> > > > > > > > > > them as values using &fl=field(my_eff_field)). This is
> also
> > > > fixed
> > > > > in
> > > > > > > the
> > > > > > > > > 4.1
> > > > > > > > > > branch here <
> > https://issues.apache.org/jira/browse/SOLR-4022
> > > >.
> > > > > > > > > >
> > > > > > > > > > So: Even after these fixes, EFF performance is not that
> > > great.
> > > > > Our
> > > > > > > > > solution
> > > > > > > > > > is as follows: The actual value of the popularity measure
> > > (say,
> > > > > > > reads)
> > > > > > > > > that
> > > > > > > > > > we want to report to the user is inserted into the search
> > > > > response
> > > > > > > > > > post-query by our query front-end. This value will then
> be
> > > the
> > > > > > > > > > authoritative value at the time of the query. The value
> of
> > > the
> > > > > > > popularity
> > > > > > > > > > measure that we use for boosting in the ranking of the
> > search
> > > > > > results
> > > > > > > is
> > > > > > > > > > only updated when the value has changed enough so that
> the
> > > > impact
> > > > > > on
> > > > > > > the
> > > > > > > > > > boost will be significant (say, more than 2%). This does
> > > > require
> > > > > > > frequent
> > > > > > > > > > re-indexing of the documents that have significant
> changes
> > in
> > > > the
> > > > > > > number
> > > > > > > > > of
> > > > > > > > > > reads, but at least we won't have to update a document if
> > it
> > > > > moves
> > > > > > > from,
> > > > > > > > > > say, 1000000 to 1000001 reads.
> > > > > > > > > >
> > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > > > >
> > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> > > > > simoneg@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > > I'm planning to move a quite big Solr index to
> SolrCloud.
> > > > > > However,
> > > > > > > in
> > > > > > > > > > this
> > > > > > > > > > > index, an external file field is used for popularity
> > > ranking.
> > > > > > > > > > >
> > > > > > > > > > > Does SolrCloud supports external file fields? How does
> it
> > > > cope
> > > > > > with
> > > > > > > > > > > sharding and replication? Where should the external
> file
> > be
> > > > > > placed
> > > > > > > now
> > > > > > > > > > that
> > > > > > > > > > > the index folder is not local but in the cloud?
> > > > > > > > > > >
> > > > > > > > > > > Are there otherwise other best practices to deal with
> the
> > > use
> > > > > > cases
> > > > > > > > > > > external file fields were used for, like
> > > popularity/ranking,
> > > > in
> > > > > > > > > > SolrCloud?
> > > > > > > > > > > Custom ValueSources going to something external?
> > > > > > > > > > >
> > > > > > > > > > > Thanks in advance,
> > > > > > > > > > > Simone
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Sincerely yours
> > > > > > > > > Mikhail Khludnev
> > > > > > > > > Principal Engineer,
> > > > > > > > > Grid Dynamics
> > > > > > > > >
> > > > > > > > > <http://www.griddynamics.com>
> > > > > > > > >  <mk...@griddynamics.com>
> > > > > > > > >
> > > > > > >  20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com>
> > > > написал:
> > > > > > >
> > > > > > > > Hi Mikhail
> > > > > > > >
> > > > > > > > Please see answers below.
> > > > > > > >
> > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > > > >
> > > > > > > > > Martin,
> > > > > > > > >
> > > > > > > > > Thank you for telling your own "war-story". It's really
> > useful
> > > > for
> > > > > > > > > community.
> > > > > > > > > The first question might seems not really conscious, but
> > would
> > > > you
> > > > > > tell
> > > > > > > > me
> > > > > > > > > what blocks searching during EFF reload, when it's
> triggered
> > by
> > > > > > handler
> > > > > > > > or
> > > > > > > > > by listener?
> > > > > > > > >
> > > > > > > >
> > > > > > > > We continuously index new documents using CommitWithin to get
> > > > regular
> > > > > > > > commits. However, we observed that the EFFs were not re-read,
> > so
> > > we
> > > > > had
> > > > > > > to
> > > > > > > > do external commits (curl '.../solr/update?commit=true') to
> > force
> > > > > > reload.
> > > > > > > > When this is done, solr blocks. I can't tell you exactly why
> > it's
> > > > > doing
> > > > > > > > that (it was related to SOLR-3985).
> > > > > > > >
> > > > > > > >
> > > > > > > > > I don't really get the sentence about sequential commits
> and
> > > > number
> > > > > > of
> > > > > > > > > cores. Do I get right that file is replicated via
> Zookeeper?
> > > > > Doesn't
> > > > > > it
> > > > > > > > >
> > > > > > > >
> > > > > > > > Again, this is observed behavior. When we issue a commit on a
> > > > system
> > > > > > > with a
> > > > > > > > system with many solr cores using EFFs, the system blocks
> for a
> > > > long
> > > > > > time
> > > > > > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF
> > is a
> > > > > > symlink
> > > > > > > > from each cores index dir to the actual file, which is
> updated
> > by
> > > > an
> > > > > > > > external process.
> > > > > > > >
> > > > > > > >
> > > > > > > > > causes scalability problem or long time to reload? Will it
> > help
> > > > if
> > > > > > > we'll
> > > > > > > > > have, let's say ExternalDatabaseField which will pull
> values
> > > from
> > > > > > jdbc.
> > > > > > > > ie.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I think the possibility of having some fields being retrieved
> > > from
> > > > an
> > > > > > > > external, dynamically updatable store would be really
> > > interesting.
> > > > > This
> > > > > > > > could be JDBC, something in-memory like redis, or a NoSql
> > product
> > > > > (e.g.
> > > > > > > > Cassandra).
> > > > > > > >
> > > > > > > >
> > > > > > > > > why all cores can't read these values simultaneously?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Again, this is a solr implementation detail that I can't
> answer
> > > :)
> > > > > > > >
> > > > > > > >
> > > > > > > > > Can you confirm that IDs in the file is ordered by the
> index
> > > term
> > > > > > > order?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > > > >
> > > > > > > >
> > > > > > > > > AFAIK it can impact load time.
> > > > > > > > >
> > > > > > > > Yes, it does.
> > > > > > > >
> > > > > > > >
> > > > > > > > > Regarding your post-query solution can you tell me if query
> > > found
> > > > > > 10000
> > > > > > > > > docs, but I need to display only first page with 100 rows,
> > > > whether
> > > > > I
> > > > > > > need
> > > > > > > > > to pull all 10K results to frontend to order them by the
> > rank?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > In our architecture, the clients query an API that generates
> > the
> > > > SOLR
> > > > > > > > query, retrieves the relevant additional fields that we
> needs,
> > > and
> > > > > > > returns
> > > > > > > > the relevant JSON to the front-end.
> > > > > > > >
> > > > > > > > In our use case, results are returned from SOLR by the 10's,
> > not
> > > by
> > > > > the
> > > > > > > > 1000's, so it is a manageable job. Even so, if solr returned
> > > > > thousands
> > > > > > of
> > > > > > > > results, it would be up to the implementation of the api to
> > > augment
> > > > > > only
> > > > > > > > the results that needed to be returned to the front-end.
> > > > > > > >
> > > > > > > > Even so, patching up a JSON structure with 10000 results
> should
> > > be
> > > > > > > > possible.
> > > > > > > >
> > > > > > > >
> > > > > > > > > I'm really appreciate if you comment on the questions
> above.
> > > > > > > > > PS: It's time to pitch, how much
> > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > > > > ExternalFileField" can help you?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > It looks very interesting :) Does it make it possible to
> > avoid
> > > > > > > re-reading
> > > > > > > > the EFF on every commit, and only re-read the values that
> have
> > > > > actually
> > > > > > > > changed?
> > > > > > > >
> > > > > > > > /Martin
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
> mak@issuu.com>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give
> you
> > > > what
> > > > > > > you're
> > > > > > > > > > hoping for.
> > > > > > > > > >
> > > > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > > > >
> > > > > > > > > > The EFF is placed in the parent of the index directory in
> > > each
> > > > > > core;
> > > > > > > > each
> > > > > > > > > > core reads the entire EFF and picks out the IDs that it
> is
> > > > > > > responsible
> > > > > > > > > for.
> > > > > > > > > >
> > > > > > > > > > In the current 4.0.0 release of solr, solr blocks
> (doesn't
> > > > answer
> > > > > > > > > queries)
> > > > > > > > > > while re-reading the EFF. Even worse, it seems that the
> > time
> > > to
> > > > > > > re-read
> > > > > > > > > the
> > > > > > > > > > EFF is multiplied by the number of cores in use (i.e. the
> > EFF
> > > > is
> > > > > > > > re-read
> > > > > > > > > by
> > > > > > > > > > each core sequentially). The contents of the EFF become
> > > active
> > > > > > after
> > > > > > > > the
> > > > > > > > > > first EXTERNAL commit (commitWithin does NOT work here)
> > after
> > > > the
> > > > > > > file
> > > > > > > > > has
> > > > > > > > > > been updated.
> > > > > > > > > >
> > > > > > > > > > In our case, the EFF was quite large - around 450MB - and
> > we
> > > > use
> > > > > 16
> > > > > > > > > shards,
> > > > > > > > > > so when we triggered an external commit to force
> > re-reading,
> > > > the
> > > > > > > whole
> > > > > > > > > > system would block for several (10-15) minutes. This
> won't
> > > work
> > > > > in
> > > > > > a
> > > > > > > > > > production environment. The reason for the size of the
> EFF
> > is
> > > > > that
> > > > > > we
> > > > > > > > > have
> > > > > > > > > > around 7M documents in the index; each document has a 45
> > > > > character
> > > > > > > ID.
> > > > > > > > > >
> > > > > > > > > > We got some help to try to fix the problem so that the
> > > re-read
> > > > of
> > > > > > the
> > > > > > > > EFF
> > > > > > > > > > proceeds in the background (see
> > > > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985>
> for
> > > > > > > > > > a fix on the 4.1 branch). However, even though the
> re-read
> > > > > proceeds
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > background, the time required to launch solr now takes at
> > > least
> > > > > as
> > > > > > > long
> > > > > > > > > as
> > > > > > > > > > re-reading the EFFs. Again, this is not good enough for
> our
> > > > > needs.
> > > > > > > > > >
> > > > > > > > > > The next issue is that you cannot sort on EFF fields
> > (though
> > > > you
> > > > > > can
> > > > > > > > > return
> > > > > > > > > > them as values using &fl=field(my_eff_field)). This is
> also
> > > > fixed
> > > > > in
> > > > > > > the
> > > > > > > > > 4.1
> > > > > > > > > > branch here <
> > https://issues.apache.org/jira/browse/SOLR-4022
> > > >.
> > > > > > > > > >
> > > > > > > > > > So: Even after these fixes, EFF performance is not that
> > > great.
> > > > > Our
> > > > > > > > > solution
> > > > > > > > > > is as follows: The actual value of the popularity measure
> > > (say,
> > > > > > > reads)
> > > > > > > > > that
> > > > > > > > > > we want to report to the user is inserted into the search
> > > > > response
> > > > > > > > > > post-query by our query front-end. This value will then
> be
> > > the
> > > > > > > > > > authoritative value at the time of the query. The value
> of
> > > the
> > > > > > > > popularity
> > > > > > > > > > measure that we use for boosting in the ranking of the
> > search
> > > > > > results
> > > > > > > > is
> > > > > > > > > > only updated when the value has changed enough so that
> the
> > > > impact
> > > > > > on
> > > > > > > > the
> > > > > > > > > > boost will be significant (say, more than 2%). This does
> > > > require
> > > > > > > > frequent
> > > > > > > > > > re-indexing of the documents that have significant
> changes
> > in
> > > > the
> > > > > > > > number
> > > > > > > > > of
> > > > > > > > > > reads, but at least we won't have to update a document if
> > it
> > > > > moves
> > > > > > > > from,
> > > > > > > > > > say, 1000000 to 1000001 reads.
> > > > > > > > > >
> > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > > > >
> > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> > > > > simoneg@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > > I'm planning to move a quite big Solr index to
> SolrCloud.
> > > > > > However,
> > > > > > > in
> > > > > > > > > > this
> > > > > > > > > > > index, an external file field is used for popularity
> > > ranking.
> > > > > > > > > > >
> > > > > > > > > > > Does SolrCloud supports external file fields? How does
> it
> > > > cope
> > > > > > with
> > > > > > > > > > > sharding and replication? Where should the external
> file
> > be
> > > > > > placed
> > > > > > > > now
> > > > > > > > > > that
> > > > > > > > > > > the index folder is not local but in the cloud?
> > > > > > > > > > >
> > > > > > > > > > > Are there otherwise other best practices to deal with
> the
> > > use
> > > > > > cases
> > > > > > > > > > > external file fields were used for, like
> > > popularity/ranking,
> > > > in
> > > > > > > > > > SolrCloud?
> > > > > > > > > > > Custom ValueSources going to something external?
> > > > > > > > > > >
> > > > > > > > > > > Thanks in advance,
> > > > > > > > > > > Simone
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Sincerely yours
> > > > > > > > > Mikhail Khludnev
> > > > > > > > > Principal Engineer,
> > > > > > > > > Grid Dynamics
> > > > > > > > >
> > > > > > > > > <http://www.griddynamics.com>
> > > > > > > > >  <mk...@griddynamics.com>
> > > > > > > > >
>
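
Martin Koch's workaround quoted in this thread (re-index a document only once its popularity has changed enough to move the boost by more than about 2%) can be sketched in a few lines. This is a minimal sketch, not ISSUU's implementation: the log-scaled boost function and the names needs_reindex and select_for_reindex are assumptions made for illustration; only the 2% threshold comes from the thread.

```python
import math


def boost(reads: int) -> float:
    """Hypothetical boost: log-scaled read count (an assumption, not from the thread)."""
    return math.log10(1 + reads)


def needs_reindex(indexed_reads: int, current_reads: int, threshold: float = 0.02) -> bool:
    """True when the boost has moved by more than `threshold`, relative to the indexed value."""
    old = boost(indexed_reads)
    new = boost(current_reads)
    if old == 0.0:
        # No boost indexed yet: any non-zero boost counts as a significant change.
        return new != 0.0
    return abs(new - old) / old > threshold


def select_for_reindex(indexed: dict, current: dict, threshold: float = 0.02) -> list:
    """Return the sorted document IDs whose boost changed enough to justify re-indexing.

    `indexed` maps doc ID -> read count stored in the index,
    `current` maps doc ID -> authoritative read count right now.
    """
    return sorted(
        doc_id
        for doc_id, reads in current.items()
        if needs_reindex(indexed.get(doc_id, 0), reads, threshold)
    )
```

With this kind of check, a document moving from 1000000 to 1000001 reads is left alone, while one moving from 10 to 1000 reads is queued for re-indexing. The authoritative count shown to the user would still be patched into the response after the query, as described in the thread.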

Re: SolrCloud and exernal file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Martin,

I don't think solrconfig.xml would shed any light on this. I've just figured
out what I didn't get about your setup: how cores are explicitly assigned to a
collection. Now I understand most of the details after all!
The ball is in your court: let us know whether you have managed to get your
cores to commit one by one to avoid the freeze, or whether you could eliminate
the pauses by allocating more hardware.
Thanks in advance!
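
To make the "commit one by one" question concrete, here is a rough client-side sketch. It assumes the cores are reachable under hypothetical names such as core0 ... core15 below the base URL used in the quoted curl commands, and it only shows the sequencing. As noted in the quoted exchange below, a commit sent to a single core was observed to result in commits on all cores in this setup, so whether this avoids the freeze is exactly the open question.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def commit_url(base_url: str, core: str) -> str:
    """Build the update URL that asks one core to commit."""
    query = urlencode({"commit": "true"})
    return f"{base_url.rstrip('/')}/{core}/update?{query}"


def commit_one_by_one(base_url, cores, send=None, timeout=3600):
    """Send one commit per core, waiting for each response before the next request.

    `send` is an injectable callable taking a URL (useful for dry runs and tests);
    by default it performs a blocking HTTP GET.
    Returns the URLs that were requested, in order.
    """
    if send is None:
        def send(url):
            with urlopen(url, timeout=timeout) as response:
                response.read()

    requested = []
    for core in cores:
        url = commit_url(base_url, core)
        send(url)  # returns only after this core's commit request has been answered
        requested.append(url)
    return requested
```

A dry run that only prints the URLs would be commit_one_by_one("http://localhost:8080/solr", ["core0", "core1"], send=print).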


On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <ma...@issuu.com> wrote:

> Mikhail,
>
> PSB
>
> On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
> > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com> wrote:
> >
> > >
> > > I wasn't aware until now that it is possible to send a commit to one
> core
> > > only. What we observed was the effect of curl
> > > localhost:8080/solr/update?commit=true but perhaps we should experiment
> > > with solr/coreN/update?commit=true. A quick trial run seems to indicate
> > > that a commit to a single core causes commits on all cores.
> > >
> > You should see something like this in the log:
> > ... SolrCmdDistributor .... Distrib commit to: ...
> >
> Yup, a commit towards a single core results in a commit on all cores.
>
>
> > >
> > >
> > > Perhaps I should clarify that we are using SOLR as a black box; we do
> not
> > > touch the code at all - we only install the distribution WAR file and
> > > proceed from there.
> > >
> > I still don't understand how you deploy/launch Solr. How many Jetty
> > instances do you start? Do you use -DzkRun -DzkHost -DnumShards=2, or do
> > you specify the shards= param for every request and distribute updates
> > yourself? What collections do you create and with which settings?
> >
> We let SOLR do the sharding using one collection with 16 SOLR cores
> holding one shard each. We launch only one instance of Jetty with the
> following arguments:
>
> -DnumShards=16
> -DzkHost=<zookeeperhost:port>
> -Xmx10G
> -Xms10G
> -Xmn2G
> -server
>
> Would you like to see the solrconfig.xml?
>
> /Martin
>
>
> > >
> > >
> > > > Also, from my POV such deployments should start from at least *16*
> > > > 4-way vboxes; it's more expensive, but availability should be much
> > > > better during CPU-consuming operations.
> > > >
> > >
> > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
> > with
> > > 16 cores? Or am I misunderstanding something :) ?
> > >
> > I prefer to start from 16 hosts with 4 cores each.
> >
> >
> > >
> > >
> > > > Other details: if you use a single Jetty for all of them, are you sure
> > > > that Jetty's thread pool doesn't limit requests? Is it large enough?
> > > > You have 60G and set -Xmx=10G. Are you sure that the total size of the
> > > > cores' index directories is less than 45G?
> > > >
> > > The total index size is 230 GB, so it won't fit in RAM, but we're using
> > > an SSD disk to minimize disk access time. We have tried putting the EFF
> > > onto a RAM disk, but this didn't have a measurable effect.
> > >
> > > Thanks,
> > > /Martin
> > >
> > >
> > > > Thanks
> > > >
> > > >
> > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com> wrote:
> > > >
> > > > > Mikhail
> > > > >
> > > > > PSB
> > > > >
> > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > > > > mkhludnev@griddynamics.com> wrote:
> > > > >
> > > > > > Martin,
> > > > > >
> > > > > > Please find additional question from me below.
> > > > > >
> > > > > > Simone,
> > > > > >
> > > > > > I'm sorry for hijacking your thread. The only what I've heard
> about
> > > it
> > > > at
> > > > > > recent ApacheCon sessions is that Zookeeper is supposed to
> > replicate
> > > > > those
> > > > > > files as configs under solr home. And I'm really looking forward
> to
> > > > know
> > > > > > how it works with huge files in production.
> > > > > >
> > > > > > Thank You, Guys!
> > > > > >
> > > > > > 20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com>
> > написал:
> > > > > > >
> > > > > > > Hi Mikhail
> > > > > > >
> > > > > > > Please see answers below.
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > > >
> > > > > > > > Martin,
> > > > > > > >
> > > > > > > > Thank you for telling your own "war-story". It's really
> useful
> > > for
> > > > > > > > community.
> > > > > > > > The first question might seems not really conscious, but
> would
> > > you
> > > > > tell
> > > > > > me
> > > > > > > > what blocks searching during EFF reload, when it's triggered
> by
> > > > > handler
> > > > > > or
> > > > > > > > by listener?
> > > > > > > >
> > > > > > >
> > > > > > > We continuously index new documents using CommitWithin to get
> > > regular
> > > > > > > commits. However, we observed that the EFFs were not re-read,
> so
> > we
> > > > had
> > > > > > to
> > > > > > > do external commits (curl '.../solr/update?commit=true') to
> force
> > > > > reload.
> > > > > > > When this is done, solr blocks. I can't tell you exactly why
> it's
> > > > doing
> > > > > > > that (it was related to SOLR-3985).
> > > > > >
> > > > > > Is there a chance to get a thread dump when they are blocked?
> > > > > >
> > > > > >
> > > > > Well I could try to recreate the situation. But the setup is fairly
> > > > simple:
> > > > > Create a large EFF in a largeish index with many shards. Issue a
> > > commit,
> > > > > and then try to do a search. Solr will not respond to the search
> > before
> > > > the
> > > > > commit has completed, and this will take a long time.
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > I don't really get the sentence about sequential commits and
> > > number
> > > > > of
> > > > > > > > cores. Do I get right that file is replicated via Zookeeper?
> > > > Doesn't
> > > > > it
> > > > > > > >
> > > > > > >
> > > > > > > Again, this is observed behavior. When we issue a commit on a
> > > system
> > > > > with
> > > > > > a
> > > > > > > system with many solr cores using EFFs, the system blocks for a
> > > long
> > > > > time
> > > > > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF
> is a
> > > > > symlink
> > > > > > > from each cores index dir to the actual file, which is updated
> by
> > > an
> > > > > > > external process.
> > > > > >
> > > > > > Hold on, I asked about Zookeeper because the subj mentions
> > SolrCloud.
> > > > > >
> > > > > > Do you use SolrCloud, SolrShards, or these cores are just
> replicas
> > of
> > > > the
> > > > > > same index?
> > > > > >
> > > > >
> > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
> I'm
> > a
> > > > bit
> > > > > unsure about the terminology here, but we've got a single index
> > divided
> > > > > into 16 shard. Each shard is hosted in a solr core.
> > > > >
> > > > >
> > > > > > Also, about simlink - Don't you share that file via some NFS?
> > > > > >
> > > > > > No, we generate the EFF on the local solr host (there is only one
> > > > > physical
> > > > > host that holds all shards), so there is no need for NFS or copying
> > > files
> > > > > around. No need for Zookeeper either.
> > > > >
> > > > >
> > > > > > how many cores you run per box?
> > > > > >
> > > > > This box is a 16-virtual core (8 hyperthreaded cores)  with 60GB of
> > > RAM.
> > > > We
> > > > > run 16 solr cores on this box in Jetty.
> > > > >
> > > > >
> > > > > > Do boxes has plenty of ram to cache filesystem beside of jvm
> heaps?
> > > > > >
> > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the
> OS.
> > > > >
> > > > >
> > > > > > I assume you use 64 bit linux and mmap directory. Please confirm
> > > that.
> > > > > >
> > > > > >
> > > > > We use 64-bit linux. I'm not sure about the mmap directory or where
> > > that
> > > > > would be configured in solr - can you explain that?
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > causes scalability problem or long time to reload? Will it
> help
> > > if
> > > > > > we'll
> > > > > > > > have, let's say ExternalDatabaseField which will pull values
> > from
> > > > > jdbc.
> > > > > > ie.
> > > > > > > >
> > > > > > >
> > > > > > > I think the possibility of having some fields being retrieved
> > from
> > > an
> > > > > > > external, dynamically updatable store would be really
> > interesting.
> > > > This
> > > > > > > could be JDBC, something in-memory like redis, or a NoSql
> product
> > > > (e.g.
> > > > > > > Cassandra).
> > > > > >
> > > > > > Ok. Let's have it in mind as a possible direction.
> > > > > >
> > > > >
> > > > > Alternatively, an API that would allow updating a single field for
> a
> > > > > document might be an option.
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > why all cores can't read these values simultaneously?
> > > > > > > >
> > > > > > >
> > > > > > > Again, this is a solr implementation detail that I can't answer
> > :)
> > > > > > >
> > > > > > >
> > > > > > > > Can you confirm that IDs in the file is ordered by the index
> > term
> > > > > > order?
> > > > > > > >
> > > > > > >
> > > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > > >
> > > > > > >
> > > > > > > > AFAIK it can impact load time.
> > > > > > > >
> > > > > > > Yes, it does
> > > > > >
> > > > > > Ok, I've got that you aware of it, and your IDs are just strings,
> > not
> > > > > > integers.
> > > > > >
> > > > > >
> > > > > Yes, ids are strings.
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Regarding your post-query solution can you tell me if query
> > found
> > > > > 10000
> > > > > > > > docs, but I need to display only first page with 100 rows,
> > > whether
> > > > I
> > > > > > need
> > > > > > > > to pull all 10K results to frontend to order them by the
> rank?
> > > > > > > >
> > > > > > > >
> > > > > > > In our architecture, the clients query an API that generates
> the
> > > SOLR
> > > > > > > query, retrieves the relevant additional fields that we needs,
> > and
> > > > > > returns
> > > > > > > the relevant JSON to the front-end.
> > > > > > >
> > > > > > > In our use case, results are returned from SOLR by the 10's,
> not
> > by
> > > > the
> > > > > > > 1000's, so it is a manageable job. Even so, if solr returned
> > > > thousands
> > > > > of
> > > > > > > results, it would be up to the implementation of the api to
> > augment
> > > > > only
> > > > > > > the results that needed to be returned to the front-end.
> > > > > > >
> > > > > > > Even so, patching up a JSON structure with 10000 results should
> > be
> > > > > > > possible.
> > > > > >
> > > > > > You are right. I'm concerned anyway because retrieving whole
> result
> > > is
> > > > > > expensive, and not always possible.
> > > > > >
> > > > > >
> > > > > In our case, getting the whole result is almost impossible, because
> > > that
> > > > > would be millions of documents, and returning the Nth result seems
> to
> > > be
> > > > a
> > > > > quadratic (or worse) operation in SOLR.
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > I'm really appreciate if you comment on the questions above.
> > > > > > > > PS: It's time to pitch, how much
> > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > > > ExternalFileField" can help you?
> > > > > > > >
> > > > > > > >
> > > > > > > > It looks very interesting :) Does it make it possible to
> avoid
> > > > > > re-reading
> > > > > > > the EFF on every commit, and only re-read the values that have
> > > > actually
> > > > > > > changed?
> > > > > >
> > > > > >
> > > > > > You don't need commit (in SOLR-4085) to reload file content, but
> > > after
> > > > > > commit you need to read whole file and scan all key terms and
> > > postings.
> > > > > > That's because EFF sits on top of top level searcher. it's a
> > > Solr-like
> > > > > way.
> > > > > > In some future we might have per-segment EFF, in this case
> adding a
> > > > > segment
> > > > > > will trigger full file scan, but in the index only that new
> segment
> > > > will
> > > > > be
> > > > > > scanned. It should be faster. You know, straightforward sharing
> > > > internal
> > > > > > data structures between different index views/generations is not
> > > > > possible.
> > > > > > If you are asking about applying delta changes on external file
> > > that's
> > > > > > something what we did ourselves http://goo.gl/P8GFq . This
> feature
> > > is
> > > > > much
> > > > > > more doubtful and vague, although it might be the next
> contribution
> > > > after
> > > > > > SOLR-4085.
> > > > > >
> > > > > > >
> > > > > > > /Martin
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com>
> > > > wrote:
> > > > > > > >
> > > > > > > > > Solr 4.0 does support using EFFs, but it might not give you
> > > what
> > > > > > you're
> > > > > > > > > hoping fore.
> > > > > > > > >
> > > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > > >
> > > > > > > > > The EFF is placed in the parent of the index directory in
> > each
> > > > > core;
> > > > > > each
> > > > > > > > > core reads the entire EFF and picks out the IDs that it is
> > > > > > responsible
> > > > > > > > for.
> > > > > > > > >
> > > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
> > > answer
> > > > > > > > queries)
> > > > > > > > > while re-reading the EFF. Even worse, it seems that the
> time
> > to
> > > > > > re-read
> > > > > > > > the
> > > > > > > > > EFF is multiplied by the number of cores in use (i.e. the
> EFF
> > > is
> > > > > > re-read
> > > > > > > > by
> > > > > > > > > each core sequentially). The contents of the EFF become
> > active
> > > > > after
> > > > > > the
> > > > > > > > > first EXTERNAL commit (commitWithin does NOT work here)
> after
> > > the
> > > > > > file
> > > > > > > > has
> > > > > > > > > been updated.
> > > > > > > > >
> > > > > > > > > In our case, the EFF was quite large - around 450MB - and
> we
> > > use
> > > > 16
> > > > > > > > shards,
> > > > > > > > > so when we triggered an external commit to force
> re-reading,
> > > the
> > > > > > whole
> > > > > > > > > system would block for several (10-15) minutes. This won't
> > work
> > > > in
> > > > > a
> > > > > > > > > production environment. The reason for the size of the EFF
> is
> > > > that
> > > > > we
> > > > > > > > have
> > > > > > > > > around 7M documents in the index; each document has a 45
> > > > character
> > > > > > ID.
> > > > > > > > >
> > > > > > > > > We got some help to try to fix the problem so that the
> > re-read
> > > of
> > > > > the
> > > > > > EFF
> > > > > > > > > proceeds in the background (see
> > > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > > > > a fix on the 4.1 branch). However, even though the re-read
> > > > proceeds
> > > > > > in
> > > > > > > > the
> > > > > > > > > background, the time required to launch solr now takes at
> > least
> > > > as
> > > > > > long
> > > > > > > > as
> > > > > > > > > re-reading the EFFs. Again, this is not good enough for our
> > > > needs.
> > > > > > > > >
> > > > > > > > > The next issue is that you cannot sort on EFF fields
> (though
> > > you
> > > > > can
> > > > > > > > return
> > > > > > > > > them as values using &fl=field(my_eff_field). This is also
> > > fixed
> > > > in
> > > > > > the
> > > > > > > > 4.1
> > > > > > > > > branch here <
> https://issues.apache.org/jira/browse/SOLR-4022
> > >.
> > > > > > > > >
> > > > > > > > > So: Even after these fixes, EFF performance is not that
> > great.
> > > > Our
> > > > > > > > solution
> > > > > > > > > is as follows: The actual value of the popularity measure
> > (say,
> > > > > > reads)
> > > > > > > > that
> > > > > > > > > we want to report to the user is inserted into the search
> > > > response
> > > > > > > > > post-query by our query front-end. This value will then be
> > the
> > > > > > > > > authoritative value at the time of the query. The value of
> > the
> > > > > > popularity
> > > > > > > > > measure that we use for boosting in the ranking of the
> search
> > > > > results
> > > > > > is
> > > > > > > > > only updated when the value has changed enough so that the
> > > impact
> > > > > on
> > > > > > the
> > > > > > > > > boost will be significant (say, more than 2%). This does
> > > require
> > > > > > frequent
> > > > > > > > > re-indexing of the documents that have significant changes
> in
> > > the
> > > > > > number
> > > > > > > > of
> > > > > > > > > reads, but at least we won't have to update a document if
> it
> > > > moves
> > > > > > from,
> > > > > > > > > say, 1000000 to 1000001 reads.
> > > > > > > > >
> > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > > >
> > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> > > > simoneg@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > > > However,
> > > > > > in
> > > > > > > > > this
> > > > > > > > > > index, an external file field is used for popularity
> > ranking.
> > > > > > > > > >
> > > > > > > > > > Does SolrCloud supports external file fields? How does it
> > > cope
> > > > > with
> > > > > > > > > > sharding and replication? Where should the external file
> be
> > > > > placed
> > > > > > now
> > > > > > > > > that
> > > > > > > > > > the index folder is not local but in the cloud?
> > > > > > > > > >
> > > > > > > > > > Are there otherwise other best practices to deal with the
> > use
> > > > > cases
> > > > > > > > > > external file fields were used for, like
> > popularity/ranking,
> > > in
> > > > > > > > > SolrCloud?
> > > > > > > > > > Custom ValueSources going to something external?
> > > > > > > > > >
> > > > > > > > > > Thanks in advance,
> > > > > > > > > > Simone
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Sincerely yours
> > > > > > > > Mikhail Khludnev
> > > > > > > > Principal Engineer,
> > > > > > > > Grid Dynamics
> > > > > > > >
> > > > > > > > <http://www.griddynamics.com>
> > > > > > > >  <mk...@griddynamics.com>
> > > > > > > >
> > > > > >  20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com>
> > > написал:
> > > > > >
> > > > > > > Hi Mikhail
> > > > > > >
> > > > > > > Please see answers below.
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > > >
> > > > > > > > Martin,
> > > > > > > >
> > > > > > > > Thank you for telling your own "war-story". It's really
> useful
> > > for
> > > > > > > > community.
> > > > > > > > The first question might seems not really conscious, but
> would
> > > you
> > > > > tell
> > > > > > > me
> > > > > > > > what blocks searching during EFF reload, when it's triggered
> by
> > > > > handler
> > > > > > > or
> > > > > > > > by listener?
> > > > > > > >
> > > > > > >
> > > > > > > We continuously index new documents using CommitWithin to get
> > > regular
> > > > > > > commits. However, we observed that the EFFs were not re-read,
> so
> > we
> > > > had
> > > > > > to
> > > > > > > do external commits (curl '.../solr/update?commit=true') to
> force
> > > > > reload.
> > > > > > > When this is done, solr blocks. I can't tell you exactly why
> it's
> > > > doing
> > > > > > > that (it was related to SOLR-3985).
> > > > > > >
> > > > > > >
> > > > > > > > I don't really get the sentence about sequential commits and
> > > number
> > > > > of
> > > > > > > > cores. Do I get right that file is replicated via Zookeeper?
> > > > Doesn't
> > > > > it
> > > > > > > >
> > > > > > >
> > > > > > > Again, this is observed behavior. When we issue a commit on a
> > > system
> > > > > > with a
> > > > > > > system with many solr cores using EFFs, the system blocks for a
> > > long
> > > > > time
> > > > > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF
> is a
> > > > > symlink
> > > > > > > from each cores index dir to the actual file, which is updated
> by
> > > an
> > > > > > > external process.
> > > > > > >
> > > > > > >
> > > > > > > > causes scalability problem or long time to reload? Will it
> help
> > > if
> > > > > > we'll
> > > > > > > > have, let's say ExternalDatabaseField which will pull values
> > from
> > > > > jdbc.
> > > > > > > ie.
> > > > > > > >
> > > > > > >
> > > > > > > I think the possibility of having some fields being retrieved
> > from
> > > an
> > > > > > > external, dynamically updatable store would be really
> > interesting.
> > > > This
> > > > > > > could be JDBC, something in-memory like redis, or a NoSql
> product
> > > > (e.g.
> > > > > > > Cassandra).
> > > > > > >
> > > > > > >
> > > > > > > > why all cores can't read these values simultaneously?
> > > > > > > >
> > > > > > >
> > > > > > > Again, this is a solr implementation detail that I can't answer
> > :)
> > > > > > >
> > > > > > >
> > > > > > > > Can you confirm that IDs in the file is ordered by the index
> > term
> > > > > > order?
> > > > > > > >
> > > > > > >
> > > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > > >
> > > > > > >
> > > > > > > > AFAIK it can impact load time.
> > > > > > > >
> > > > > > > Yes, it does.
> > > > > > >
> > > > > > >
> > > > > > > > Regarding your post-query solution can you tell me if query
> > found
> > > > > 10000
> > > > > > > > docs, but I need to display only first page with 100 rows,
> > > whether
> > > > I
> > > > > > need
> > > > > > > > to pull all 10K results to frontend to order them by the
> rank?
> > > > > > > >
> > > > > > > >
> > > > > > > In our architecture, the clients query an API that generates
> the
> > > SOLR
> > > > > > > query, retrieves the relevant additional fields that we needs,
> > and
> > > > > > returns
> > > > > > > the relevant JSON to the front-end.
> > > > > > >
> > > > > > > In our use case, results are returned from SOLR by the 10's,
> not
> > by
> > > > the
> > > > > > > 1000's, so it is a manageable job. Even so, if solr returned
> > > > thousands
> > > > > of
> > > > > > > results, it would be up to the implementation of the api to
> > augment
> > > > > only
> > > > > > > the results that needed to be returned to the front-end.
> > > > > > >
> > > > > > > Even so, patching up a JSON structure with 10000 results should
> > be
> > > > > > > possible.
> > > > > > >
> > > > > > >
> > > > > > > > I'm really appreciate if you comment on the questions above.
> > > > > > > > PS: It's time to pitch, how much
> > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > > > ExternalFileField" can help you?
> > > > > > > >
> > > > > > > >
> > > > > > > > It looks very interesting :) Does it make it possible to
> avoid
> > > > > > re-reading
> > > > > > > the EFF on every commit, and only re-read the values that have
> > > > actually
> > > > > > > changed?
> > > > > > >
> > > > > > > /Martin
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com>
> > > > wrote:
> > > > > > > >
> > > > > > > > > Solr 4.0 does support using EFFs, but it might not give you
> > > what
> > > > > > you're
> > > > > > > > > hoping fore.
> > > > > > > > >
> > > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > > >
> > > > > > > > > The EFF is placed in the parent of the index directory in
> > each
> > > > > core;
> > > > > > > each
> > > > > > > > > core reads the entire EFF and picks out the IDs that it is
> > > > > > responsible
> > > > > > > > for.
> > > > > > > > >
> > > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
> > > answer
> > > > > > > > queries)
> > > > > > > > > while re-reading the EFF. Even worse, it seems that the
> time
> > to
> > > > > > re-read
> > > > > > > > the
> > > > > > > > > EFF is multiplied by the number of cores in use (i.e. the
> EFF
> > > is
> > > > > > > re-read
> > > > > > > > by
> > > > > > > > > each core sequentially). The contents of the EFF become
> > active
> > > > > after
> > > > > > > the
> > > > > > > > > first EXTERNAL commit (commitWithin does NOT work here)
> after
> > > the
> > > > > > file
> > > > > > > > has
> > > > > > > > > been updated.
> > > > > > > > >
> > > > > > > > > In our case, the EFF was quite large - around 450MB - and
> we
> > > use
> > > > 16
> > > > > > > > shards,
> > > > > > > > > so when we triggered an external commit to force
> re-reading,
> > > the
> > > > > > whole
> > > > > > > > > system would block for several (10-15) minutes. This won't
> > work
> > > > in
> > > > > a
> > > > > > > > > production environment. The reason for the size of the EFF
> is
> > > > that
> > > > > we
> > > > > > > > have
> > > > > > > > > around 7M documents in the index; each document has a 45
> > > > character
> > > > > > ID.
> > > > > > > > >
> > > > > > > > > We got some help to try to fix the problem so that the
> > re-read
> > > of
> > > > > the
> > > > > > > EFF
> > > > > > > > > proceeds in the background (see
> > > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > > > > a fix on the 4.1 branch). However, even though the re-read
> > > > proceeds
> > > > > > in
> > > > > > > > the
> > > > > > > > > background, the time required to launch solr now takes at
> > least
> > > > as
> > > > > > long
> > > > > > > > as
> > > > > > > > > re-reading the EFFs. Again, this is not good enough for our
> > > > needs.
> > > > > > > > >
> > > > > > > > > The next issue is that you cannot sort on EFF fields
> (though
> > > you
> > > > > can
> > > > > > > > return
> > > > > > > > > them as values using &fl=field(my_eff_field). This is also
> > > fixed
> > > > in
> > > > > > the
> > > > > > > > 4.1
> > > > > > > > > branch here <
> https://issues.apache.org/jira/browse/SOLR-4022
> > >.
> > > > > > > > >
> > > > > > > > > So: Even after these fixes, EFF performance is not that
> > great.
> > > > Our
> > > > > > > > solution
> > > > > > > > > is as follows: The actual value of the popularity measure
> > (say,
> > > > > > reads)
> > > > > > > > that
> > > > > > > > > we want to report to the user is inserted into the search
> > > > response
> > > > > > > > > post-query by our query front-end. This value will then be
> > the
> > > > > > > > > authoritative value at the time of the query. The value of
> > the
> > > > > > > popularity
> > > > > > > > > measure that we use for boosting in the ranking of the
> search
> > > > > results
> > > > > > > is
> > > > > > > > > only updated when the value has changed enough so that the
> > > impact
> > > > > on
> > > > > > > the
> > > > > > > > > boost will be significant (say, more than 2%). This does
> > > require
> > > > > > > frequent
> > > > > > > > > re-indexing of the documents that have significant changes
> in
> > > the
> > > > > > > number
> > > > > > > > of
> > > > > > > > > reads, but at least we won't have to update a document if
> it
> > > > moves
> > > > > > > from,
> > > > > > > > > say, 1000000 to 1000001 reads.
> > > > > > > > >
> > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > > >
> > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> > > > simoneg@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > > > However,
> > > > > > in
> > > > > > > > > this
> > > > > > > > > > index, an external file field is used for popularity
> > ranking.
> > > > > > > > > >
> > > > > > > > > > Does SolrCloud supports external file fields? How does it
> > > cope
> > > > > with
> > > > > > > > > > sharding and replication? Where should the external file
> be
> > > > > placed
> > > > > > > now
> > > > > > > > > that
> > > > > > > > > > the index folder is not local but in the cloud?
> > > > > > > > > >
> > > > > > > > > > Are there otherwise other best practices to deal with the
> > use
> > > > > cases
> > > > > > > > > > external file fields were used for, like
> > popularity/ranking,
> > > in
> > > > > > > > > SolrCloud?
> > > > > > > > > > Custom ValueSources going to something external?
> > > > > > > > > >
> > > > > > > > > > Thanks in advance,
> > > > > > > > > > Simone
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Sincerely yours
> > > > > > > > Mikhail Khludnev
> > > > > > > > Principal Engineer,
> > > > > > > > Grid Dynamics
> > > > > > > >
> > > > > > > > <http://www.griddynamics.com>
> > > > > > > >  <mk...@griddynamics.com>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Sincerely yours
> > > > Mikhail Khludnev
> > > > Principal Engineer,
> > > > Grid Dynamics
> > > >
> > > > <http://www.griddynamics.com>
> > > >  <mk...@griddynamics.com>
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> >  <mk...@griddynamics.com>
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

Mikhail,

PSB

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com> wrote:
>
> >
> > I wasn't aware until now that it is possible to send a commit to one core
> > only. What we observed was the effect of curl
> > localhost:8080/solr/update?commit=true but perhaps we should experiment
> > with solr/coreN/update?commit=true. A quick trial run seems to indicate
> > that a commit to a single core causes commits on all cores.
> >
> You should see something like this in the log:
> ... SolrCmdDistributor .... Distrib commit to: ...
>
Yup, a commit towards a single core results in a commit on all cores.
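
For reference, the two kinds of commit request discussed above can be sketched in Python. This is only a minimal sketch, not a tested client: the host and port come from the curl command quoted above, while the core name `core1` and the helper names `commit_url` and `send_commit` are assumptions made up for illustration. As observed above, in our setup either URL ended up committing on all cores.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def commit_url(base="http://localhost:8080/solr", core=None):
    """Build an explicit-commit URL for the default update handler,
    or for one named core when `core` is given (hypothetical name)."""
    path = f"{base}/{core}/update" if core else f"{base}/update"
    return f"{path}?{urlencode({'commit': 'true'})}"


def send_commit(url, timeout=900):
    """Issue the commit; with a large EFF this may block for minutes."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    print(commit_url())              # instance-wide update handler
    print(commit_url(core="core1"))  # single core; still distributed in SolrCloud
```

The long timeout is deliberate, since the commits described in this thread took 10-15 minutes to return.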


> >
> >
> > Perhaps I should clarify that we are using SOLR as a black box; we do not
> > touch the code at all - we only install the distribution WAR file and
> > proceed from there.
> >
> I still don't understand how you deploy/launch Solr. How many Jetty
> instances do you start? Do you run with -DzkRun -DzkHost -DnumShards=2, or
> do you specify the shards= param for every request and distribute updates
> yourself? Which collections do you create, and with which settings?
>
We let SOLR do the sharding using one collection with 16 SOLR cores
holding one shard each. We launch only one instance of jetty with the
following arguments:

-DnumShards=16
-DzkHost=<zookeeperhost:port>
-Xmx10G
-Xms10G
-Xmn2G
-server
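
As a rough sanity check on these settings, the arithmetic below estimates how much of the index the OS page cache could hold. It is only a sketch: the 60 GB of RAM and the 230 GB index size are the figures quoted further down in this thread, the function name is made up for illustration, and it assumes nothing else on the box uses a significant amount of memory.

```python
def page_cache_estimate(ram_gb, heap_gb, index_gb):
    """Return (GB left for the OS page cache, fraction of the index that fits)."""
    cache_gb = ram_gb - heap_gb
    return cache_gb, min(1.0, cache_gb / index_gb)


if __name__ == "__main__":
    # 60 GB box, 10 GB JVM heap (-Xmx10G), 230 GB total index size.
    cache_gb, fraction = page_cache_estimate(ram_gb=60, heap_gb=10, index_gb=230)
    print(f"{cache_gb} GB for page cache, about {fraction:.0%} of the index")
```

So at best roughly a fifth of the index can stay cached in RAM, which is why we rely on the SSD for the rest.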

Would you like to see the solrconfig.xml?

/Martin


> >
> >
> > > Also from my POV such deployments should start at least from *16* 4-way
> > > vboxes, it's more expensive, but should be much better available during
> > > cpu-consuming operations.
> > >
> >
> > Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
> > with 16 cores? Or am I misunderstanding something :) ?
> >
> I prefer to start from 16 hosts with 4 cores each.
>
>
> >
> >
> > > Other details, if you use single jetty for all of them, are you sure
> > > that jetty's threadpool doesn't limit requests? is it large enough?
> > > You have 60G and set -Xmx=10G. are you sure that total size of cores
> > > index directories is less than 45G?
> > >
> > The total index size is 230 GB, so it won't fit in ram, but we're using
> > an SSD disk to minimize disk access time. We have tried putting the EFF
> > onto a ram disk, but this didn't have a measurable effect.
> >
> > Thanks,
> > /Martin
> >
> >
> > > Thanks
> > >
> > >
> > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com> wrote:
> > >
> > > > Mikhail
> > > >
> > > > PSB
> > > >
> > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > > > mkhludnev@griddynamics.com> wrote:
> > > >
> > > > > Martin,
> > > > >
> > > > > Please find additional question from me below.
> > > > >
> > > > > Simone,
> > > > >
> > > > > I'm sorry for hijacking your thread. The only thing I've heard about
> > > > > it at recent ApacheCon sessions is that Zookeeper is supposed to
> > > > > replicate those files as configs under solr home. And I'm really
> > > > > looking forward to knowing how it works with huge files in production.
> > > > >
> > > > > Thank You, Guys!
> > > > >
> > > > > On 20.11.2012 18:06, "Martin Koch" <ma...@issuu.com> wrote:
> > > > > >
> > > > > > Hi Mikhail
> > > > > >
> > > > > > Please see answers below.
> > > > > >
> > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > >
> > > > > > > Martin,
> > > > > > >
> > > > > > > Thank you for telling your own "war-story". It's really useful
> > for
> > > > > > > community.
> > > > > > > The first question might seems not really conscious, but would
> > you
> > > > tell
> > > > > me
> > > > > > > what blocks searching during EFF reload, when it's triggered by
> > > > handler
> > > > > or
> > > > > > > by listener?
> > > > > > >
> > > > > >
> > > > > > We continuously index new documents using CommitWithin to get
> > regular
> > > > > > commits. However, we observed that the EFFs were not re-read, so
> we
> > > had
> > > > > to
> > > > > > do external commits (curl '.../solr/update?commit=true') to force
> > > > reload.
> > > > > > When this is done, solr blocks. I can't tell you exactly why it's
> > > doing
> > > > > > that (it was related to SOLR-3985).
> > > > >
> > > > > Is there a chance to get a thread dump when they are blocked?
> > > > >
> > > > >
> > > > Well I could try to recreate the situation. But the setup is fairly
> > > simple:
> > > > Create a large EFF in a largeish index with many shards. Issue a
> > commit,
> > > > and then try to do a search. Solr will not respond to the search
> before
> > > the
> > > > commit has completed, and this will take a long time.
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > I don't really get the sentence about sequential commits and
> > number
> > > > of
> > > > > > > cores. Do I get right that file is replicated via Zookeeper?
> > > Doesn't
> > > > it
> > > > > > >
> > > > > >
> > > > > > Again, this is observed behavior. When we issue a commit on a
> > system
> > > > with
> > > > > a
> > > > > > system with many solr cores using EFFs, the system blocks for a
> > long
> > > > time
> > > > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
> > > > symlink
> > > > > > from each cores index dir to the actual file, which is updated by
> > an
> > > > > > external process.
> > > > >
> > > > > Hold on, I asked about Zookeeper because the subj mentions
> SolrCloud.
> > > > >
> > > > > Do you use SolrCloud, SolrShards, or these cores are just replicas
> of
> > > the
> > > > > same index?
> > > > >
> > > >
> > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm
> a
> > > bit
> > > > unsure about the terminology here, but we've got a single index
> divided
> > > > into 16 shard. Each shard is hosted in a solr core.
> > > >
> > > >
> > > > > Also, about simlink - Don't you share that file via some NFS?
> > > > >
> > > > > No, we generate the EFF on the local solr host (there is only one
> > > > physical
> > > > host that holds all shards), so there is no need for NFS or copying
> > files
> > > > around. No need for Zookeeper either.
> > > >
> > > >
> > > > > how many cores you run per box?
> > > > >
> > > > This box is a 16-virtual core (8 hyperthreaded cores)  with 60GB of
> > RAM.
> > > We
> > > > run 16 solr cores on this box in Jetty.
> > > >
> > > >
> > > > > Do boxes has plenty of ram to cache filesystem beside of jvm heaps?
> > > > >
> > > > > Yes. We've allocated 10GB for jetty, and left the rest for the OS.
> > > >
> > > >
> > > > > I assume you use 64 bit linux and mmap directory. Please confirm
> > that.
> > > > >
> > > > >
> > > > We use 64-bit linux. I'm not sure about the mmap directory or where
> > that
> > > > would be configured in solr - can you explain that?
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > causes scalability problem or long time to reload? Will it help
> > if
> > > > > we'll
> > > > > > > have, let's say ExternalDatabaseField which will pull values
> from
> > > > jdbc.
> > > > > ie.
> > > > > > >
> > > > > >
> > > > > > I think the possibility of having some fields being retrieved
> from
> > an
> > > > > > external, dynamically updatable store would be really
> interesting.
> > > This
> > > > > > could be JDBC, something in-memory like redis, or a NoSql product
> > > (e.g.
> > > > > > Cassandra).
> > > > >
> > > > > Ok. Let's have it in mind as a possible direction.
> > > > >
> > > >
> > > > Alternatively, an API that would allow updating a single field for a
> > > > document might be an option.
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > why all cores can't read these values simultaneously?
> > > > > > >
> > > > > >
> > > > > > Again, this is a solr implementation detail that I can't answer
> :)
> > > > > >
> > > > > >
> > > > > > > Can you confirm that IDs in the file is ordered by the index
> term
> > > > > order?
> > > > > > >
> > > > > >
> > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > >
> > > > > >
> > > > > > > AFAIK it can impact load time.
> > > > > > >
> > > > > > Yes, it does
> > > > >
> > > > > Ok, I've got that you aware of it, and your IDs are just strings,
> not
> > > > > integers.
> > > > >
> > > > >
> > > > Yes, ids are strings.
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > Regarding your post-query solution can you tell me if query
> found
> > > > 10000
> > > > > > > docs, but I need to display only first page with 100 rows,
> > whether
> > > I
> > > > > need
> > > > > > > to pull all 10K results to frontend to order them by the rank?
> > > > > > >
> > > > > > >
> > > > > > In our architecture, the clients query an API that generates the
> > SOLR
> > > > > > query, retrieves the relevant additional fields that we needs,
> and
> > > > > returns
> > > > > > the relevant JSON to the front-end.
> > > > > >
> > > > > > In our use case, results are returned from SOLR by the 10's, not
> by
> > > the
> > > > > > 1000's, so it is a manageable job. Even so, if solr returned
> > > thousands
> > > > of
> > > > > > results, it would be up to the implementation of the api to
> augment
> > > > only
> > > > > > the results that needed to be returned to the front-end.
> > > > > >
> > > > > > Even so, patching up a JSON structure with 10000 results should
> be
> > > > > > possible.
> > > > >
> > > > > You are right. I'm concerned anyway because retrieving whole result
> > is
> > > > > expensive, and not always possible.
> > > > >
> > > > >
> > > > In our case, getting the whole result is almost impossible, because
> > that
> > > > would be millions of documents, and returning the Nth result seems to
> > be
> > > a
> > > > quadratic (or worse) operation in SOLR.
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > I'm really appreciate if you comment on the questions above.
> > > > > > > PS: It's time to pitch, how much
> > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > > ExternalFileField" can help you?
> > > > > > >
> > > > > > >
> > > > > > > It looks very interesting :) Does it make it possible to avoid
> > > > > re-reading
> > > > > > the EFF on every commit, and only re-read the values that have
> > > actually
> > > > > > changed?
> > > > >
> > > > >
> > > > > You don't need commit (in SOLR-4085) to reload file content, but
> > after
> > > > > commit you need to read whole file and scan all key terms and
> > postings.
> > > > > That's because EFF sits on top of top level searcher. it's a
> > Solr-like
> > > > way.
> > > > > In some future we might have per-segment EFF, in this case adding a
> > > > segment
> > > > > will trigger full file scan, but in the index only that new segment
> > > will
> > > > be
> > > > > scanned. It should be faster. You know, straightforward sharing
> > > internal
> > > > > data structures between different index views/generations is not
> > > > possible.
> > > > > If you are asking about applying delta changes on external file
> > that's
> > > > > something what we did ourselves http://goo.gl/P8GFq . This feature
> > is
> > > > much
> > > > > more doubtful and vague, although it might be the next contribution
> > > after
> > > > > SOLR-4085.
> > > > >
> > > > > >
> > > > > > /Martin
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com>
> > > wrote:
> > > > > > >
> > > > > > > > Solr 4.0 does support using EFFs, but it might not give you
> > what
> > > > > you're
> > > > > > > > hoping fore.
> > > > > > > >
> > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > >
> > > > > > > > The EFF is placed in the parent of the index directory in
> each
> > > > core;
> > > > > each
> > > > > > > > core reads the entire EFF and picks out the IDs that it is
> > > > > responsible
> > > > > > > for.
> > > > > > > >
> > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
> > answer
> > > > > > > queries)
> > > > > > > > while re-reading the EFF. Even worse, it seems that the time
> to
> > > > > re-read
> > > > > > > the
> > > > > > > > EFF is multiplied by the number of cores in use (i.e. the EFF
> > is
> > > > > re-read
> > > > > > > by
> > > > > > > > each core sequentially). The contents of the EFF become
> active
> > > > after
> > > > > the
> > > > > > > > first EXTERNAL commit (commitWithin does NOT work here) after
> > the
> > > > > file
> > > > > > > has
> > > > > > > > been updated.
> > > > > > > >
> > > > > > > > In our case, the EFF was quite large - around 450MB - and we
> > use
> > > 16
> > > > > > > shards,
> > > > > > > > so when we triggered an external commit to force re-reading,
> > the
> > > > > whole
> > > > > > > > system would block for several (10-15) minutes. This won't
> work
> > > in
> > > > a
> > > > > > > > production environment. The reason for the size of the EFF is
> > > that
> > > > we
> > > > > > > have
> > > > > > > > around 7M documents in the index; each document has a 45
> > > character
> > > > > ID.
> > > > > > > >
> > > > > > > > We got some help to try to fix the problem so that the
> re-read
> > of
> > > > the
> > > > > EFF
> > > > > > > > proceeds in the background (see
> > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > > > a fix on the 4.1 branch). However, even though the re-read
> > > proceeds
> > > > > in
> > > > > > > the
> > > > > > > > background, the time required to launch solr now takes at
> least
> > > as
> > > > > long
> > > > > > > as
> > > > > > > > re-reading the EFFs. Again, this is not good enough for our
> > > needs.
> > > > > > > >
> > > > > > > > The next issue is that you cannot sort on EFF fields (though
> > you
> > > > can
> > > > > > > return
> > > > > > > > them as values using &fl=field(my_eff_field). This is also
> > fixed
> > > in
> > > > > the
> > > > > > > 4.1
> > > > > > > > branch here <https://issues.apache.org/jira/browse/SOLR-4022
> >.
> > > > > > > >
> > > > > > > > So: Even after these fixes, EFF performance is not that
> great.
> > > Our
> > > > > > > solution
> > > > > > > > is as follows: The actual value of the popularity measure
> (say,
> > > > > reads)
> > > > > > > that
> > > > > > > > we want to report to the user is inserted into the search
> > > response
> > > > > > > > post-query by our query front-end. This value will then be
> the
> > > > > > > > authoritative value at the time of the query. The value of
> the
> > > > > popularity
> > > > > > > > measure that we use for boosting in the ranking of the search
> > > > results
> > > > > is
> > > > > > > > only updated when the value has changed enough so that the
> > impact
> > > > on
> > > > > the
> > > > > > > > boost will be significant (say, more than 2%). This does
> > require
> > > > > frequent
> > > > > > > > re-indexing of the documents that have significant changes in
> > the
> > > > > number
> > > > > > > of
> > > > > > > > reads, but at least we won't have to update a document if it
> > > moves
> > > > > from,
> > > > > > > > say, 1000000 to 1000001 reads.
> > > > > > > >
> > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > >
> > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> > > simoneg@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > > However,
> > > > > in
> > > > > > > > this
> > > > > > > > > index, an external file field is used for popularity
> ranking.
> > > > > > > > >
> > > > > > > > > Does SolrCloud supports external file fields? How does it
> > cope
> > > > with
> > > > > > > > > sharding and replication? Where should the external file be
> > > > placed
> > > > > now
> > > > > > > > that
> > > > > > > > > the index folder is not local but in the cloud?
> > > > > > > > >
> > > > > > > > > Are there otherwise other best practices to deal with the
> use
> > > > cases
> > > > > > > > > external file fields were used for, like
> popularity/ranking,
> > in
> > > > > > > > SolrCloud?
> > > > > > > > > Custom ValueSources going to something external?
> > > > > > > > >
> > > > > > > > > Thanks in advance,
> > > > > > > > > Simone
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Sincerely yours
> > > > > > > Mikhail Khludnev
> > > > > > > Principal Engineer,
> > > > > > > Grid Dynamics
> > > > > > >
> > > > > > > <http://www.griddynamics.com>
> > > > > > >  <mk...@griddynamics.com>
> > > > > > >
> > > > >  20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com>
> > написал:
> > > > >
> > > > > > Hi Mikhail
> > > > > >
> > > > > > Please see answers below.
> > > > > >
> > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > >
> > > > > > > Martin,
> > > > > > >
> > > > > > > Thank you for telling your own "war-story". It's really useful
> > for
> > > > > > > community.
> > > > > > > The first question might seems not really conscious, but would
> > you
> > > > tell
> > > > > > me
> > > > > > > what blocks searching during EFF reload, when it's triggered by
> > > > handler
> > > > > > or
> > > > > > > by listener?
> > > > > > >
> > > > > >
> > > > > > We continuously index new documents using CommitWithin to get
> > regular
> > > > > > commits. However, we observed that the EFFs were not re-read, so
> we
> > > had
> > > > > to
> > > > > > do external commits (curl '.../solr/update?commit=true') to force
> > > > reload.
> > > > > > When this is done, solr blocks. I can't tell you exactly why it's
> > > doing
> > > > > > that (it was related to SOLR-3985).
> > > > > >
> > > > > >
> > > > > > > I don't really get the sentence about sequential commits and
> > number
> > > > of
> > > > > > > cores. Do I get right that file is replicated via Zookeeper?
> > > Doesn't
> > > > it
> > > > > > >
> > > > > >
> > > > > > Again, this is observed behavior. When we issue a commit on a
> > system
> > > > > with a
> > > > > > system with many solr cores using EFFs, the system blocks for a
> > long
> > > > time
> > > > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
> > > > symlink
> > > > > > from each cores index dir to the actual file, which is updated by
> > an
> > > > > > external process.
> > > > > >
> > > > > >
> > > > > > > causes scalability problem or long time to reload? Will it help
> > if
> > > > > we'll
> > > > > > > have, let's say ExternalDatabaseField which will pull values
> from
> > > > jdbc.
> > > > > > ie.
> > > > > > >
> > > > > >
> > > > > > I think the possibility of having some fields being retrieved
> from
> > an
> > > > > > external, dynamically updatable store would be really
> interesting.
> > > This
> > > > > > could be JDBC, something in-memory like redis, or a NoSql product
> > > (e.g.
> > > > > > Cassandra).
> > > > > >
> > > > > >
> > > > > > > why all cores can't read these values simultaneously?
> > > > > > >
> > > > > >
> > > > > > Again, this is a solr implementation detail that I can't answer
> :)
> > > > > >
> > > > > >
> > > > > > > Can you confirm that IDs in the file is ordered by the index
> term
> > > > > order?
> > > > > > >
> > > > > >
> > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > >
> > > > > >
> > > > > > > AFAIK it can impact load time.
> > > > > > >
> > > > > > Yes, it does.
> > > > > >
> > > > > >
> > > > > > > Regarding your post-query solution can you tell me if query
> found
> > > > 10000
> > > > > > > docs, but I need to display only first page with 100 rows,
> > whether
> > > I
> > > > > need
> > > > > > > to pull all 10K results to frontend to order them by the rank?
> > > > > > >
> > > > > > >
> > > > > > In our architecture, the clients query an API that generates the
> > SOLR
> > > > > > query, retrieves the relevant additional fields that we needs,
> and
> > > > > returns
> > > > > > the relevant JSON to the front-end.
> > > > > >
> > > > > > In our use case, results are returned from SOLR by the 10's, not
> by
> > > the
> > > > > > 1000's, so it is a manageable job. Even so, if solr returned
> > > thousands
> > > > of
> > > > > > results, it would be up to the implementation of the api to
> augment
> > > > only
> > > > > > the results that needed to be returned to the front-end.
> > > > > >
> > > > > > Even so, patching up a JSON structure with 10000 results should
> be
> > > > > > possible.
> > > > > >
> > > > > >
> > > > > > > I'm really appreciate if you comment on the questions above.
> > > > > > > PS: It's time to pitch, how much
> > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > > ExternalFileField" can help you?
> > > > > > >
> > > > > > >
> > > > > > > It looks very interesting :) Does it make it possible to avoid
> > > > > re-reading
> > > > > > the EFF on every commit, and only re-read the values that have
> > > actually
> > > > > > changed?
> > > > > >
> > > > > > /Martin
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com>
> > > wrote:
> > > > > > >
> > > > > > > > Solr 4.0 does support using EFFs, but it might not give you
> > what
> > > > > you're
> > > > > > > > hoping fore.
> > > > > > > >
> > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > >
> > > > > > > > The EFF is placed in the parent of the index directory in
> each
> > > > core;
> > > > > > each
> > > > > > > > core reads the entire EFF and picks out the IDs that it is
> > > > > responsible
> > > > > > > for.
> > > > > > > >
> > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
> > answer
> > > > > > > queries)
> > > > > > > > while re-reading the EFF. Even worse, it seems that the time
> to
> > > > > re-read
> > > > > > > the
> > > > > > > > EFF is multiplied by the number of cores in use (i.e. the EFF
> > is
> > > > > > re-read
> > > > > > > by
> > > > > > > > each core sequentially). The contents of the EFF become
> active
> > > > after
> > > > > > the
> > > > > > > > first EXTERNAL commit (commitWithin does NOT work here) after
> > the
> > > > > file
> > > > > > > has
> > > > > > > > been updated.
> > > > > > > >
> > > > > > > > In our case, the EFF was quite large - around 450MB - and we
> > use
> > > 16
> > > > > > > shards,
> > > > > > > > so when we triggered an external commit to force re-reading,
> > the
> > > > > whole
> > > > > > > > system would block for several (10-15) minutes. This won't
> work
> > > in
> > > > a
> > > > > > > > production environment. The reason for the size of the EFF is
> > > that
> > > > we
> > > > > > > have
> > > > > > > > around 7M documents in the index; each document has a 45
> > > character
> > > > > ID.
> > > > > > > >
> > > > > > > > We got some help to try to fix the problem so that the
> re-read
> > of
> > > > the
> > > > > > EFF
> > > > > > > > proceeds in the background (see
> > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > > > a fix on the 4.1 branch). However, even though the re-read
> > > proceeds
> > > > > in
> > > > > > > the
> > > > > > > > background, the time required to launch solr now takes at
> least
> > > as
> > > > > long
> > > > > > > as
> > > > > > > > re-reading the EFFs. Again, this is not good enough for our
> > > needs.
> > > > > > > >
> > > > > > > > The next issue is that you cannot sort on EFF fields (though
> > you
> > > > can
> > > > > > > return
> > > > > > > > them as values using &fl=field(my_eff_field). This is also
> > fixed
> > > in
> > > > > the
> > > > > > > 4.1
> > > > > > > > branch here <https://issues.apache.org/jira/browse/SOLR-4022
> >.
> > > > > > > >
> > > > > > > > So: Even after these fixes, EFF performance is not that
> great.
> > > Our
> > > > > > > solution
> > > > > > > > is as follows: The actual value of the popularity measure
> (say,
> > > > > reads)
> > > > > > > that
> > > > > > > > we want to report to the user is inserted into the search
> > > response
> > > > > > > > post-query by our query front-end. This value will then be
> the
> > > > > > > > authoritative value at the time of the query. The value of
> the
> > > > > > popularity
> > > > > > > > measure that we use for boosting in the ranking of the search
> > > > results
> > > > > > is
> > > > > > > > only updated when the value has changed enough so that the
> > impact
> > > > on
> > > > > > the
> > > > > > > > boost will be significant (say, more than 2%). This does
> > require
> > > > > > frequent
> > > > > > > > re-indexing of the documents that have significant changes in
> > the
> > > > > > number
> > > > > > > of
> > > > > > > > reads, but at least we won't have to update a document if it
> > > moves
> > > > > > from,
> > > > > > > > say, 1000000 to 1000001 reads.
> > > > > > > >
> > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > >
> > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> > > simoneg@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > > However,
> > > > > in
> > > > > > > > this
> > > > > > > > > index, an external file field is used for popularity
> ranking.
> > > > > > > > >
> > > > > > > > > Does SolrCloud supports external file fields? How does it
> > cope
> > > > with
> > > > > > > > > sharding and replication? Where should the external file be
> > > > placed
> > > > > > now
> > > > > > > > that
> > > > > > > > > the index folder is not local but in the cloud?
> > > > > > > > >
> > > > > > > > > Are there otherwise other best practices to deal with the
> use
> > > > cases
> > > > > > > > > external file fields were used for, like
> popularity/ranking,
> > in
> > > > > > > > SolrCloud?
> > > > > > > > > Custom ValueSources going to something external?
> > > > > > > > >
> > > > > > > > > Thanks in advance,
> > > > > > > > > Simone
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Sincerely yours
> > > > > > > Mikhail Khludnev
> > > > > > > Principal Engineer,
> > > > > > > Grid Dynamics
> > > > > > >
> > > > > > > <http://www.griddynamics.com>
> > > > > > >  <mk...@griddynamics.com>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > >  <mk...@griddynamics.com>
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>

Re: SolrCloud and exernal file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <ma...@issuu.com> wrote:

>
> I wasn't aware until now that it is possible to send a commit to one core
> only. What we observed was the effect of curl
> localhost:8080/solr/update?commit=true but perhaps we should experiment
> with solr/coreN/update?commit=true. A quick trial run seems to indicate
> that a commit to a single core causes commits on all cores.
>
You should see something like this in the log:
... SolrCmdDistributor .... Distrib commit to: ...
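
A minimal sketch of checking for this, assuming only that the forwarded commits show up in the log with the "Distrib commit to:" text quoted above. The rest of the line layout is not assumed, and the log path is whatever your Jetty setup writes to.

```python
import sys
from typing import Iterable, List

# The only part of the log line taken from the thread; everything
# around it is treated as opaque text.
MARKER = "Distrib commit to:"


def distributed_commit_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that report a commit being forwarded."""
    return [line.rstrip("\n") for line in log_lines if MARKER in line]


def commit_targets(log_lines: Iterable[str]) -> List[str]:
    """Return the text that follows the marker on each matching line."""
    return [
        line.split(MARKER, 1)[1].strip()
        for line in distributed_commit_lines(log_lines)
    ]


if __name__ == "__main__":
    # Usage: python distrib_commits.py /path/to/solr.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for target in commit_targets(handle):
            print(target)
```

If a commit sent to one core prints several targets here, that commit was distributed to the other cores as well.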

>
>
> Perhaps I should clarify that we are using SOLR as a black box; we do not
> touch the code at all - we only install the distribution WAR file and
> proceed from there.
>
I still don't understand how you deploy and launch Solr. How many Jetty
instances do you start? Do you run with -DzkRun -DzkHost -DnumShards=2, or do
you specify the shards= param on every request and distribute updates
yourself? Which collections do you create, and with which settings?
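
To make the second alternative concrete, here is a minimal sketch of a request that carries its own shards= list instead of relying on a ZooKeeper-backed setup to route it. The host, port and core names are made up for illustration; only the shards= parameter itself is Solr's.

```python
from typing import Sequence
from urllib.parse import urlencode


def sharded_query_url(core_url: str, query: str, shards: Sequence[str]) -> str:
    """Build a select URL that lists the shards to query explicitly.

    core_url is the core that receives the request; shards are
    host:port/path addresses written without the http:// prefix.
    """
    params = {"q": query, "shards": ",".join(shards)}
    return f"{core_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical layout: two cores in one Jetty on port 8080.
    print(
        sharded_query_url(
            "http://localhost:8080/solr/core0",
            "*:*",
            ["localhost:8080/solr/core0", "localhost:8080/solr/core1"],
        )
    )
```

If your requests look like this, you are doing manual distributed search and ZooKeeper is not involved; if they carry no shards= param and still return results from all 16 shards, something else is routing them.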


>
>
> > Also from my POV such deployments should start at least from *16* 4-way
> > vboxes, it's more expensive, but should be much better available during
> > cpu-consuming operations.
> >
>
> Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with
> 16 cores? Or am I misunderstanding something :) ?
>
I would start with 16 hosts of 4 cores each.


>
>
> > Other details, if you use single jetty for all of them, are you sure that
> > jetty's threadpool doesn't limit requests? is it large enough?
> > You have 60G and set -Xmx=10G. are you sure that total size of cores index
> > directories is less than 45G?
> >
> The total index size is 230 GB, so it won't fit in ram, but we're using an
> SSD disk to minimize disk access time. We have tried putting the EFF onto a
> ram disk, but this didn't have a measurable effect.
>
> Thanks,
> /Martin
>
>
> > Thanks
> >
> >
> > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com> wrote:
> >
> > > Mikhail
> > >
> > > PSB
> > >
> > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > > mkhludnev@griddynamics.com> wrote:
> > >
> > > > Martin,
> > > >
> > > > Please find additional question from me below.
> > > >
> > > > Simone,
> > > >
> > > > I'm sorry for hijacking your thread. The only what I've heard about
> it
> > at
> > > > recent ApacheCon sessions is that Zookeeper is supposed to replicate
> > > those
> > > > files as configs under solr home. And I'm really looking forward to
> > know
> > > > how it works with huge files in production.
> > > >
> > > > Thank You, Guys!
> > > >
> > > > 20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com> написал:
> > > > >
> > > > > Hi Mikhail
> > > > >
> > > > > Please see answers below.
> > > > >
> > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > mkhludnev@griddynamics.com> wrote:
> > > > >
> > > > > > Martin,
> > > > > >
> > > > > > Thank you for telling your own "war-story". It's really useful
> for
> > > > > > community.
> > > > > > The first question might seems not really conscious, but would
> you
> > > tell
> > > > me
> > > > > > what blocks searching during EFF reload, when it's triggered by
> > > handler
> > > > or
> > > > > > by listener?
> > > > > >
> > > > >
> > > > > We continuously index new documents using CommitWithin to get
> regular
> > > > > commits. However, we observed that the EFFs were not re-read, so we
> > had
> > > > to
> > > > > do external commits (curl '.../solr/update?commit=true') to force
> > > reload.
> > > > > When this is done, solr blocks. I can't tell you exactly why it's
> > doing
> > > > > that (it was related to SOLR-3985).
> > > >
> > > > Is there a chance to get a thread dump when they are blocked?
> > > >
> > > >
> > > Well I could try to recreate the situation. But the setup is fairly simple:
> > > Create a large EFF in a largeish index with many shards. Issue a commit,
> > > and then try to do a search. Solr will not respond to the search before the
> > > commit has completed, and this will take a long time.
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > > I don't really get the sentence about sequential commits and
> number
> > > of
> > > > > > cores. Do I get right that file is replicated via Zookeeper?
> > Doesn't
> > > it
> > > > > >
> > > > >
> > > > > Again, this is observed behavior. When we issue a commit on a
> system
> > > with
> > > > a
> > > > > system with many solr cores using EFFs, the system blocks for a
> long
> > > time
> > > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
> > > symlink
> > > > > from each cores index dir to the actual file, which is updated by
> an
> > > > > external process.
> > > >
> > > > Hold on, I asked about Zookeeper because the subj mentions SolrCloud.
> > > >
> > > > Do you use SolrCloud, SolrShards, or these cores are just replicas of
> > the
> > > > same index?
> > > >
> > >
> > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a
> > bit
> > > unsure about the terminology here, but we've got a single index divided
> > > into 16 shard. Each shard is hosted in a solr core.
> > >
> > >
> > > > Also, about simlink - Don't you share that file via some NFS?
> > > >
> > > No, we generate the EFF on the local solr host (there is only one physical
> > > host that holds all shards), so there is no need for NFS or copying files
> > > around. No need for Zookeeper either.
> > >
> > >
> > > > how many cores you run per box?
> > > >
> > > This box is a 16-virtual core (8 hyperthreaded cores)  with 60GB of
> RAM.
> > We
> > > run 16 solr cores on this box in Jetty.
> > >
> > >
> > > > Do boxes has plenty of ram to cache filesystem beside of jvm heaps?
> > > >
> > > Yes. We've allocated 10GB for jetty, and left the rest for the OS.
> > >
> > >
> > > > I assume you use 64 bit linux and mmap directory. Please confirm
> that.
> > > >
> > > >
> > > We use 64-bit linux. I'm not sure about the mmap directory or where
> that
> > > would be configured in solr - can you explain that?
> > >
> > > >
> > > > >
> > > > >
> > > > > > causes scalability problem or long time to reload? Will it help
> if
> > > > we'll
> > > > > > have, let's say ExternalDatabaseField which will pull values from
> > > jdbc.
> > > > ie.
> > > > > >
> > > > >
> > > > > I think the possibility of having some fields being retrieved from
> an
> > > > > external, dynamically updatable store would be really interesting.
> > This
> > > > > could be JDBC, something in-memory like redis, or a NoSql product
> > (e.g.
> > > > > Cassandra).
> > > >
> > > > Ok. Let's have it in mind as a possible direction.
> > > >
> > >
> > > Alternatively, an API that would allow updating a single field for a
> > > document might be an option.
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > > why all cores can't read these values simultaneously?
> > > > > >
> > > > >
> > > > > Again, this is a solr implementation detail that I can't answer :)
> > > > >
> > > > >
> > > > > > Can you confirm that IDs in the file is ordered by the index term
> > > > order?
> > > > > >
> > > > >
> > > > > Yes, we sorted the files (standard UNIX sort).
> > > > >
> > > > >
> > > > > > AFAIK it can impact load time.
> > > > > >
> > > > > Yes, it does
> > > >
> > > > Ok, I've got that you aware of it, and your IDs are just strings, not
> > > > integers.
> > > >
> > > >
> > > Yes, ids are strings.
> > >
> > > >
> > > > >
> > > > >
> > > > > > Regarding your post-query solution can you tell me if query found
> > > 10000
> > > > > > docs, but I need to display only first page with 100 rows,
> whether
> > I
> > > > need
> > > > > > to pull all 10K results to frontend to order them by the rank?
> > > > > >
> > > > > >
> > > > > In our architecture, the clients query an API that generates the
> SOLR
> > > > > query, retrieves the relevant additional fields that we needs, and
> > > > returns
> > > > > the relevant JSON to the front-end.
> > > > >
> > > > > In our use case, results are returned from SOLR by the 10's, not by
> > the
> > > > > 1000's, so it is a manageable job. Even so, if solr returned
> > thousands
> > > of
> > > > > results, it would be up to the implementation of the api to augment
> > > only
> > > > > the results that needed to be returned to the front-end.
> > > > >
> > > > > Even so, patching up a JSON structure with 10000 results should be
> > > > > possible.
> > > >
> > > > You are right. I'm concerned anyway because retrieving whole result
> is
> > > > expensive, and not always possible.
> > > >
> > > >
> > > In our case, getting the whole result is almost impossible, because
> that
> > > would be millions of documents, and returning the Nth result seems to
> be
> > a
> > > quadratic (or worse) operation in SOLR.
> > >
> > > >
> > > > >
> > > > >
> > > > > > I'm really appreciate if you comment on the questions above.
> > > > > > PS: It's time to pitch, how much
> > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > ExternalFileField" can help you?
> > > > > >
> > > > > >
> > > > > It looks very interesting :) Does it make it possible to avoid re-reading
> > > > > the EFF on every commit, and only re-read the values that have actually
> > > > > changed?
> > > >
> > > >
> > > > You don't need commit (in SOLR-4085) to reload file content, but
> after
> > > > commit you need to read whole file and scan all key terms and
> postings.
> > > > That's because EFF sits on top of top level searcher. it's a
> Solr-like
> > > way.
> > > > In some future we might have per-segment EFF, in this case adding a
> > > segment
> > > > will trigger full file scan, but in the index only that new segment
> > will
> > > be
> > > > scanned. It should be faster. You know, straightforward sharing
> > internal
> > > > data structures between different index views/generations is not
> > > possible.
> > > > If you are asking about applying delta changes on external file
> that's
> > > > something what we did ourselves http://goo.gl/P8GFq . This feature
> is
> > > much
> > > > more doubtful and vague, although it might be the next contribution
> > after
> > > > SOLR-4085.
> > > >
> > > > >
> > > > > /Martin
> > > > >
> > > > >
> > > > > >
> > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com>
> > wrote:
> > > > > >
> > > > > > > Solr 4.0 does support using EFFs, but it might not give you
> what
> > > > you're
> > > > > > > hoping fore.
> > > > > > >
> > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > >
> > > > > > > The EFF is placed in the parent of the index directory in each
> > > core;
> > > > each
> > > > > > > core reads the entire EFF and picks out the IDs that it is
> > > > responsible
> > > > > > for.
> > > > > > >
> > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
> answer
> > > > > > queries)
> > > > > > > while re-reading the EFF. Even worse, it seems that the time to
> > > > re-read
> > > > > > the
> > > > > > > EFF is multiplied by the number of cores in use (i.e. the EFF
> is
> > > > re-read
> > > > > > by
> > > > > > > each core sequentially). The contents of the EFF become active
> > > after
> > > > the
> > > > > > > first EXTERNAL commit (commitWithin does NOT work here) after
> the
> > > > file
> > > > > > has
> > > > > > > been updated.
> > > > > > >
> > > > > > > In our case, the EFF was quite large - around 450MB - and we
> use
> > 16
> > > > > > shards,
> > > > > > > so when we triggered an external commit to force re-reading,
> the
> > > > whole
> > > > > > > system would block for several (10-15) minutes. This won't work
> > in
> > > a
> > > > > > > production environment. The reason for the size of the EFF is
> > that
> > > we
> > > > > > have
> > > > > > > around 7M documents in the index; each document has a 45
> > character
> > > > ID.
> > > > > > >
> > > > > > > We got some help to try to fix the problem so that the re-read
> of
> > > the
> > > > EFF
> > > > > > > proceeds in the background (see
> > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > > a fix on the 4.1 branch). However, even though the re-read
> > proceeds
> > > > in
> > > > > > the
> > > > > > > background, the time required to launch solr now takes at least
> > as
> > > > long
> > > > > > as
> > > > > > > re-reading the EFFs. Again, this is not good enough for our
> > needs.
> > > > > > >
> > > > > > > The next issue is that you cannot sort on EFF fields (though
> you
> > > can
> > > > > > return
> > > > > > > them as values using &fl=field(my_eff_field). This is also
> fixed
> > in
> > > > the
> > > > > > 4.1
> > > > > > > branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > > >
> > > > > > > So: Even after these fixes, EFF performance is not that great.
> > Our
> > > > > > solution
> > > > > > > is as follows: The actual value of the popularity measure (say,
> > > > reads)
> > > > > > that
> > > > > > > we want to report to the user is inserted into the search
> > response
> > > > > > > post-query by our query front-end. This value will then be the
> > > > > > > authoritative value at the time of the query. The value of the
> > > > popularity
> > > > > > > measure that we use for boosting in the ranking of the search
> > > results
> > > > is
> > > > > > > only updated when the value has changed enough so that the
> impact
> > > on
> > > > the
> > > > > > > boost will be significant (say, more than 2%). This does
> require
> > > > frequent
> > > > > > > re-indexing of the documents that have significant changes in
> the
> > > > number
> > > > > > of
> > > > > > > reads, but at least we won't have to update a document if it
> > moves
> > > > from,
> > > > > > > say, 1000000 to 1000001 reads.
> > > > > > >
> > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > >
> > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> > simoneg@apache.org
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > However,
> > > > in
> > > > > > > this
> > > > > > > > index, an external file field is used for popularity ranking.
> > > > > > > >
> > > > > > > > Does SolrCloud supports external file fields? How does it
> cope
> > > with
> > > > > > > > sharding and replication? Where should the external file be
> > > placed
> > > > now
> > > > > > > that
> > > > > > > > the index folder is not local but in the cloud?
> > > > > > > >
> > > > > > > > Are there otherwise other best practices to deal with the use
> > > cases
> > > > > > > > external file fields were used for, like popularity/ranking,
> in
> > > > > > > SolrCloud?
> > > > > > > > Custom ValueSources going to something external?
> > > > > > > >
> > > > > > > > Thanks in advance,
> > > > > > > > Simone
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sincerely yours
> > > > > > Mikhail Khludnev
> > > > > > Principal Engineer,
> > > > > > Grid Dynamics
> > > > > >
> > > > > > <http://www.griddynamics.com>
> > > > > >  <mk...@griddynamics.com>
> > > > > >
> > > >  20.11.2012 18:06 пользователь "Martin Koch" <ma...@issuu.com>
> написал:
> > > >
> > > > > Hi Mikhail
> > > > >
> > > > > Please see answers below.
> > > > >
> > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > mkhludnev@griddynamics.com> wrote:
> > > > >
> > > > > > Martin,
> > > > > >
> > > > > > Thank you for telling your own "war-story". It's really useful
> for
> > > > > > community.
> > > > > > The first question might seems not really conscious, but would
> you
> > > tell
> > > > > me
> > > > > > what blocks searching during EFF reload, when it's triggered by
> > > handler
> > > > > or
> > > > > > by listener?
> > > > > >
> > > > >
> > > > > We continuously index new documents using CommitWithin to get
> regular
> > > > > commits. However, we observed that the EFFs were not re-read, so we
> > had
> > > > to
> > > > > do external commits (curl '.../solr/update?commit=true') to force
> > > reload.
> > > > > When this is done, solr blocks. I can't tell you exactly why it's
> > doing
> > > > > that (it was related to SOLR-3985).
> > > > >
> > > > >
> > > > > > I don't really get the sentence about sequential commits and
> number
> > > of
> > > > > > cores. Do I get right that file is replicated via Zookeeper?
> > Doesn't
> > > it
> > > > > >
> > > > >
> > > > > Again, this is observed behavior. When we issue a commit on a
> system
> > > > with a
> > > > > system with many solr cores using EFFs, the system blocks for a
> long
> > > time
> > > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
> > > symlink
> > > > > from each cores index dir to the actual file, which is updated by
> an
> > > > > external process.
> > > > >
> > > > >
> > > > > > causes scalability problem or long time to reload? Will it help
> if
> > > > we'll
> > > > > > have, let's say ExternalDatabaseField which will pull values from
> > > jdbc.
> > > > > ie.
> > > > > >
> > > > >
> > > > > I think the possibility of having some fields being retrieved from
> an
> > > > > external, dynamically updatable store would be really interesting.
> > This
> > > > > could be JDBC, something in-memory like redis, or a NoSql product
> > (e.g.
> > > > > Cassandra).
> > > > >
> > > > >
> > > > > > why all cores can't read these values simultaneously?
> > > > > >
> > > > >
> > > > > Again, this is a solr implementation detail that I can't answer :)
> > > > >
> > > > >
> > > > > > Can you confirm that IDs in the file is ordered by the index term
> > > > order?
> > > > > >
> > > > >
> > > > > Yes, we sorted the files (standard UNIX sort).
> > > > >
> > > > >
> > > > > > AFAIK it can impact load time.
> > > > > >
> > > > > Yes, it does.
> > > > >
> > > > >
> > > > > > Regarding your post-query solution can you tell me if query found
> > > 10000
> > > > > > docs, but I need to display only first page with 100 rows,
> whether
> > I
> > > > need
> > > > > > to pull all 10K results to frontend to order them by the rank?
> > > > > >
> > > > > >
> > > > > In our architecture, the clients query an API that generates the
> SOLR
> > > > > query, retrieves the relevant additional fields that we needs, and
> > > > returns
> > > > > the relevant JSON to the front-end.
> > > > >
> > > > > In our use case, results are returned from SOLR by the 10's, not by
> > the
> > > > > 1000's, so it is a manageable job. Even so, if solr returned
> > thousands
> > > of
> > > > > results, it would be up to the implementation of the api to augment
> > > only
> > > > > the results that needed to be returned to the front-end.
> > > > >
> > > > > Even so, patching up a JSON structure with 10000 results should be
> > > > > possible.
> > > > >
> > > > >
> > > > > > I'm really appreciate if you comment on the questions above.
> > > > > > PS: It's time to pitch, how much
> > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > ExternalFileField" can help you?
> > > > > >
> > > > > >
> > > > > > It looks very interesting :) Does it make it possible to avoid
> > > > re-reading
> > > > > the EFF on every commit, and only re-read the values that have
> > actually
> > > > > changed?
> > > > >
> > > > > /Martin
> > > > >
> > > > >
> > > > > >
> > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com>
> > wrote:
> > > > > >
> > > > > > > Solr 4.0 does support using EFFs, but it might not give you
> what
> > > > you're
> > > > > > > > hoping for.
> > > > > > >
> > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > >
> > > > > > > The EFF is placed in the parent of the index directory in each
> > > core;
> > > > > each
> > > > > > > core reads the entire EFF and picks out the IDs that it is
> > > > responsible
> > > > > > for.
> > > > > > >
> > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
> answer
> > > > > > queries)
> > > > > > > while re-reading the EFF. Even worse, it seems that the time to
> > > > re-read
> > > > > > the
> > > > > > > EFF is multiplied by the number of cores in use (i.e. the EFF
> is
> > > > > re-read
> > > > > > by
> > > > > > > each core sequentially). The contents of the EFF become active
> > > after
> > > > > the
> > > > > > > first EXTERNAL commit (commitWithin does NOT work here) after
> the
> > > > file
> > > > > > has
> > > > > > > been updated.
> > > > > > >
> > > > > > > In our case, the EFF was quite large - around 450MB - and we
> use
> > 16
> > > > > > shards,
> > > > > > > so when we triggered an external commit to force re-reading,
> the
> > > > whole
> > > > > > > system would block for several (10-15) minutes. This won't work
> > in
> > > a
> > > > > > > production environment. The reason for the size of the EFF is
> > that
> > > we
> > > > > > have
> > > > > > > around 7M documents in the index; each document has a 45
> > character
> > > > ID.
> > > > > > >
> > > > > > > We got some help to try to fix the problem so that the re-read
> of
> > > the
> > > > > EFF
> > > > > > > proceeds in the background (see
> > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > > a fix on the 4.1 branch). However, even though the re-read
> > proceeds
> > > > in
> > > > > > the
> > > > > > > background, the time required to launch solr now takes at least
> > as
> > > > long
> > > > > > as
> > > > > > > re-reading the EFFs. Again, this is not good enough for our
> > needs.
> > > > > > >
> > > > > > > The next issue is that you cannot sort on EFF fields (though
> you
> > > can
> > > > > > return
> > > > > > > them as values using &fl=field(my_eff_field). This is also
> fixed
> > in
> > > > the
> > > > > > 4.1
> > > > > > > branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > > >
> > > > > > > So: Even after these fixes, EFF performance is not that great.
> > Our
> > > > > > solution
> > > > > > > is as follows: The actual value of the popularity measure (say,
> > > > reads)
> > > > > > that
> > > > > > > we want to report to the user is inserted into the search
> > response
> > > > > > > post-query by our query front-end. This value will then be the
> > > > > > > authoritative value at the time of the query. The value of the
> > > > > popularity
> > > > > > > measure that we use for boosting in the ranking of the search
> > > results
> > > > > is
> > > > > > > only updated when the value has changed enough so that the
> impact
> > > on
> > > > > the
> > > > > > > boost will be significant (say, more than 2%). This does
> require
> > > > > frequent
> > > > > > > re-indexing of the documents that have significant changes in
> the
> > > > > number
> > > > > > of
> > > > > > > reads, but at least we won't have to update a document if it
> > moves
> > > > > from,
> > > > > > > say, 1000000 to 1000001 reads.
> > > > > > >
> > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > >
> > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> > simoneg@apache.org
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > However,
> > > > in
> > > > > > > this
> > > > > > > > index, an external file field is used for popularity ranking.
> > > > > > > >
> > > > > > > > Does SolrCloud supports external file fields? How does it
> cope
> > > with
> > > > > > > > sharding and replication? Where should the external file be
> > > placed
> > > > > now
> > > > > > > that
> > > > > > > > the index folder is not local but in the cloud?
> > > > > > > >
> > > > > > > > Are there otherwise other best practices to deal with the use
> > > cases
> > > > > > > > external file fields were used for, like popularity/ranking,
> in
> > > > > > > SolrCloud?
> > > > > > > > Custom ValueSources going to something external?
> > > > > > > >
> > > > > > > > Thanks in advance,
> > > > > > > > Simone
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sincerely yours
> > > > > > Mikhail Khludnev
> > > > > > Principal Engineer,
> > > > > > Grid Dynamics
> > > > > >
> > > > > > <http://www.griddynamics.com>
> > > > > >  <mk...@griddynamics.com>
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> >  <mk...@griddynamics.com>
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

Mikhail

I appreciate your input, it's very useful :)

On Wed, Nov 21, 2012 at 6:30 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Martin,
> This deployment seems a little bit confusing to me. You have a 16-way, fairly
> virtual "box", and send 16 requests for a really heavy operation at the same
> moment, so it does not surprise me that you are losing it for some period of
> time. At that time you should have more than 16 in load average metrics.
> I suggest sending the commit to those cores one by one and accepting
> inconsistency and some sort of blinking as a trade-off for availability. In
> this case only a single virtual CPU will be fully consumed by the commit's
> _thread divergence action_ and the others will serve requests.
>

I wasn't aware until now that it is possible to send a commit to one core
only. What we observed was the effect of curl
localhost:8080/solr/update?commit=true but perhaps we should experiment
with solr/coreN/update?commit=true. A quick trial run seems to indicate
that a commit to a single core causes commits on all cores.
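
A minimal sketch of that one-core-at-a-time experiment, assuming the cores
really are reachable as solr/coreN/update. The core names and both function
names are hypothetical, and as noted above a commit to one core may still
trigger commits on all cores, so this only helps if that turns out not to be
the case.

```python
from typing import Callable, Iterable, List
from urllib.request import urlopen


def core_commit_urls(base_url: str, cores: Iterable[str]) -> List[str]:
    """Build one explicit-commit URL per core (core names are assumptions)."""
    base = base_url.rstrip("/")
    return [f"{base}/{core}/update?commit=true" for core in cores]


def commit_one_by_one(urls: Iterable[str], send: Callable[[str], None]) -> List[str]:
    """Send the commits sequentially: wait for each one before starting the next."""
    done = []
    for url in urls:
        send(url)  # returns only when this core has answered
        done.append(url)
    return done


def http_get(url: str) -> None:
    """Plain HTTP GET, the same request the curl call above makes."""
    with urlopen(url) as response:
        response.read()
```

To try it, build the list with something like
core_commit_urls("http://localhost:8080/solr", ["core0", "core1"]) and pass
it to commit_one_by_one together with http_get.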


Perhaps I should clarify that we are using SOLR as a black box; we do not
touch the code at all - we only install the distribution WAR file and
proceed from there.


> Also from my POV such deployments should start at least from *16* 4-way
> vboxes, it's more expensive, but should be much better available during
> cpu-consuming operations.
>

Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with
16 cores? Or am I misunderstanding something :) ?


> Other details: if you use a single jetty for all of them, are you sure that
> jetty's threadpool doesn't limit requests? Is it large enough?
> You have 60G and set -Xmx10G. Are you sure that the total size of the cores'
> index directories is less than 45G?
>

The total index size is 230 GB, so it won't fit in RAM, but we're using an
SSD to minimize disk access time. We have tried putting the EFF onto a
RAM disk, but this didn't have a measurable effect.

Thanks,
/Martin


> Thanks
>
>
> On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com> wrote:
>
> > Mikhail
> >
> > PSB
> >
> > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > mkhludnev@griddynamics.com> wrote:
> >
> > > Martin,
> > >
> > > Please find additional question from me below.
> > >
> > > Simone,
> > >
> > > I'm sorry for hijacking your thread. The only thing I've heard about it at
> > > recent ApacheCon sessions is that Zookeeper is supposed to replicate those
> > > files as configs under solr home. And I'm really looking forward to knowing
> > > how it works with huge files in production.
> > >
> > > Thank You, Guys!
> > >
> > > On 20.11.2012 18:06, "Martin Koch" <ma...@issuu.com> wrote:
> > > >
> > > > Hi Mikhail
> > > >
> > > > Please see answers below.
> > > >
> > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > mkhludnev@griddynamics.com> wrote:
> > > >
> > > > > Martin,
> > > > >
> > > > > Thank you for telling your own "war-story". It's really useful for
> > > > > community.
> > > > > The first question might seems not really conscious, but would you
> > tell
> > > me
> > > > > what blocks searching during EFF reload, when it's triggered by
> > handler
> > > or
> > > > > by listener?
> > > > >
> > > >
> > > > We continuously index new documents using CommitWithin to get regular
> > > > commits. However, we observed that the EFFs were not re-read, so we
> had
> > > to
> > > > do external commits (curl '.../solr/update?commit=true') to force
> > reload.
> > > > When this is done, solr blocks. I can't tell you exactly why it's
> doing
> > > > that (it was related to SOLR-3985).
> > >
> > > Is there a chance to get a thread dump when they are blocked?
> > >
> > >
> > Well I could try to recreate the situation. But the setup is fairly simple:
> > Create a large EFF in a largeish index with many shards. Issue a commit,
> > and then try to do a search. Solr will not respond to the search before the
> > commit has completed, and this will take a long time.
> >
> >
> > >
> > > >
> > > >
> > > > > I don't really get the sentence about sequential commits and number
> > of
> > > > > cores. Do I get right that file is replicated via Zookeeper?
> Doesn't
> > it
> > > > >
> > > >
> > > > Again, this is observed behavior. When we issue a commit on a system
> > with
> > > a
> > > > system with many solr cores using EFFs, the system blocks for a long
> > time
> > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
> > symlink
> > > > from each cores index dir to the actual file, which is updated by an
> > > > external process.
> > >
> > > Hold on, I asked about Zookeeper because the subj mentions SolrCloud.
> > >
> > > Do you use SolrCloud, SolrShards, or are these cores just replicas of the
> > > same index?
> > >
> >
> > Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit
> > unsure about the terminology here, but we've got a single index divided
> > into 16 shards. Each shard is hosted in a solr core.
> >
> >
> > > Also, about the symlink - don't you share that file via some NFS?
> > >
> > No, we generate the EFF on the local solr host (there is only one physical
> > host that holds all shards), so there is no need for NFS or copying files
> > around. No need for Zookeeper either.
> >
> >
> > > how many cores you run per box?
> > >
> > This box is a 16-virtual core (8 hyperthreaded cores) with 60GB of RAM. We
> > run 16 solr cores on this box in Jetty.
> >
> >
> > > Do the boxes have plenty of RAM to cache the filesystem besides the JVM heaps?
> > >
> > Yes. We've allocated 10GB for jetty, and left the rest for the OS.
> >
> >
> > > I assume you use 64 bit linux and mmap directory. Please confirm that.
> > >
> > >
> > We use 64-bit linux. I'm not sure about the mmap directory or where that
> > would be configured in solr - can you explain that?
> >
> > >
> > > >
> > > >
> > > > > causes scalability problem or long time to reload? Will it help if
> > > we'll
> > > > > have, let's say ExternalDatabaseField which will pull values from
> > jdbc.
> > > ie.
> > > > >
> > > >
> > > > I think the possibility of having some fields being retrieved from an
> > > > external, dynamically updatable store would be really interesting.
> This
> > > > could be JDBC, something in-memory like redis, or a NoSql product
> (e.g.
> > > > Cassandra).
> > >
> > > Ok. Let's have it in mind as a possible direction.
> > >
> >
> > Alternatively, an API that would allow updating a single field for a
> > document might be an option.
> >
> >
> > >
> > > >
> > > >
> > > > > why all cores can't read these values simultaneously?
> > > > >
> > > >
> > > > Again, this is a solr implementation detail that I can't answer :)
> > > >
> > > >
> > > > > Can you confirm that IDs in the file is ordered by the index term
> > > order?
> > > > >
> > > >
> > > > Yes, we sorted the files (standard UNIX sort).
> > > >
> > > >
> > > > > AFAIK it can impact load time.
> > > > >
> > > > Yes, it does
> > >
> > > Ok, I've got that you are aware of it, and your IDs are just strings, not
> > > integers.
> > >
> > >
> > Yes, ids are strings.
> >
> > >
> > > >
> > > >
> > > > > Regarding your post-query solution can you tell me if query found
> > 10000
> > > > > docs, but I need to display only first page with 100 rows, whether
> I
> > > need
> > > > > to pull all 10K results to frontend to order them by the rank?
> > > > >
> > > > >
> > > > In our architecture, the clients query an API that generates the SOLR
> > > > query, retrieves the relevant additional fields that we needs, and
> > > returns
> > > > the relevant JSON to the front-end.
> > > >
> > > > In our use case, results are returned from SOLR by the 10's, not by
> the
> > > > 1000's, so it is a manageable job. Even so, if solr returned
> thousands
> > of
> > > > results, it would be up to the implementation of the api to augment
> > only
> > > > the results that needed to be returned to the front-end.
> > > >
> > > > Even so, patching up a JSON structure with 10000 results should be
> > > > possible.
> > >
> > > You are right. I'm concerned anyway because retrieving whole result is
> > > expensive, and not always possible.
> > >
> > >
> > In our case, getting the whole result is almost impossible, because that
> > would be millions of documents, and returning the Nth result seems to be a
> > quadratic (or worse) operation in SOLR.
> >
> > >
> > > >
> > > >
> > > > > I'm really appreciate if you comment on the questions above.
> > > > > PS: It's time to pitch, how much
> > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > ExternalFileField" can help you?
> > > > >
> > > > >
> > > > > It looks very interesting :) Does it make it possible to avoid
> > > re-reading
> > > > the EFF on every commit, and only re-read the values that have
> actually
> > > > changed?
> > >
> > >
> > > You don't need a commit (in SOLR-4085) to reload the file content, but after
> > > a commit you need to read the whole file and scan all key terms and postings.
> > > That's because EFF sits on top of the top-level searcher. It's a Solr-like
> > > way. In some future we might have per-segment EFF; in this case adding a
> > > segment will trigger a full file scan, but in the index only that new segment
> > > will be scanned. It should be faster. You know, straightforward sharing of
> > > internal data structures between different index views/generations is not
> > > possible.
> > > If you are asking about applying delta changes on the external file, that's
> > > something we did ourselves: http://goo.gl/P8GFq . This feature is much
> > > more doubtful and vague, although it might be the next contribution after
> > > SOLR-4085.
> > >
> > > >
> > > > /Martin
> > > >
> > > >
> > > > >
> > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com>
> wrote:
> > > > >
> > > > > > Solr 4.0 does support using EFFs, but it might not give you what
> > > you're
> > > > > > > hoping for.
> > > > > >
> > > > > > We tried using Solr Cloud, and have given up again.
> > > > > >
> > > > > > The EFF is placed in the parent of the index directory in each
> > core;
> > > each
> > > > > > core reads the entire EFF and picks out the IDs that it is
> > > responsible
> > > > > for.
> > > > > >
> > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't answer
> > > > > queries)
> > > > > > while re-reading the EFF. Even worse, it seems that the time to
> > > re-read
> > > > > the
> > > > > > EFF is multiplied by the number of cores in use (i.e. the EFF is
> > > re-read
> > > > > by
> > > > > > each core sequentially). The contents of the EFF become active
> > after
> > > the
> > > > > > first EXTERNAL commit (commitWithin does NOT work here) after the
> > > file
> > > > > has
> > > > > > been updated.
> > > > > >
> > > > > > In our case, the EFF was quite large - around 450MB - and we use
> 16
> > > > > shards,
> > > > > > so when we triggered an external commit to force re-reading, the
> > > whole
> > > > > > system would block for several (10-15) minutes. This won't work
> in
> > a
> > > > > > production environment. The reason for the size of the EFF is
> that
> > we
> > > > > have
> > > > > > around 7M documents in the index; each document has a 45
> character
> > > ID.
> > > > > >
> > > > > > We got some help to try to fix the problem so that the re-read of
> > the
> > > EFF
> > > > > > proceeds in the background (see
> > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > a fix on the 4.1 branch). However, even though the re-read
> proceeds
> > > in
> > > > > the
> > > > > > background, the time required to launch solr now takes at least
> as
> > > long
> > > > > as
> > > > > > re-reading the EFFs. Again, this is not good enough for our
> needs.
> > > > > >
> > > > > > The next issue is that you cannot sort on EFF fields (though you
> > can
> > > > > return
> > > > > > them as values using &fl=field(my_eff_field). This is also fixed
> in
> > > the
> > > > > 4.1
> > > > > > branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > >
> > > > > > So: Even after these fixes, EFF performance is not that great.
> Our
> > > > > solution
> > > > > > is as follows: The actual value of the popularity measure (say,
> > > reads)
> > > > > that
> > > > > > we want to report to the user is inserted into the search
> response
> > > > > > post-query by our query front-end. This value will then be the
> > > > > > authoritative value at the time of the query. The value of the
> > > popularity
> > > > > > measure that we use for boosting in the ranking of the search
> > results
> > > is
> > > > > > only updated when the value has changed enough so that the impact
> > on
> > > the
> > > > > > boost will be significant (say, more than 2%). This does require
> > > frequent
> > > > > > re-indexing of the documents that have significant changes in the
> > > number
> > > > > of
> > > > > > reads, but at least we won't have to update a document if it
> moves
> > > from,
> > > > > > say, 1000000 to 1000001 reads.
> > > > > >
> > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > >
> > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> simoneg@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > However,
> > > in
> > > > > > this
> > > > > > > index, an external file field is used for popularity ranking.
> > > > > > >
> > > > > > > Does SolrCloud supports external file fields? How does it cope
> > with
> > > > > > > sharding and replication? Where should the external file be
> > placed
> > > now
> > > > > > that
> > > > > > > the index folder is not local but in the cloud?
> > > > > > >
> > > > > > > Are there otherwise other best practices to deal with the use
> > cases
> > > > > > > external file fields were used for, like popularity/ranking, in
> > > > > > SolrCloud?
> > > > > > > Custom ValueSources going to something external?
> > > > > > >
> > > > > > > Thanks in advance,
> > > > > > > Simone
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > > > > Principal Engineer,
> > > > > Grid Dynamics
> > > > >
> > > > > <http://www.griddynamics.com>
> > > > >  <mk...@griddynamics.com>
> > > > >
> > >  On 20.11.2012 18:06, "Martin Koch" <ma...@issuu.com> wrote:
> > >
> > > > Hi Mikhail
> > > >
> > > > Please see answers below.
> > > >
> > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > mkhludnev@griddynamics.com> wrote:
> > > >
> > > > > Martin,
> > > > >
> > > > > Thank you for telling your own "war-story". It's really useful for
> > > > > community.
> > > > > The first question might seems not really conscious, but would you
> > tell
> > > > me
> > > > > what blocks searching during EFF reload, when it's triggered by
> > handler
> > > > or
> > > > > by listener?
> > > > >
> > > >
> > > > We continuously index new documents using CommitWithin to get regular
> > > > commits. However, we observed that the EFFs were not re-read, so we
> had
> > > to
> > > > do external commits (curl '.../solr/update?commit=true') to force
> > reload.
> > > > When this is done, solr blocks. I can't tell you exactly why it's
> doing
> > > > that (it was related to SOLR-3985).
> > > >
> > > >
> > > > > I don't really get the sentence about sequential commits and number
> > of
> > > > > cores. Do I get right that file is replicated via Zookeeper?
> Doesn't
> > it
> > > > >
> > > >
> > > > Again, this is observed behavior. When we issue a commit on a system
> > > with a
> > > > system with many solr cores using EFFs, the system blocks for a long
> > time
> > > > (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
> > symlink
> > > > from each cores index dir to the actual file, which is updated by an
> > > > external process.
> > > >
> > > >
> > > > > causes scalability problem or long time to reload? Will it help if
> > > we'll
> > > > > have, let's say ExternalDatabaseField which will pull values from
> > jdbc.
> > > > ie.
> > > > >
> > > >
> > > > I think the possibility of having some fields being retrieved from an
> > > > external, dynamically updatable store would be really interesting.
> This
> > > > could be JDBC, something in-memory like redis, or a NoSql product
> (e.g.
> > > > Cassandra).
> > > >
> > > >
> > > > > why all cores can't read these values simultaneously?
> > > > >
> > > >
> > > > Again, this is a solr implementation detail that I can't answer :)
> > > >
> > > >
> > > > > Can you confirm that IDs in the file is ordered by the index term
> > > order?
> > > > >
> > > >
> > > > Yes, we sorted the files (standard UNIX sort).
> > > >
> > > >
> > > > > AFAIK it can impact load time.
> > > > >
> > > > Yes, it does.
> > > >
> > > >
> > > > > Regarding your post-query solution can you tell me if query found
> > 10000
> > > > > docs, but I need to display only first page with 100 rows, whether
> I
> > > need
> > > > > to pull all 10K results to frontend to order them by the rank?
> > > > >
> > > > >
> > > > In our architecture, the clients query an API that generates the SOLR
> > > > query, retrieves the relevant additional fields that we needs, and
> > > returns
> > > > the relevant JSON to the front-end.
> > > >
> > > > In our use case, results are returned from SOLR by the 10's, not by
> the
> > > > 1000's, so it is a manageable job. Even so, if solr returned
> thousands
> > of
> > > > results, it would be up to the implementation of the api to augment
> > only
> > > > the results that needed to be returned to the front-end.
> > > >
> > > > Even so, patching up a JSON structure with 10000 results should be
> > > > possible.
> > > >
> > > >
> > > > > I'm really appreciate if you comment on the questions above.
> > > > > PS: It's time to pitch, how much
> > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > ExternalFileField" can help you?
> > > > >
> > > > >
> > > > > It looks very interesting :) Does it make it possible to avoid
> > > re-reading
> > > > the EFF on every commit, and only re-read the values that have
> actually
> > > > changed?
> > > >
> > > > /Martin
> > > >
> > > >
> > > > >
> > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com>
> wrote:
> > > > >
> > > > > > Solr 4.0 does support using EFFs, but it might not give you what
> > > you're
> > > > > > > hoping for.
> > > > > >
> > > > > > We tried using Solr Cloud, and have given up again.
> > > > > >
> > > > > > The EFF is placed in the parent of the index directory in each
> > core;
> > > > each
> > > > > > core reads the entire EFF and picks out the IDs that it is
> > > responsible
> > > > > for.
> > > > > >
> > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't answer
> > > > > queries)
> > > > > > while re-reading the EFF. Even worse, it seems that the time to
> > > re-read
> > > > > the
> > > > > > EFF is multiplied by the number of cores in use (i.e. the EFF is
> > > > re-read
> > > > > by
> > > > > > each core sequentially). The contents of the EFF become active
> > after
> > > > the
> > > > > > first EXTERNAL commit (commitWithin does NOT work here) after the
> > > file
> > > > > has
> > > > > > been updated.
> > > > > >
> > > > > > In our case, the EFF was quite large - around 450MB - and we use
> 16
> > > > > shards,
> > > > > > so when we triggered an external commit to force re-reading, the
> > > whole
> > > > > > system would block for several (10-15) minutes. This won't work
> in
> > a
> > > > > > production environment. The reason for the size of the EFF is
> that
> > we
> > > > > have
> > > > > > around 7M documents in the index; each document has a 45
> character
> > > ID.
> > > > > >
> > > > > > We got some help to try to fix the problem so that the re-read of
> > the
> > > > EFF
> > > > > > proceeds in the background (see
> > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > a fix on the 4.1 branch). However, even though the re-read
> proceeds
> > > in
> > > > > the
> > > > > > background, the time required to launch solr now takes at least
> as
> > > long
> > > > > as
> > > > > > re-reading the EFFs. Again, this is not good enough for our
> needs.
> > > > > >
> > > > > > The next issue is that you cannot sort on EFF fields (though you
> > can
> > > > > return
> > > > > > them as values using &fl=field(my_eff_field). This is also fixed
> in
> > > the
> > > > > 4.1
> > > > > > branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > >
> > > > > > So: Even after these fixes, EFF performance is not that great.
> Our
> > > > > solution
> > > > > > is as follows: The actual value of the popularity measure (say,
> > > reads)
> > > > > that
> > > > > > we want to report to the user is inserted into the search
> response
> > > > > > post-query by our query front-end. This value will then be the
> > > > > > authoritative value at the time of the query. The value of the
> > > > popularity
> > > > > > measure that we use for boosting in the ranking of the search
> > results
> > > > is
> > > > > > only updated when the value has changed enough so that the impact
> > on
> > > > the
> > > > > > boost will be significant (say, more than 2%). This does require
> > > > frequent
> > > > > > re-indexing of the documents that have significant changes in the
> > > > number
> > > > > of
> > > > > > reads, but at least we won't have to update a document if it
> moves
> > > > from,
> > > > > > say, 1000000 to 1000001 reads.
> > > > > >
> > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > >
> > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <
> simoneg@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > However,
> > > in
> > > > > > this
> > > > > > > index, an external file field is used for popularity ranking.
> > > > > > >
> > > > > > > Does SolrCloud supports external file fields? How does it cope
> > with
> > > > > > > sharding and replication? Where should the external file be
> > placed
> > > > now
> > > > > > that
> > > > > > > the index folder is not local but in the cloud?
> > > > > > >
> > > > > > > Are there otherwise other best practices to deal with the use
> > cases
> > > > > > > external file fields were used for, like popularity/ranking, in
> > > > > > SolrCloud?
> > > > > > > Custom ValueSources going to something external?
> > > > > > >
> > > > > > > Thanks in advance,
> > > > > > > Simone
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > > > > Principal Engineer,
> > > > > Grid Dynamics
> > > > >
> > > > > <http://www.griddynamics.com>
> > > > >  <mk...@griddynamics.com>
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>

Re: SolrCloud and exernal file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Martin,
This deployment seems a little bit confusing to me. You have a 16-way, fairly
virtual "box", and you send 16 requests for a really heavy operation at the
same moment, so it does not surprise me that you lose it for some period of
time. During that time you should see a load average above 16.
I suggest sending the commit to those cores one by one, accepting some
inconsistency and some sort of blinking as a trade-off for availability. In
that case only a single virtual CPU will be fully consumed by the commit's
_thread divergence action_, and the others will keep serving requests.
Also, from my POV such deployments should start from at least *16* 4-way
vboxes. It's more expensive, but availability should be much better during
CPU-consuming operations.
Other details: if you use a single jetty for all of them, are you sure that
jetty's threadpool doesn't limit requests? Is it large enough?
You have 60G and set -Xmx10G. Are you sure that the total size of the cores'
index directories is less than 45G?
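
The one-by-one commit suggested above could be scripted roughly as follows.
This is only a sketch, assuming plain Solr cores reachable at
<base>/<core>/update where a commit sent to one core stays local to that core.
The base URL, the core names and the pause_s delay are hypothetical, and
whether the EFF reload has finished by the time the commit returns depends on
the Solr version and configuration, so the pause is only a crude safeguard.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def commit_url(base_url, core):
    """Build the explicit-commit URL for a single core."""
    query = urlencode({"commit": "true"})
    return "%s/%s/update?%s" % (base_url.rstrip("/"), core, query)


def http_get(url):
    """Default sender: blocks until Solr has answered the commit request."""
    with urlopen(url) as response:
        return response.status


def commit_one_by_one(base_url, cores, send=http_get, pause_s=0.0):
    """Commit each core in turn, so only one core is busy committing at a time."""
    statuses = {}
    for core in cores:
        statuses[core] = send(commit_url(base_url, core))
        time.sleep(pause_s)  # optional breathing room between cores
    return statuses
```

With 16 cores named, say, shard1..shard16, the call would be
commit_one_by_one("http://localhost:8983/solr", names, pause_s=1.0). The send
parameter exists only so the loop can be tried without a running Solr.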

Thanks





-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

On Wed, Nov 21, 2012 at 7:08 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com> wrote:
>
> >  I'm not sure about the mmap directory or where that
> > would be configured in solr - can you explain that?
> >
>
> You can check it at Solr Admin/Statistics/core/searcher/stats/readerDir;
> it should be org.apache.lucene.store.MMapDirectory
>

It says
'org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@'

/Martin

--
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mk...@griddynamics.com>
>

Re: SolrCloud and exernal file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <ma...@issuu.com> wrote:

>  I'm not sure about the mmap directory or where that
> would be configured in solr - can you explain that?
>

You can check it at Solr Admin/Statistics/core/searcher/stats/readerDir;
it should be org.apache.lucene.store.MMapDirectory

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

Mikhail

PSB

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Martin,
>
> Please find additional questions from me below.
>
> Simone,
>
> I'm sorry for hijacking your thread. The only thing I've heard about it at
> recent ApacheCon sessions is that Zookeeper is supposed to replicate those
> files as configs under solr home. And I'm really looking forward to learning
> how it works with huge files in production.
>
> Thank You, Guys!
>
> On 20.11.2012 18:06, "Martin Koch" <ma...@issuu.com> wrote:
> >
> > Hi Mikhail
> >
> > Please see answers below.
> >
> > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > mkhludnev@griddynamics.com> wrote:
> >
> > > Martin,
> > >
> > > Thank you for telling your own "war-story". It's really useful for the
> > > community.
> > > The first question might seem naive, but could you tell me what blocks
> > > searching during EFF reload, when it's triggered by a handler or by a
> > > listener?
> > >
> >
> > We continuously index new documents using CommitWithin to get regular
> > commits. However, we observed that the EFFs were not re-read, so we had
> > to do external commits (curl '.../solr/update?commit=true') to force
> > reload. When this is done, solr blocks. I can't tell you exactly why it's
> > doing that (it was related to SOLR-3985).
>
> Is there a chance to get a thread dump when they are blocked?
>
>
Well I could try to recreate the situation. But the setup is fairly simple:
Create a large EFF in a largeish index with many shards. Issue a commit,
and then try to do a search. Solr will not respond to the search before the
commit has completed, and this will take a long time.


>
> >
> >
> > > I don't really get the sentence about sequential commits and number of
> > > cores. Do I get it right that the file is replicated via Zookeeper?
> > > Doesn't it
> > >
> >
> > Again, this is observed behavior. When we issue a commit on a system with
> > many solr cores using EFFs, the system blocks for a long time
> > (15 minutes).  We do NOT use zookeeper for anything. The EFF is a symlink
> > from each core's index dir to the actual file, which is updated by an
> > external process.
>
> Hold on, I asked about Zookeeper because the subject mentions SolrCloud.
>
> Do you use SolrCloud, SolrShards, or are these cores just replicas of the
> same index?
>

Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit
unsure about the terminology here, but we've got a single index divided
into 16 shards. Each shard is hosted in a solr core.


> Also, about the symlink - don't you share that file via some NFS?
>

No, we generate the EFF on the local solr host (there is only one physical
host that holds all shards), so there is no need for NFS or copying files
around. No need for Zookeeper either.


> How many cores do you run per box?
>
This box is a 16-virtual core (8 hyperthreaded cores)  with 60GB of RAM. We
run 16 solr cores on this box in Jetty.


> Do the boxes have plenty of RAM to cache the filesystem besides the JVM heaps?
>

Yes. We've allocated 10GB for jetty, and left the rest for the OS.


> I assume you use 64 bit linux and mmap directory. Please confirm that.
>
>
We use 64-bit linux. I'm not sure about the mmap directory or where that
would be configured in solr - can you explain that?

>
> >
> >
> > > cause a scalability problem or a long time to reload? Would it help if
> > > we had, let's say, an ExternalDatabaseField which would pull values from
> > > jdbc, i.e.
> > >
> >
> > I think the possibility of having some fields being retrieved from an
> > external, dynamically updatable store would be really interesting. This
> > could be JDBC, something in-memory like redis, or a NoSql product (e.g.
> > Cassandra).
>
> Ok. Let's have it in mind as a possible direction.
>

Alternatively, an API that would allow updating a single field for a
document might be an option.
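
For reference, Solr 4.0 documents a feature along these lines ("atomic
updates"), where the update message names a single field together with a
modifier such as set, add or inc. A minimal sketch of building such a message
follows. The field name and document ID are made up, and the documented caveat
is that Solr still rewrites the whole document internally and needs the other
fields to be stored, so this is not a cheap in-place update.

```python
import json


def atomic_update(doc_id, field, value, modifier="set"):
    """Return the JSON body for an atomic update of one field of one document."""
    if modifier not in ("set", "add", "inc"):
        raise ValueError("unsupported modifier: %r" % modifier)
    return json.dumps([{"id": doc_id, field: {modifier: value}}])


# Hypothetical document ID and field name, for illustration only.
body = atomic_update("doc-42", "popularity", 1000001)
print(body)  # to be POSTed to .../solr/update as application/json
```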


>
> >
> >
> > > Why can't all cores read these values simultaneously?
> > >
> >
> > Again, this is a solr implementation detail that I can't answer :)
> >
> >
> > > Can you confirm that the IDs in the file are ordered by the index term
> > > order?
> > >
> >
> > Yes, we sorted the files (standard UNIX sort).
> >
> >
> > > AFAIK it can impact load time.
> > >
> > Yes, it does
>
> Ok, I've got that you are aware of it, and your IDs are just strings, not
> integers.
>
>
Yes, ids are strings.
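
As an illustration of the sorting step, here is a minimal sketch of how such a
file could be generated. It assumes the usual key=value line format of an
external file, and it sorts keys by their UTF-8 bytes on the assumption that
this matches the index term order for string IDs (the same order that
LC_ALL=C sort produces). The IDs and values are invented.

```python
def eff_lines(values):
    """Return 'key=value' lines sorted by the UTF-8 bytes of the key."""
    ordered = sorted(values.items(), key=lambda item: item[0].encode("utf-8"))
    return ["%s=%s" % (key, value) for key, value in ordered]


def write_eff(path, values):
    """Write the external file in one pass; returns the number of lines."""
    lines = eff_lines(values)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)


# Invented read counts keyed by document ID.
reads = {"doc-b": 17.0, "doc-a": 1000001.0, "Doc-c": 3.0}
for line in eff_lines(reads):
    print(line)
```

Note that byte order puts upper-case IDs before lower-case ones, which a
locale-aware sort would not necessarily do.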

>
> >
> >
> > > Regarding your post-query solution: if a query finds 10000 docs but I
> > > only need to display the first page with 100 rows, do I need to pull
> > > all 10K results to the frontend to order them by the rank?
> > >
> > >
> > In our architecture, the clients query an API that generates the SOLR
> > query, retrieves the relevant additional fields that we need, and
> > returns the relevant JSON to the front-end.
> >
> > In our use case, results are returned from SOLR by the 10's, not by the
> > 1000's, so it is a manageable job. Even so, if solr returned thousands of
> > results, it would be up to the implementation of the api to augment only
> > the results that needed to be returned to the front-end.
> >
> > Even so, patching up a JSON structure with 10000 results should be
> > possible.
>
> You are right. I'm concerned anyway because retrieving the whole result is
> expensive, and not always possible.
>
>
In our case, getting the whole result is almost impossible, because that
would be millions of documents, and returning the Nth result seems to be a
quadratic (or worse) operation in SOLR.
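
To make the post-query approach concrete: only the page that is actually
returned needs to be patched. A rough sketch, assuming the Solr response has
already been parsed into a list of dicts and that the authoritative counts
live in some external store (a plain dict stands in for it here). The field
name "reads", the query and the IDs are hypothetical.

```python
def page_params(query, page, rows=10):
    """Solr paging parameters for a zero-based page number."""
    return {"q": query, "start": page * rows, "rows": rows, "wt": "json"}


def augment(docs, current_reads, field="reads"):
    """Copy the authoritative count into each returned doc, keeping Solr's order."""
    patched = []
    for doc in docs:
        enriched = dict(doc)  # do not mutate the parsed response
        enriched[field] = current_reads.get(doc["id"], 0)
        patched.append(enriched)
    return patched


docs = [{"id": "doc-a"}, {"id": "doc-b"}]  # one page of Solr results
store = {"doc-a": 1000001, "doc-b": 17}    # stand-in for the external store
print(page_params("title:solr", page=0))
print(augment(docs, store))
```

The ranking itself still comes from the (less frequently updated) boost value
in the index; the patch only changes what is displayed.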

>
> >
> >
> > > I'm really appreciate if you comment on the questions above.
> > > PS: It's time to pitch, how much
> > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > ExternalFileField" can help you?
> > >
> > >
> > It looks very interesting :) Does it make it possible to avoid re-reading
> > the EFF on every commit, and only re-read the values that have actually
> > changed?
>
>
> You don't need a commit (with SOLR-4085) to reload the file content, but
> after a commit you still need to read the whole file and scan all key terms
> and postings. That's because the EFF sits on top of the top-level searcher;
> it's the Solr-like way. At some point in the future we might have a
> per-segment EFF. In that case adding a segment will still trigger a full
> file scan, but in the index only the new segment will be scanned, so it
> should be faster. As you know, straightforward sharing of internal data
> structures between different index views/generations is not possible.
> If you are asking about applying delta changes to the external file, that's
> something we did ourselves: http://goo.gl/P8GFq . That feature is much
> more doubtful and vague, although it might be the next contribution after
> SOLR-4085.
>
> >
> > /Martin
> >
> >
> > >
> > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com> wrote:
> > >
> > > > > Solr 4.0 does support using EFFs, but it might not give you what
> > > > > you're hoping for.
> > > >
> > > > We tried using Solr Cloud, and have given up again.
> > > >
> > > > The EFF is placed in the parent of the index directory in each core;
> each
> > > > core reads the entire EFF and picks out the IDs that it is
> responsible
> > > for.
> > > >
> > > > In the current 4.0.0 release of solr, solr blocks (doesn't answer
> > > queries)
> > > > while re-reading the EFF. Even worse, it seems that the time to
> re-read
> > > the
> > > > EFF is multiplied by the number of cores in use (i.e. the EFF is
> re-read
> > > by
> > > > each core sequentially). The contents of the EFF become active after
> the
> > > > first EXTERNAL commit (commitWithin does NOT work here) after the
> file
> > > has
> > > > been updated.
> > > >
> > > > In our case, the EFF was quite large - around 450MB - and we use 16
> > > shards,
> > > > so when we triggered an external commit to force re-reading, the
> whole
> > > > system would block for several (10-15) minutes. This won't work in a
> > > > production environment. The reason for the size of the EFF is that we
> > > have
> > > > around 7M documents in the index; each document has a 45 character
> ID.
> > > >
> > > > We got some help to try to fix the problem so that the re-read of the
> EFF
> > > > proceeds in the background (see
> > > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > a fix on the 4.1 branch). However, even though the re-read proceeds
> in
> > > the
> > > > background, the time required to launch solr now takes at least as
> long
> > > as
> > > > re-reading the EFFs. Again, this is not good enough for our needs.
> > > >
> > > > The next issue is that you cannot sort on EFF fields (though you can
> > > return
> > > > them as values using &fl=field(my_eff_field). This is also fixed in
> the
> > > 4.1
> > > > branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > >
> > > > So: Even after these fixes, EFF performance is not that great. Our
> > > solution
> > > > is as follows: The actual value of the popularity measure (say,
> reads)
> > > that
> > > > we want to report to the user is inserted into the search response
> > > > post-query by our query front-end. This value will then be the
> > > > authoritative value at the time of the query. The value of the
> popularity
> > > > measure that we use for boosting in the ranking of the search results
> is
> > > > only updated when the value has changed enough so that the impact on
> the
> > > > boost will be significant (say, more than 2%). This does require
> frequent
> > > > re-indexing of the documents that have significant changes in the
> number
> > > of
> > > > reads, but at least we won't have to update a document if it moves
> from,
> > > > say, 1000000 to 1000001 reads.
> > > >
> > > > /Martin Koch - ISSUU - senior systems architect.
> > > >
> > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <si...@apache.org>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > > I'm planning to move a quite big Solr index to SolrCloud. However,
> in
> > > > this
> > > > > index, an external file field is used for popularity ranking.
> > > > >
> > > > > Does SolrCloud supports external file fields? How does it cope with
> > > > > sharding and replication? Where should the external file be placed
> now
> > > > that
> > > > > the index folder is not local but in the cloud?
> > > > >
> > > > > Are there otherwise other best practices to deal with the use cases
> > > > > external file fields were used for, like popularity/ranking, in
> > > > SolrCloud?
> > > > > Custom ValueSources going to something external?
> > > > >
> > > > > Thanks in advance,
> > > > > Simone
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > >  <mk...@griddynamics.com>
> > >

Re: SolrCloud and exernal file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Martin,

Please find additional questions from me below.

Simone,

I'm sorry for hijacking your thread. The only thing I've heard about it, at
recent ApacheCon sessions, is that ZooKeeper is supposed to replicate those
files as configs under Solr home. I'm really looking forward to learning how
it works with huge files in production.

Thank you, guys!

On 20.11.2012 at 18:06, "Martin Koch" <ma...@issuu.com> wrote:
>
> Hi Mikhail
>
> Please see answers below.
>
> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
> > Martin,
> >
> > Thank you for telling your own "war story". It's really useful for the
> > community.
> > The first question might seem naive, but would you tell me what blocks
> > searching during EFF reload, when it's triggered by a handler or by a
> > listener?
> >
>
> We continuously index new documents using CommitWithin to get regular
> commits. However, we observed that the EFFs were not re-read, so we had to
> do external commits (curl '.../solr/update?commit=true') to force reload.
> When this is done, solr blocks. I can't tell you exactly why it's doing
> that (it was related to SOLR-3985).

Is there a chance to get a thread dump when they are blocked?


>
>
> > I don't really get the sentence about sequential commits and number of
> > cores. Do I get right that file is replicated via Zookeeper? Doesn't it
> >
>
> Again, this is observed behavior. When we issue a commit on a system with
> many Solr cores using EFFs, the system blocks for a long time (15 minutes).
> We do NOT use ZooKeeper for anything. The EFF is a symlink from each core's
> index dir to the actual file, which is updated by an external process.

Hold on, I asked about ZooKeeper because the subject mentions SolrCloud.

Do you use SolrCloud, Solr shards, or are these cores just replicas of the
same index?
Also, about the symlink: don't you share that file via NFS?

How many cores do you run per box?

Do the boxes have plenty of RAM to cache the filesystem besides the JVM heaps?

I assume you use 64-bit Linux and an mmap directory. Please confirm that.


>
>
> > causes scalability problem or long time to reload? Will it help if we'll
> > have, let's say ExternalDatabaseField which will pull values from jdbc. ie.
> >
>
> I think the possibility of having some fields being retrieved from an
> external, dynamically updatable store would be really interesting. This
> could be JDBC, something in-memory like redis, or a NoSql product (e.g.
> Cassandra).

OK. Let's keep it in mind as a possible direction.

>
>
> > why all cores can't read these values simultaneously?
> >
>
> Again, this is a solr implementation detail that I can't answer :)
>
>
> > Can you confirm that IDs in the file is ordered by the index term order?
> >
>
> Yes, we sorted the files (standard UNIX sort).
>
>
> > AFAIK it can impact load time.
> >
> Yes, it does

OK, I understand that you are aware of it, and that your IDs are just
strings, not integers.


>
>
> > Regarding your post-query solution can you tell me if query found 10000
> > docs, but I need to display only first page with 100 rows, whether I need
> > to pull all 10K results to frontend to order them by the rank?
> >
> >
> In our architecture, the clients query an API that generates the SOLR
> query, retrieves the relevant additional fields that we need, and returns
> the relevant JSON to the front-end.
>
> In our use case, results are returned from SOLR by the 10's, not by the
> 1000's, so it is a manageable job. Even so, if solr returned thousands of
> results, it would be up to the implementation of the api to augment only
> the results that needed to be returned to the front-end.
>
> Even so, patching up a JSON structure with 10000 results should be
> possible.

You are right. I'm concerned anyway, because retrieving the whole result
set is expensive and not always possible.


>
>
> > I'm really appreciate if you comment on the questions above.
> > PS: It's time to pitch, how much
> > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > ExternalFileField" can help you?
> >
> >
> It looks very interesting :) Does it make it possible to avoid re-reading
> the EFF on every commit, and only re-read the values that have actually
> changed?


You don't need a commit (with SOLR-4085) to reload the file content, but
after a commit you still need to read the whole file and scan all key terms
and postings. That's because the EFF sits on top of the top-level searcher;
it's the Solr-like way. At some point in the future we might have a
per-segment EFF. In that case adding a segment will still trigger a full
file scan, but in the index only the new segment will be scanned, so it
should be faster. As you know, straightforward sharing of internal data
structures between different index views/generations is not possible.
If you are asking about applying delta changes to the external file, that's
something we did ourselves: http://goo.gl/P8GFq . That feature is much
more doubtful and vague, although it might be the next contribution after
SOLR-4085.

>
> /Martin
>
>
> >
> > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com> wrote:
> >
> > > Solr 4.0 does support using EFFs, but it might not give you what
> > > you're hoping for.
> > >
> > > We tried using Solr Cloud, and have given up again.
> > >
> > > The EFF is placed in the parent of the index directory in each core;
each
> > > core reads the entire EFF and picks out the IDs that it is responsible
> > for.
> > >
> > > In the current 4.0.0 release of solr, solr blocks (doesn't answer
> > queries)
> > > while re-reading the EFF. Even worse, it seems that the time to
re-read
> > the
> > > EFF is multiplied by the number of cores in use (i.e. the EFF is
re-read
> > by
> > > each core sequentially). The contents of the EFF become active after
the
> > > first EXTERNAL commit (commitWithin does NOT work here) after the file
> > has
> > > been updated.
> > >
> > > In our case, the EFF was quite large - around 450MB - and we use 16
> > shards,
> > > so when we triggered an external commit to force re-reading, the whole
> > > system would block for several (10-15) minutes. This won't work in a
> > > production environment. The reason for the size of the EFF is that we
> > have
> > > around 7M documents in the index; each document has a 45 character ID.
> > >
> > > We got some help to try to fix the problem so that the re-read of the
EFF
> > > proceeds in the background (see
> > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > a fix on the 4.1 branch). However, even though the re-read proceeds in
> > the
> > > background, the time required to launch solr now takes at least as
long
> > as
> > > re-reading the EFFs. Again, this is not good enough for our needs.
> > >
> > > The next issue is that you cannot sort on EFF fields (though you can
> > return
> > > them as values using &fl=field(my_eff_field). This is also fixed in
the
> > 4.1
> > > branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > >
> > > So: Even after these fixes, EFF performance is not that great. Our
> > solution
> > > is as follows: The actual value of the popularity measure (say, reads)
> > that
> > > we want to report to the user is inserted into the search response
> > > post-query by our query front-end. This value will then be the
> > > authoritative value at the time of the query. The value of the
popularity
> > > measure that we use for boosting in the ranking of the search results
is
> > > only updated when the value has changed enough so that the impact on
the
> > > boost will be significant (say, more than 2%). This does require
frequent
> > > re-indexing of the documents that have significant changes in the
number
> > of
> > > reads, but at least we won't have to update a document if it moves
from,
> > > say, 1000000 to 1000001 reads.
> > >
> > > /Martin Koch - ISSUU - senior systems architect.
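
The thresholding rule described in the quoted message above (re-index a
document only when the change in its popularity would move the boost by more
than about 2%) can be sketched as follows. This is only an illustration under
stated assumptions: the thread does not say which boost function is used, so
the log-based boost() here is hypothetical, as are the function names and the
default threshold.

```python
import math


def boost(reads):
    """Hypothetical boost function; the thread does not say which one is used."""
    return math.log10(1 + reads)


def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """Return True when the boost moved by more than `threshold` (relative)."""
    old = boost(indexed_reads)
    new = boost(current_reads)
    if old == 0:
        # No boost was indexed yet: any non-zero boost counts as a change.
        return new != 0
    return abs(new - old) / old > threshold
```

With this boost function, needs_reindex(1000000, 1000001) is False, so a
document that gains a single read is left alone, while needs_reindex(10, 100)
is True. A different boost function would move the cut-off points.
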
> > >
> > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <si...@apache.org>
> > wrote:
> > >
> > > > Hi all,
> > > > I'm planning to move a quite big Solr index to SolrCloud. However,
in
> > > this
> > > > index, an external file field is used for popularity ranking.
> > > >
> > > > Does SolrCloud supports external file fields? How does it cope with
> > > > sharding and replication? Where should the external file be placed
now
> > > that
> > > > the index folder is not local but in the cloud?
> > > >
> > > > Are there otherwise other best practices to deal with the use cases
> > > > external file fields were used for, like popularity/ranking, in
> > > SolrCloud?
> > > > Custom ValueSources going to something external?
> > > >
> > > > Thanks in advance,
> > > > Simone
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> >  <mk...@griddynamics.com>
> >

Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

Hi Mikhail

Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Martin,
>
> Thank you for telling your own "war story". It's really useful for the
> community.
> The first question might seem naive, but would you tell me what blocks
> searching during EFF reload, when it's triggered by a handler or by a
> listener?
>

We continuously index new documents using commitWithin to get regular
commits. However, we observed that the EFFs were not re-read, so we had to
do external commits (curl '.../solr/update?commit=true') to force a reload.
When this is done, Solr blocks. I can't tell you exactly why it does
that (it was related to SOLR-3985).


> I don't really get the sentence about sequential commits and number of
> cores. Do I get right that file is replicated via Zookeeper? Doesn't it
>

Again, this is observed behavior. When we issue a commit on a system with
many Solr cores using EFFs, the system blocks for a long time (15 minutes).
We do NOT use ZooKeeper for anything. The EFF is a symlink from each core's
index dir to the actual file, which is updated by an external process.


> causes scalability problem or long time to reload? Will it help if we'll
> have, let's say ExternalDatabaseField which will pull values from jdbc. ie.
>

I think the possibility of having some fields being retrieved from an
external, dynamically updatable store would be really interesting. This
could be JDBC, something in-memory like redis, or a NoSql product (e.g.
Cassandra).


> why all cores can't read these values simultaneously?
>

Again, this is a solr implementation detail that I can't answer :)


> Can you confirm that IDs in the file is ordered by the index term order?
>

Yes, we sorted the files (standard UNIX sort).
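
For readers who have not prepared such a file: an external file field file is
a plain text file with one key=value line per document, and sorting it by key
is what the exchange above refers to. Below is a minimal Python sketch of
producing a sorted file. It assumes string IDs and float values; the function
name, the sample IDs and the file name are hypothetical, and Python's
code-point ordering is used as a stand-in for a byte-wise UNIX sort
(LC_ALL=C sort), which may differ from a locale-aware sort.

```python
import os
import tempfile


def write_sorted_eff(values, path):
    """Write key=value lines sorted by key; return the number of lines written."""
    # Write to a temporary file in the same directory, then rename it into
    # place, so that a reader never sees a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for doc_id in sorted(values):
            out.write(f"{doc_id}={values[doc_id]}\n")
    os.replace(tmp_path, path)
    return len(values)
```

For example, write_sorted_eff({"doc-b": 12.0, "doc-a": 1000001.0},
"external_popularity") writes the doc-a line before the doc-b line. The
rename step matters in a setup like ours, where an external process rewrites
the file while Solr may be reading it.
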


> AFAIK it can impact load time.
>
Yes, it does.


> Regarding your post-query solution can you tell me if query found 10000
> docs, but I need to display only first page with 100 rows, whether I need
> to pull all 10K results to frontend to order them by the rank?
>
>
In our architecture, the clients query an API that generates the Solr
query, retrieves the relevant additional fields that we need, and returns
the relevant JSON to the front-end.

In our use case, results are returned from Solr by the tens, not by the
thousands, so it is a manageable job. Even if Solr returned thousands of
results, it would be up to the implementation of the API to augment only
the results that need to be returned to the front-end.

In any case, patching up a JSON structure with 10000 results should be
possible.
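
To make the post-query step concrete, here is a minimal Python sketch of
augmenting only the page that is actually returned to the front-end. It is an
illustration, not our actual API code: augment_page, lookup_reads and the
"reads" key are hypothetical names, and lookup_reads stands for whatever
batched lookup the external store (a database, Redis, etc.) offers.

```python
def augment_page(docs, lookup_reads, start=0, rows=10):
    """Attach an up-to-date read count to the documents on one result page."""
    # Only the requested page is touched, however many documents matched.
    page = docs[start:start + rows]
    ids = [doc["id"] for doc in page]
    # One batched lookup for the whole page instead of one call per document.
    reads = lookup_reads(ids)
    # Build new dicts so the original search response is left unmodified.
    return [dict(doc, reads=reads.get(doc["id"], 0)) for doc in page]
```

Documents missing from the external store get a count of 0 here; whether that
is the right default depends on the application.
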


> I'm really appreciate if you comment on the questions above.
> PS: It's time to pitch, how much
> https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> ExternalFileField" can help you?
>
>
It looks very interesting :) Does it make it possible to avoid re-reading
the EFF on every commit, and only re-read the values that have actually
changed?

/Martin


>
> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <ma...@issuu.com> wrote:
>
> > Solr 4.0 does support using EFFs, but it might not give you what you're
> > hoping for.
> >
> > We tried using Solr Cloud, and have given up again.
> >
> > The EFF is placed in the parent of the index directory in each core; each
> > core reads the entire EFF and picks out the IDs that it is responsible for.
> >
> > In the current 4.0.0 release of Solr, Solr blocks (doesn't answer queries)
> > while re-reading the EFF. Even worse, it seems that the time to re-read the
> > EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by
> > each core sequentially). The contents of the EFF become active after the
> > first EXTERNAL commit (commitWithin does NOT work here) after the file has
> > been updated.
> >
> > In our case, the EFF was quite large - around 450MB - and we use 16 shards,
> > so when we triggered an external commit to force re-reading, the whole
> > system would block for several (10-15) minutes. This won't work in a
> > production environment. The reason for the size of the EFF is that we have
> > around 7M documents in the index; each document has a 45 character ID.
> >
> > We got some help to try to fix the problem so that the re-read of the EFF
> > proceeds in the background (see
> > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > a fix on the 4.1 branch). However, even though the re-read proceeds in the
> > background, launching Solr now takes at least as long as re-reading the
> > EFFs. Again, this is not good enough for our needs.
> >
> > The next issue is that you cannot sort on EFF fields (though you can return
> > them as values using &fl=field(my_eff_field)). This is also fixed in the 4.1
> > branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> >
> > So: Even after these fixes, EFF performance is not that great. Our
> solution
> > is as follows: The actual value of the popularity measure (say, reads)
> that
> > we want to report to the user is inserted into the search response
> > post-query by our query front-end. This value will then be the
> > authoritative value at the time of the query. The value of the popularity
> > measure that we use for boosting in the ranking of the search results is
> > only updated when the value has changed enough so that the impact on the
> > boost will be significant (say, more than 2%). This does require frequent
> > re-indexing of the documents that have significant changes in the number
> of
> > reads, but at least we won't have to update a document if it moves from,
> > say, 1000000 to 1000001 reads.
> >
> > /Martin Koch - ISSUU - senior systems architect.
> >
> > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <si...@apache.org>
> wrote:
> >
> > > Hi all,
> > > I'm planning to move a quite big Solr index to SolrCloud. However, in
> > this
> > > index, an external file field is used for popularity ranking.
> > >
> > > Does SolrCloud supports external file fields? How does it cope with
> > > sharding and replication? Where should the external file be placed now
> > that
> > > the index folder is not local but in the cloud?
> > >
> > > Are there otherwise other best practices to deal with the use cases
> > > external file fields were used for, like popularity/ranking, in
> > SolrCloud?
> > > Custom ValueSources going to something external?
> > >
> > > Thanks in advance,
> > > Simone
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>

Re: SolrCloud and exernal file fields

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Martin,

Thank you for sharing your own "war story". It's really useful for the
community.
The first question might not seem well thought out, but could you tell me
what blocks searching during the EFF reload, when it's triggered by a
handler or by a listener?
I don't really get the sentence about sequential commits and the number of
cores. Do I understand correctly that the file is replicated via
Zookeeper? Doesn't that cause a scalability problem or a long reload time?
Would it help if we had, let's say, an ExternalDatabaseField that pulls
values from JDBC? I.e., why can't all cores read these values
simultaneously?
Can you confirm that the IDs in the file are ordered by the index term
order? AFAIK it can impact load time.
Regarding your post-query solution: if a query finds 10000 docs but I need
to display only the first page with 100 rows, do I need to pull all 10K
results to the frontend to order them by rank?

I'd really appreciate it if you could comment on the questions above.
PS: Time for a pitch: how much could
https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
ExternalFileField" help you?



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

The short answer is no; the number was chosen in an attempt to get as many
cores as possible working in parallel to complete the search faster, but I
realize that distributing the query and merging the results incurs an
overhead. We've now gone to 8 shards and will be monitoring performance.

/Martin

On Thu, Nov 22, 2012 at 3:53 PM, Yonik Seeley <yo...@lucidworks.com> wrote:

> On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch <ma...@issuu.com> wrote:
> > around 7M documents in the index; each document has a 45 character ID.
>
> 7M documents isn't that large.  Is there a reason why you need so many
> shards (16 in your case) on a single box?
>
> -Yonik
> http://lucidworks.com
>

Re: SolrCloud and exernal file fields

Posted by Yonik Seeley <yo...@lucidworks.com>.

On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch <ma...@issuu.com> wrote:
> around 7M documents in the index; each document has a 45 character ID.

7M documents isn't that large.  Is there a reason why you need so many
shards (16 in your case) on a single box?

-Yonik
http://lucidworks.com

Re: SolrCloud and exernal file fields

Posted by Simone Gianni <si...@apache.org>.

Hi Martin,
thanks for sharing your experience with EFF and saving me a lot of time
figuring it out myself. I was afraid of exactly this kind of problem.

Mikhail, thanks for expanding the thread with even more useful information!

Simone


Re: SolrCloud and exernal file fields

Posted by Martin Koch <ma...@issuu.com>.

Solr 4.0 does support using EFFs, but it might not give you what you're
hoping for.

We tried using Solr Cloud, and have given up again.

The EFF is placed in the parent of the index directory in each core; each
core reads the entire EFF and picks out the IDs that it is responsible for.
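
For illustration, a minimal sketch of producing such a file, assuming the
usual ExternalFileField conventions: one key=value line per document, in a
file named external_<fieldname> in the core's data directory. The helper
name write_eff, the field name "popularity" and the document IDs are
hypothetical. Keys are written in sorted order, which elsewhere in this
thread was reported to affect load time.

```python
import os
import tempfile


def write_eff(data_dir: str, field: str, values: dict) -> str:
    """Write key=value lines, sorted by key, to <data_dir>/external_<field>."""
    path = os.path.join(data_dir, f"external_{field}")
    # Temporary name deliberately does not start with "external_", so a
    # half-written file cannot be mistaken for the real one.
    tmp = os.path.join(data_dir, f".{field}.tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        for key in sorted(values):
            f.write(f"{key}={values[key]}\n")
    os.replace(tmp, path)  # swap the finished file into place
    return path


# Demo against a throwaway directory instead of a real core's data dir.
with tempfile.TemporaryDirectory() as data_dir:
    path = write_eff(data_dir, "popularity", {"doc-b": 2, "doc-a": 1.5})
    with open(path, encoding="utf-8") as f:
        content = f.read()
```

In a multi-core setup like the one described here, the same file would
have to be copied into every core's data directory.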

In the current 4.0.0 release of solr, solr blocks (doesn't answer queries)
while re-reading the EFF. Even worse, it seems that the time to re-read the
EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by
each core sequentially). The contents of the EFF become active after the
first EXTERNAL commit (commitWithin does NOT work here) after the file has
been updated.
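
A small sketch of triggering such an explicit commit through the standard
update handler (commit=true on the /update URL). The base URL and the core
name "collection1" are assumptions for illustration only; the request
itself is left commented out so the snippet runs without a Solr instance.

```python
from urllib.parse import urlencode


def commit_url(solr_base: str, core: str) -> str:
    """Build the URL for an explicit commit on a single core."""
    query = urlencode({"commit": "true"})
    return f"{solr_base.rstrip('/')}/{core}/update?{query}"


url = commit_url("http://localhost:8983/solr/", "collection1")

# To actually send it (requires a running Solr at that address):
# import urllib.request
# urllib.request.urlopen(url).read()
```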

In our case, the EFF was quite large - around 450MB - and we use 16 shards,
so when we triggered an external commit to force re-reading, the whole
system would block for several (10-15) minutes. This won't work in a
production environment. The reason for the size of the EFF is that we have
around 7M documents in the index; each document has a 45 character ID.
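
As a rough back-of-the-envelope check of those numbers (assuming
single-byte ID characters and taking "450MB" as decimal megabytes, both of
which are assumptions), the IDs alone account for most of the file:

```python
DOCS = 7_000_000          # "around 7M documents"
ID_CHARS = 45             # characters per ID, assumed one byte each
EFF_BYTES = 450 * 10**6   # "around 450MB", assumed decimal megabytes

id_bytes = DOCS * ID_CHARS        # bytes spent on the IDs alone
per_line = EFF_BYTES / DOCS       # average bytes per line of the file
rest = per_line - ID_CHARS        # left for '=', the value and the newline

print(f"IDs: {id_bytes / 10**6:.0f} MB, "
      f"avg line: {per_line:.1f} bytes, non-ID part: {rest:.1f} bytes")
```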

We got some help to try to fix the problem so that the re-read of the EFF
proceeds in the background (see
here<https://issues.apache.org/jira/browse/SOLR-3985> for
a fix on the 4.1 branch). However, even though the re-read proceeds in the
background, launching solr now takes at least as long as re-reading the
EFFs. Again, this is not good enough for our needs.

The next issue is that you cannot sort on EFF fields (though you can return
them as values using &fl=field(my_eff_field)). This is also fixed in the 4.1
branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
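
A sketch of what the corresponding request parameters could look like.
The field name my_eff_field is the placeholder used above; the helper name
eff_query_params and the "id" field are assumptions. The sort parameter
would only be expected to work on a build that includes the fix referenced
above.

```python
from urllib.parse import parse_qs, urlencode


def eff_query_params(eff_field: str, q: str = "*:*",
                     sort_by_eff: bool = False) -> str:
    """Build a URL-encoded query string that returns the EFF value per doc."""
    func = f"field({eff_field})"
    params = {"q": q, "fl": f"id,{func}"}
    if sort_by_eff:
        # Sorting on the function value; needs the sorting fix mentioned above.
        params["sort"] = f"{func} desc"
    return urlencode(params)


query_string = eff_query_params("my_eff_field", sort_by_eff=True)
decoded = parse_qs(query_string)  # decode again to inspect the parameters
```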

So: Even after these fixes, EFF performance is not that great. Our solution
is as follows: The actual value of the popularity measure (say, reads) that
we want to report to the user is inserted into the search response
post-query by our query front-end. This value will then be the
authoritative value at the time of the query. The value of the popularity
measure that we use for boosting in the ranking of the search results is
only updated when the value has changed enough so that the impact on the
boost will be significant (say, more than 2%). This does require frequent
re-indexing of the documents that have significant changes in the number of
reads, but at least we won't have to update a document if it moves from,
say, 1000000 to 1000001 reads.
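
The thresholding idea can be sketched roughly as follows. This is a
simplification, not the actual implementation: it compares the relative
change of the raw read count, whereas the description above is in terms of
the impact on the boost. The function name needs_reindex and the default
threshold are hypothetical.

```python
def needs_reindex(indexed_reads: int, current_reads: int,
                  threshold: float = 0.02) -> bool:
    """Return True if the value has drifted enough to justify re-indexing."""
    if indexed_reads <= 0:
        # No meaningful baseline: any difference counts as significant.
        return current_reads != indexed_reads
    change = abs(current_reads - indexed_reads) / indexed_reads
    return change > threshold


small_change = needs_reindex(1000000, 1000001)   # the example from the text
large_change = needs_reindex(1000, 1030)         # a 3% change
```

With a boost that grows more slowly than the raw count (a logarithm, for
example), the same comparison would be applied to the boost values
instead.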

/Martin Koch - ISSUU - senior systems architect.

On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <si...@apache.org> wrote:

> Hi all,
> I'm planning to move a quite big Solr index to SolrCloud. However, in this
> index, an external file field is used for popularity ranking.
>
> Does SolrCloud supports external file fields? How does it cope with
> sharding and replication? Where should the external file be placed now that
> the index folder is not local but in the cloud?
>
> Are there otherwise other best practices to deal with the use cases
> external file fields were used for, like popularity/ranking, in SolrCloud?
> Custom ValueSources going to something external?
>
> Thanks in advance,
> Simone
>