Posted to solr-user@lucene.apache.org by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil> on 2016/07/05 18:58:46 UTC

simple setup help (UNCLASSIFIED)


Can someone walk a noob through setting up a DataImportHandler?
I need to index a ColdFusion website.

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor - Catapult Technology Inc.
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~

-----Original Message-----
From: Lorenzo Fundaró [mailto:lorenzo.fundaro@dawandamail.com] 
Sent: Tuesday, July 05, 2016 2:39 PM
To: solr-user@lucene.apache.org
Subject: [Non-DoD Source] Re: deploy solr on cloud providers


@Tomas and @Steven

I am a bit skeptical about these two statements:

> If a node just disappears you should be fine in terms of data
> availability, since Solr in "SolrCloud" replicates the data as it
> comes in (before sending the HTTP response)


and

> You shouldn't "need" to move the storage as SolrCloud will replicate 
> all data to the new node and anything in the transaction log will 
> already be distributed through the rest of the machines.


because, according to the official documentation under "Write Side Fault
Tolerance" -> "Recovery":
https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance

> If a leader goes down, it may have sent requests to some replicas and not
> others. So when a new potential leader is identified, it runs a sync
> process against the other replicas. If this is successful, everything
> should be consistent, the leader registers as active, and normal
> actions proceed.


I think there is a possibility that an update is not sent by the leader to the replicas but is kept on its local disk, so that after the leader comes back up it can sync the unsent data.

Furthermore:

> Achieved Replication Factor
> When using a replication factor greater than one, an update request 
> may succeed on the shard leader but fail on one or more of the 
> replicas. For instance, consider a collection with one shard and 
> replication factor of three. In this case, you have a shard leader and two additional replicas.
> If an update request succeeds on the leader but fails on both 
> replicas, for whatever reason, the update request is still considered 
> successful from the perspective of the client. The replicas that 
> missed the update will sync with the leader when they recover.


They have implemented a parameter called *min_rf* that you can set
(client-side) to verify that an update reached at least that many replicas,
counting the leader (e.g. min_rf=2 means the leader plus one replica). Solr
reports the achieved replication factor in the response, and the client can
react if it falls short.
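
For illustration, a minimal SolrJ sketch of that client-side check (the
ZooKeeper address and collection name are placeholders, and the exact
client API shifts a bit between Solr versions, so treat it as a sketch):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; this constructor is Solr 5/6 style.
        try (CloudSolrClient client = new CloudSolrClient("localhost:2181")) {
            client.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            // Ask Solr to report whether the update reached the leader
            // plus at least one replica (min_rf counts the leader).
            req.setParam("min_rf", "2");

            UpdateResponse resp = req.process(client);
            int rf = client.getMinAchievedReplicationFactor("mycollection",
                    resp.getResponse());
            if (rf < 2) {
                // Solr does NOT roll the update back when min_rf is missed;
                // it only reports the shortfall, so retrying is up to us.
                System.err.println("achieved replication factor only " + rf);
            }
        }
    }
}

Note that min_rf never makes Solr fail or undo the update; it only tells the
client what was achieved, which is exactly why the sync behaviour above matters.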

This is the reason for my concern about moving storage around: if the old disk is kept, I know that when the shard leader comes back, SolrCloud will run its sync process for the documents that couldn't be sent to the replicas.

Am I missing something, or have I misunderstood the documentation?

Cheers!

On 5 July 2016 at 19:49, Davis, Daniel (NIH/NLM) [C] <da...@nih.gov>
wrote:

> Lorenzo, this probably comes late, but my systems guys just don't want to
> give me real disk. Although RAID-5 or LVM on top of JBOD may be better
> than Amazon EBS, Amazon EBS is still much closer to real disk in terms of
> IOPS and latency than NFS ;)  I even ran a mini test (not an official
> benchmark), and found the response time for random reads to be better.
>
> If you are a young/smallish company, this may be all in the cloud, but
> if you are in a large organization like mine, you may also need to
> allow for other architectures, such as a "virtual" NetApp in the cloud
> that communicates with a physical NetApp on-premises, and the
> throughput/latency of that. The most important thing is to actually
> measure the numbers you are getting, both for search and for simply raw
> I/O, or to get your systems/storage guys to measure those numbers. If
> you get your systems/storage guys to just measure storage, you will
> want to care about three things for indexing primarily (a rough test
> sketch follows the list):
>
>         Sequential Write Throughput
>         Random Read Throughput
>         Random Read Response Time/Latency
>
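> To illustrate, a minimal Java sketch of that kind of mini test (file
> path, sizes, and counts are arbitrary placeholders; a dedicated tool
> such as fio gives more rigorous numbers):
>
> import java.io.OutputStream;
> import java.io.RandomAccessFile;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
> import java.util.Random;
>
> public class DiskProbe {
>     public static void main(String[] args) throws Exception {
>         Path p = Paths.get("/tmp/diskprobe.dat"); // placeholder path
>         int blocks = 4096, blockSize = 64 * 1024; // ~256 MiB total
>
>         // 1. Sequential write throughput.
>         byte[] buf = new byte[blockSize];
>         long t0 = System.nanoTime();
>         try (OutputStream out = Files.newOutputStream(p)) {
>             for (int i = 0; i < blocks; i++) out.write(buf);
>         }
>         double secs = (System.nanoTime() - t0) / 1e9;
>         System.out.printf("seq write: %.1f MB/s%n",
>                 blocks * (long) blockSize / 1e6 / secs);
>
>         // 2. Random read response time. Beware the OS page cache: a
>         // file much larger than RAM gives more honest numbers.
>         Random rnd = new Random(42);
>         byte[] one = new byte[4096];
>         long worst = 0, total = 0;
>         try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "r")) {
>             for (int i = 0; i < 1000; i++) {
>                 long off = (long) rnd.nextInt(blocks) * blockSize;
>                 long s = System.nanoTime();
>                 raf.seek(off);
>                 raf.readFully(one);
>                 long d = System.nanoTime() - s;
>                 total += d;
>                 worst = Math.max(worst, d);
>             }
>         }
>         System.out.printf("rand read: avg %.1f us, worst %.1f us%n",
>                 total / 1000.0 / 1000, worst / 1000.0);
>         Files.deleteIfExists(p);
>     }
> }
>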
> Hope this helps,
>
> Dan Davis, Systems/Applications Architect (Contractor), Office of 
> Computer and Communications Systems, National Library of Medicine, NIH
>
>
>
> -----Original Message-----
> From: Lorenzo Fundaró [mailto:lorenzo.fundaro@dawandamail.com]
> Sent: Tuesday, July 05, 2016 3:20 AM
> To: solr-user@lucene.apache.org
> Subject: Re: deploy solr on cloud providers
>
> Hi Shawn. Actually what I'm trying to find out is whether this is the
> best approach for deploying Solr in the cloud. I believe SolrCloud
> solves a lot of problems in terms of high availability, but when it
> comes to storage there seems to be a limitation that can be worked
> around, of course, but it's a bit cumbersome, and I was wondering if
> there is a better option or if I'm missing something with the way I'm
> doing it. I wonder if there is some proven experience about how to
> solve the storage problem when deploying in the cloud. Any advice or
> pointer to some enlightening documentation would be appreciated. Thanks.
> On Jul 4, 2016 18:27, "Shawn Heisey" <ap...@elyograg.org> wrote:
>
> > On 7/4/2016 10:18 AM, Lorenzo Fundaró wrote:
> > > when deploying solr (in solrcloud mode) in the cloud one has to 
> > > take care of storage, and as far as I understand it can be a 
> > > problem because the storage should go wherever the node is 
> > > created. If we have for example, a node on EC2 with its own 
> > > persistent disk, this node happens to be the leader and at some 
> > > point crashes but couldn't make the replication of the data that 
> > > has in the transaction log, how do we do in that case ? Ideally 
> > > the new node must use the leftover data that the death node left, 
> > > but this is a bit cumbersome in my opinion. What are the best practices for this ?
> >
> > I can't make any sense of this.  What is the *exact* problem you 
> > need to solve?  The details can be very important.
> >
> > We might be dealing with this:
> >
> > http://people.apache.org/~hossman/#xyproblem
> >
> > Thanks,
> > Shawn
> >
> >
>



--
Lorenzo Fundaro
Backend Engineer
E-Mail: lorenzo.fundaro@dawandamail.com

Fax       + 49 - (0)30 - 25 76 08 52
Tel        + 49 - (0)179 - 51 10 982

DaWanda GmbH
Windscheidstraße 18
10627 Berlin

Managing Directors: Claudia Helming and Niels Nüssler
AG Charlottenburg HRB 104695 B
http://www.dawanda.com

