Posted to solr-user@lucene.apache.org by Ali Nazemian <al...@gmail.com> on 2014/05/26 15:50:15 UTC

Using SolrCloud with RDBMS or without

Hi everybody,

I was wondering which scenario (or combination of them) would be better for
my application in terms of performance, scalability, and high availability.
Here is my application:

Suppose I am going to have more than 10m documents, and the number grows
every day (it will probably reach more than 100m docs within a year). I want
to use Solr as the tool for indexing these documents, but the problem is that
I have some data fields that could change frequently (not too often, but they
can change).

Scenarios:

1- Using SolrCloud as the database for all data (even the fields that could
change).

2- Using SolrCloud as the database for static data and an RDBMS (such as
Oracle) for storing the dynamic fields.

3- Using an integration of SolrCloud and Hadoop (HDFS + MapReduce) for all
data.
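
For scenario 1, the closest Solr feature to "updating a field in place" is an
atomic update, which resubmits only the changed field of an existing document,
provided all other fields are stored. The following is a minimal SolrJ sketch
of that idea, not something proposed in this thread; the collection URL,
document id and the view_count field are hypothetical placeholders:

    import java.util.Collections;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AtomicUpdateSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical collection URL; with SolrCloud you would more likely
            // use CloudSolrServer with the ZooKeeper address instead.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/documents");

            // Identify the existing document by its unique key and "set" only the changed field.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-42");
            doc.addField("view_count", Collections.singletonMap("set", 123));

            server.add(doc);
            server.commit();   // in practice, prefer autoCommit/commitWithin over explicit commits
            server.shutdown();
        }
    }

Atomic updates still rewrite the whole document internally and require the
other fields to be stored, which is one reason the replies below recommend
keeping a canonical copy of the data outside the search index.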

Best regards.

-- 
A.Nazemian

Re: Using SolrCloud with RDBMS or without

Posted by Ali Nazemian <al...@gmail.com>.
The reason I left Cassandra out is that it seems to be the right fit when
you have a very write-heavy workload. In my case I do have some update
operations, but reads definitely far outnumber writes. There are probably
other possible scenarios for my application as well; my question is which
one is likely to be the best.
Best regards.


On Mon, May 26, 2014 at 6:27 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

> You could also consider DataStax Enterprise, which integrates Apache
> Cassandra as the primary database and Solr for indexing and query.
>
> See:
> http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
>
> -- Jack Krupansky
>
> -----Original Message----- From: Ali Nazemian
> Sent: Monday, May 26, 2014 9:50 AM
> To: solr-user@lucene.apache.org
> Subject: Using SolrCloud with RDBMS or without
>
>
> Hi everybody,
>
> I was wondering which scenario (or the combination) would be better for my
> application. From the aspect of performance, scalability and high
> availability. Here is my application:
>
> Suppose I am going to have more than 10m documents and it grows every day.
> (probably in 1 years it reaches to more than 100m docs. I want to use Solr
> as tool for indexing these documents but the problem is I have some data
> fields that could change frequently. (not too much but it could change)
>
> Scenarios:
>
> 1- Using SolrCloud as database for all data. (even the one that could be
> changed)
>
> 2- Using SolrCloud as database for static data and using RDBMS (such as
> oracle) for storing dynamic fields.
>
> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
> data.
>
> Best regards.
>
> --
> A.Nazemian
>



-- 
A.Nazemian

Re: Using SolrCloud with RDBMS or without

Posted by Jack Krupansky <ja...@basetechnology.com>.
You could also consider DataStax Enterprise, which integrates Apache 
Cassandra as the primary database and Solr for indexing and query.

See:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

-- Jack Krupansky

-----Original Message----- 
From: Ali Nazemian
Sent: Monday, May 26, 2014 9:50 AM
To: solr-user@lucene.apache.org
Subject: Using SolrCloud with RDBMS or without

Hi everybody,

I was wondering which scenario (or the combination) would be better for my
application. From the aspect of performance, scalability and high
availability. Here is my application:

Suppose I am going to have more than 10m documents and it grows every day.
(probably in 1 years it reaches to more than 100m docs. I want to use Solr
as tool for indexing these documents but the problem is I have some data
fields that could change frequently. (not too much but it could change)

Scenarios:

1- Using SolrCloud as database for all data. (even the one that could be
changed)

2- Using SolrCloud as database for static data and using RDBMS (such as
oracle) for storing dynamic fields.

3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
data.

Best regards.

-- 
A.Nazemian 


Re: Using SolrCloud with RDBMS or without

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/26/2014 1:48 PM, Ali Nazemian wrote:
> Dear Shawn,
> Hi and thank you for you reply.
> Could you please tell me about the performance and scalability of the
> mentioned solutions? Suppose I have a SolrCloud with 4 different machine.
> Would it scale linearly if I add another 4 machines to that? I mean when
> the documents number increases from 10m to 100m documents.

I am completely unable to give you any kind of definitive answer to
that.  The only way to estimate what kind of performance and scalability
to expect with your data is to actually build a test system with your data.

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Thanks,
Shawn


Re: Using SolrCloud with RDBMS or without

Posted by Ali Nazemian <al...@gmail.com>.
Dear Shawn,
Hi, and thank you for your reply.
Could you please tell me about the performance and scalability of the
mentioned solutions? Suppose I have a SolrCloud cluster with 4 machines.
Would it scale linearly if I add another 4 machines, i.e. when the number of
documents increases from 10m to 100m?
Regards.


On Mon, May 26, 2014 at 8:30 PM, Shawn Heisey <so...@elyograg.org> wrote:

> On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> > I was wondering which scenario (or the combination) would be better for
> my
> > application. From the aspect of performance, scalability and high
> > availability. Here is my application:
> >
> > Suppose I am going to have more than 10m documents and it grows every
> day.
> > (probably in 1 years it reaches to more than 100m docs. I want to use
> Solr
> > as tool for indexing these documents but the problem is I have some data
> > fields that could change frequently. (not too much but it could change)
>
> Choosing which database software to use to hold your data is a problem
> with many possible solutions.  Everyone will have a different answer for
> you.  Each solution has strengths and weaknesses, and in the end, only
> you can really know what your requirements are.
>
> > Scenarios:
> >
> > 1- Using SolrCloud as database for all data. (even the one that could be
> > changed)
>
> If you choose to use Solr as a NoSQL, I would strongly recommend that
> you have two Solr installs.  The first install would be purely for data
> storage and would have no indexed fields.  If you can get machines with
> enough RAM, it would also probably be preferable to use a single index
> (or SolrCloud with one shard) for that install.  The other install would
> be for searching.  Sharding would not be an issue on that index.  The
> reason that I make this recommendation is that when you use Solr for
> searching, you have to do a complete reindex if you change your search
> schema.  It's difficult to reindex if the search index is also your
> canonical data source.
>
> > 2- Using SolrCloud as database for static data and using RDBMS (such as
> > oracle) for storing dynamic fields.
>
> I don't think it would be a good idea to have two canonical data
> sources.  Pick one.  As already mentioned, Solr is better as a search
> technology, serving up pointers to data in another data source, than as
> a database.
>
> If you want to use RDBMS technology, why would you spend all that money
> on Oracle?  Just use one of the free databases.  Our really large Solr
> index comes from a database.  At one time that database was in Oracle.
> When my employer purchased the company with that database, we thought we
> were obtaining a full Oracle license.  It turns out we weren't.  It
> would have cost about half a million dollars to buy that license, so we
> switched to MySQL.
>
> Since making that move to MySQL, performance is actually *better*.  The
> source table for our data has 96 million rows right now, growing at a
> rate of a few million per year.  This is completely in line with your
> 100 million document requirement.  For the massive table that feeds
> Solr, we might switch to MongoDB, but that has not been decided yet.
>
> Later we switched from EasyAsk to Solr, a move that has *also* given us
> better performance.  Because both MySQL and Solr are free, we've
> achieved a substantial cost savings.
>
> > 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
> > data.
>
> I have no experience with this technology, but I think that if you are
> thinking about a database on HDFS, you're probably actually talking
> about HBase, the Apache implementation of Google's BigTable.
>
> Thanks,
> Shawn
>
>


-- 
A.Nazemian

RE: Using SolrCloud with RDBMS or without

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
A few things would help here if you can clarify what is acceptable in terms of indexing hours and what the use case for indexing is:

·         Are you looking to re-index all data (say 100m) frequently, so that you need the indexing time to be on the lower side (<10 or <5 hours, etc.)? If so, how many hours would you consider reasonable?

·         Or can you afford not to re-index all data and instead do incremental indexing? (I am not sure how frequently your schema fields change, as you mentioned.)

Also, as Erick pointed out, you can achieve fast indexing by using SolrJ with some parallelism. We recently had a use case where we indexed around 10m docs from a database in less than half an hour.

Thanks,

Susheel
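
As a rough sketch of the SolrJ-with-parallelism approach mentioned above, the
snippet below uses the SolrJ 4.x ConcurrentUpdateSolrServer (renamed
ConcurrentUpdateSolrClient in later releases), which buffers documents and
sends them over several background threads. The URL, queue size, thread count
and the makeDoc() helper are hypothetical placeholders; actual throughput
depends entirely on the hardware and schema, as discussed elsewhere in this
thread:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; buffer up to 10,000 docs and flush them with 4 background threads.
            ConcurrentUpdateSolrServer server =
                    new ConcurrentUpdateSolrServer("http://localhost:8983/solr/documents", 10000, 4);

            for (int i = 0; i < 10000000; i++) {
                server.add(makeDoc(i));          // non-blocking; queued and sent in the background
            }

            server.blockUntilFinished();         // wait until every queued update has been sent
            server.commit();
            server.shutdown();
        }

        // Hypothetical stand-in for reading one row/record from the actual data source.
        private static SolrInputDocument makeDoc(int i) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "Document " + i);
            return doc;
        }
    }

One known caveat of this client is that indexing errors are only logged by
default, so production code usually overrides its handleError() method or
verifies the results separately.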



-----Original Message-----
From: Ali Nazemian [mailto:alinazemian@gmail.com]
Sent: Monday, May 26, 2014 2:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Using SolrCloud with RDBMS or without

Dear Erick,
Thank you for you reply.
Some parts of documents come from Nutch crawler and the other parts come
from processing those documents.
I really need it to be as fast as possible and 10 hours for indexing is not
acceptable for my application.
Regards.


On Mon, May 26, 2014 at 9:25 PM, Erick Erickson <er...@gmail.com> wrote:

> What you haven't told us is where the data comes from. But until you
> put some numbers to it, it's hard to decide.
>
> I tend to prefer storing the data somewhere else, filesystem, whatever
> and indexing to Solr when data changes. Even if that means re-indexing
> the entire corpus. I don't like going to more complicated solutions
> until that proves untenable.
>
> Backup/restore solutions for filesystems, DBs, whatever are are a very
> mature technology, I rely on that first to store my original source.
>
> Now you can re-index at will.
>
> So let's claim your data comes in from some stream somewhere. I'd
> 1> store it to the file system.
> 2> write a program to pull it off the file system and index.
> 3> Your comment about MapReduceIndexerTool is germane. You can re-index
> all that data very quickly. And it'll find files on your file system
> for you too!
>
> But I wouldn't even go there until I'd tried indexing my 10M docs
> straight with SolrJ or similar. If you can index your 10M docs in 1
> hour and, by extrapolation your 100M docs in 10 hours, is that good
> enough?
> I don't know, it's your problem space after all ;). And is it
> acceptable to not see changes to the schema until tomorrow morning? If
> so, there's no need to get more complicated....
>
> Best,
> Erick
>
> On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey <so...@elyograg.org> wrote:
> > On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> >> I was wondering which scenario (or the combination) would be better
> >> for my application. From the aspect of performance, scalability and
> >> high availability. Here is my application:
> >>
> >> Suppose I am going to have more than 10m documents and it grows every
> >> day. (probably in 1 years it reaches to more than 100m docs. I want
> >> to use Solr as tool for indexing these documents but the problem is
> >> I have some data fields that could change frequently. (not too much
> >> but it could change)
> >
> > Choosing which database software to use to hold your data is a
> > problem with many possible solutions.  Everyone will have a
> > different answer for you.  Each solution has strengths and
> > weaknesses, and in the end, only you can really know what your
> > requirements are.
> >
> >> Scenarios:
> >>
> >> 1- Using SolrCloud as database for all data. (even the one that
> >> could be changed)
> >
> > If you choose to use Solr as a NoSQL, I would strongly recommend
> > that you have two Solr installs.  The first install would be purely
> > for data storage and would have no indexed fields.  If you can get
> > machines with enough RAM, it would also probably be preferable to
> > use a single index (or SolrCloud with one shard) for that install.
> > The other install would be for searching.  Sharding would not be an
> > issue on that index.  The reason that I make this recommendation is
> > that when you use Solr for searching, you have to do a complete
> > reindex if you change your search schema.  It's difficult to reindex
> > if the search index is also your canonical data source.
> >
> >> 2- Using SolrCloud as database for static data and using RDBMS
> >> (such as oracle) for storing dynamic fields.
> >
> > I don't think it would be a good idea to have two canonical data
> > sources.  Pick one.  As already mentioned, Solr is better as a
> > search technology, serving up pointers to data in another data
> > source, than as a database.
> >
> > If you want to use RDBMS technology, why would you spend all that
> > money on Oracle?  Just use one of the free databases.  Our really
> > large Solr index comes from a database.  At one time that database
> > was in Oracle.
> > When my employer purchased the company with that database, we
> > thought we were obtaining a full Oracle license.  It turns out we
> > weren't.  It would have cost about half a million dollars to buy
> > that license, so we switched to MySQL.
> >
> > Since making that move to MySQL, performance is actually *better*.
> > The source table for our data has 96 million rows right now, growing
> > at a rate of a few million per year.  This is completely in line
> > with your 100 million document requirement.  For the massive table
> > that feeds Solr, we might switch to MongoDB, but that has not been
> > decided yet.
> >
> > Later we switched from EasyAsk to Solr, a move that has *also* given
> > us better performance.  Because both MySQL and Solr are free, we've
> > achieved a substantial cost savings.
> >
> >> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce)
> >> for all data.
> >
> > I have no experience with this technology, but I think that if you
> > are thinking about a database on HDFS, you're probably actually
> > talking about HBase, the Apache implementation of Google's BigTable.
> >
> > Thanks,
> > Shawn
> >
>


--
A.Nazemian


Re: Using SolrCloud with RDBMS or without

Posted by Ali Nazemian <al...@gmail.com>.
Dear Erick,
Thank you for your reply.
Some parts of the documents come from the Nutch crawler, and the other parts
come from processing those documents.
I really need this to be as fast as possible; 10 hours for indexing is not
acceptable for my application.
Regards.


On Mon, May 26, 2014 at 9:25 PM, Erick Erickson <er...@gmail.com> wrote:

> What you haven't told us is where the data comes from. But until
> you put some numbers to it, it's hard to decide.
>
> I tend to prefer storing the data somewhere else, filesystem, whatever
> and indexing to Solr when data changes. Even if that means re-indexing
> the entire corpus. I don't like going to more complicated solutions until
> that proves untenable.
>
> Backup/restore solutions for filesystems, DBs, whatever are are a very
> mature technology, I rely on that first to store my original source.
>
> Now you can re-index at will.
>
> So let's claim your data comes in from some stream somewhere. I'd
> 1> store it to the file system.
> 2> write a program to pull it off the file system and index.
> 3> Your comment about MapReduceIndexerTool is germane. You can re-index
> all that data very quickly. And it'll find files on your file system
> for you too!
>
> But I wouldn't even go there until I'd tried
> indexing my 10M docs straight with SolrJ or similar. If you can index
> your 10M docs
> in 1 hour and, by extrapolation your 100M docs in 10 hours, is that good
> enough?
> I don't know, it's your problem space after all ;). And is it acceptable
> to not
> see changes to the schema until tomorrow morning? If so, there's no need
> to get
> more complicated....
>
> Best,
> Erick
>
> On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey <so...@elyograg.org> wrote:
> > On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> >> I was wondering which scenario (or the combination) would be better for
> my
> >> application. From the aspect of performance, scalability and high
> >> availability. Here is my application:
> >>
> >> Suppose I am going to have more than 10m documents and it grows every
> day.
> >> (probably in 1 years it reaches to more than 100m docs. I want to use
> Solr
> >> as tool for indexing these documents but the problem is I have some data
> >> fields that could change frequently. (not too much but it could change)
> >
> > Choosing which database software to use to hold your data is a problem
> > with many possible solutions.  Everyone will have a different answer for
> > you.  Each solution has strengths and weaknesses, and in the end, only
> > you can really know what your requirements are.
> >
> >> Scenarios:
> >>
> >> 1- Using SolrCloud as database for all data. (even the one that could be
> >> changed)
> >
> > If you choose to use Solr as a NoSQL, I would strongly recommend that
> > you have two Solr installs.  The first install would be purely for data
> > storage and would have no indexed fields.  If you can get machines with
> > enough RAM, it would also probably be preferable to use a single index
> > (or SolrCloud with one shard) for that install.  The other install would
> > be for searching.  Sharding would not be an issue on that index.  The
> > reason that I make this recommendation is that when you use Solr for
> > searching, you have to do a complete reindex if you change your search
> > schema.  It's difficult to reindex if the search index is also your
> > canonical data source.
> >
> >> 2- Using SolrCloud as database for static data and using RDBMS (such as
> >> oracle) for storing dynamic fields.
> >
> > I don't think it would be a good idea to have two canonical data
> > sources.  Pick one.  As already mentioned, Solr is better as a search
> > technology, serving up pointers to data in another data source, than as
> > a database.
> >
> > If you want to use RDBMS technology, why would you spend all that money
> > on Oracle?  Just use one of the free databases.  Our really large Solr
> > index comes from a database.  At one time that database was in Oracle.
> > When my employer purchased the company with that database, we thought we
> > were obtaining a full Oracle license.  It turns out we weren't.  It
> > would have cost about half a million dollars to buy that license, so we
> > switched to MySQL.
> >
> > Since making that move to MySQL, performance is actually *better*.  The
> > source table for our data has 96 million rows right now, growing at a
> > rate of a few million per year.  This is completely in line with your
> > 100 million document requirement.  For the massive table that feeds
> > Solr, we might switch to MongoDB, but that has not been decided yet.
> >
> > Later we switched from EasyAsk to Solr, a move that has *also* given us
> > better performance.  Because both MySQL and Solr are free, we've
> > achieved a substantial cost savings.
> >
> >> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for
> all
> >> data.
> >
> > I have no experience with this technology, but I think that if you are
> > thinking about a database on HDFS, you're probably actually talking
> > about HBase, the Apache implementation of Google's BigTable.
> >
> > Thanks,
> > Shawn
> >
>



-- 
A.Nazemian

Re: Using SolrCloud with RDBMS or without

Posted by Erick Erickson <er...@gmail.com>.
What you haven't told us is where the data comes from. But until
you put some numbers to it, it's hard to decide.

I tend to prefer storing the data somewhere else, filesystem, whatever
and indexing to Solr when data changes. Even if that means re-indexing
the entire corpus. I don't like going to more complicated solutions until
that proves untenable.

Backup/restore solutions for filesystems, DBs, whatever, are a very
mature technology; I rely on that first to store my original source.

Now you can re-index at will.

So let's claim your data comes in from some stream somewhere. I'd
1> store it to the file system.
2> write a program to pull it off the file system and index.
3> Your comment about MapReduceIndexerTool is germane. You can re-index
all that data very quickly. And it'll find files on your file system
for you too!

But I wouldn't even go there until I'd tried indexing my 10M docs straight
with SolrJ or similar. If you can index your 10M docs in 1 hour and, by
extrapolation, your 100M docs in 10 hours, is that good enough?
I don't know, it's your problem space after all ;). And is it acceptable to not
see changes to the schema until tomorrow morning? If so, there's no need to get
more complicated....

Best,
Erick
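
Below is a minimal sketch of steps 1> and 2> above, assuming the raw documents
have already been written to a directory on disk and that a hypothetical
parseFile() helper turns one stored file into a SolrInputDocument; the
collection URL, directory path and batch size are placeholders as well:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ReindexFromFilesystem {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/documents");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

            // Step 1 put the raw data on disk; step 2 walks it and re-indexes in batches.
            for (File file : new File("/data/documents").listFiles()) {
                batch.add(parseFile(file));
                if (batch.size() == 1000) {      // send batches, not one document per request
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();
            server.shutdown();
        }

        // Hypothetical parser; in practice it would read whatever format the crawl produced.
        private static SolrInputDocument parseFile(File file) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getName());
            doc.addField("content", "file contents go here");
            return doc;
        }
    }

Batching the adds (here 1,000 documents per request) rather than sending one
document per HTTP call is usually one of the biggest wins when indexing with
SolrJ.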

On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey <so...@elyograg.org> wrote:
> On 5/26/2014 7:50 AM, Ali Nazemian wrote:
>> I was wondering which scenario (or the combination) would be better for my
>> application. From the aspect of performance, scalability and high
>> availability. Here is my application:
>>
>> Suppose I am going to have more than 10m documents and it grows every day.
>> (probably in 1 years it reaches to more than 100m docs. I want to use Solr
>> as tool for indexing these documents but the problem is I have some data
>> fields that could change frequently. (not too much but it could change)
>
> Choosing which database software to use to hold your data is a problem
> with many possible solutions.  Everyone will have a different answer for
> you.  Each solution has strengths and weaknesses, and in the end, only
> you can really know what your requirements are.
>
>> Scenarios:
>>
>> 1- Using SolrCloud as database for all data. (even the one that could be
>> changed)
>
> If you choose to use Solr as a NoSQL, I would strongly recommend that
> you have two Solr installs.  The first install would be purely for data
> storage and would have no indexed fields.  If you can get machines with
> enough RAM, it would also probably be preferable to use a single index
> (or SolrCloud with one shard) for that install.  The other install would
> be for searching.  Sharding would not be an issue on that index.  The
> reason that I make this recommendation is that when you use Solr for
> searching, you have to do a complete reindex if you change your search
> schema.  It's difficult to reindex if the search index is also your
> canonical data source.
>
>> 2- Using SolrCloud as database for static data and using RDBMS (such as
>> oracle) for storing dynamic fields.
>
> I don't think it would be a good idea to have two canonical data
> sources.  Pick one.  As already mentioned, Solr is better as a search
> technology, serving up pointers to data in another data source, than as
> a database.
>
> If you want to use RDBMS technology, why would you spend all that money
> on Oracle?  Just use one of the free databases.  Our really large Solr
> index comes from a database.  At one time that database was in Oracle.
> When my employer purchased the company with that database, we thought we
> were obtaining a full Oracle license.  It turns out we weren't.  It
> would have cost about half a million dollars to buy that license, so we
> switched to MySQL.
>
> Since making that move to MySQL, performance is actually *better*.  The
> source table for our data has 96 million rows right now, growing at a
> rate of a few million per year.  This is completely in line with your
> 100 million document requirement.  For the massive table that feeds
> Solr, we might switch to MongoDB, but that has not been decided yet.
>
> Later we switched from EasyAsk to Solr, a move that has *also* given us
> better performance.  Because both MySQL and Solr are free, we've
> achieved a substantial cost savings.
>
>> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
>> data.
>
> I have no experience with this technology, but I think that if you are
> thinking about a database on HDFS, you're probably actually talking
> about HBase, the Apache implementation of Google's BigTable.
>
> Thanks,
> Shawn
>

Re: Using SolrCloud with RDBMS or without

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> I was wondering which scenario (or the combination) would be better for my
> application. From the aspect of performance, scalability and high
> availability. Here is my application:
> 
> Suppose I am going to have more than 10m documents and it grows every day.
> (probably in 1 years it reaches to more than 100m docs. I want to use Solr
> as tool for indexing these documents but the problem is I have some data
> fields that could change frequently. (not too much but it could change)

Choosing which database software to use to hold your data is a problem
with many possible solutions.  Everyone will have a different answer for
you.  Each solution has strengths and weaknesses, and in the end, only
you can really know what your requirements are.

> Scenarios:
> 
> 1- Using SolrCloud as database for all data. (even the one that could be
> changed)

If you choose to use Solr as a NoSQL, I would strongly recommend that
you have two Solr installs.  The first install would be purely for data
storage and would have no indexed fields.  If you can get machines with
enough RAM, it would also probably be preferable to use a single index
(or SolrCloud with one shard) for that install.  The other install would
be for searching.  Sharding would not be an issue on that index.  The
reason that I make this recommendation is that when you use Solr for
searching, you have to do a complete reindex if you change your search
schema.  It's difficult to reindex if the search index is also your
canonical data source.

> 2- Using SolrCloud as database for static data and using RDBMS (such as
> oracle) for storing dynamic fields.

I don't think it would be a good idea to have two canonical data
sources.  Pick one.  As already mentioned, Solr is better as a search
technology, serving up pointers to data in another data source, than as
a database.

If you want to use RDBMS technology, why would you spend all that money
on Oracle?  Just use one of the free databases.  Our really large Solr
index comes from a database.  At one time that database was in Oracle.
When my employer purchased the company with that database, we thought we
were obtaining a full Oracle license.  It turns out we weren't.  It
would have cost about half a million dollars to buy that license, so we
switched to MySQL.

Since making that move to MySQL, performance is actually *better*.  The
source table for our data has 96 million rows right now, growing at a
rate of a few million per year.  This is completely in line with your
100 million document requirement.  For the massive table that feeds
Solr, we might switch to MongoDB, but that has not been decided yet.

Later we switched from EasyAsk to Solr, a move that has *also* given us
better performance.  Because both MySQL and Solr are free, we've
achieved a substantial cost savings.

> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
> data.

I have no experience with this technology, but I think that if you are
thinking about a database on HDFS, you're probably actually talking
about HBase, the Apache implementation of Google's BigTable.

Thanks,
Shawn