Posted to common-user@hadoop.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2010/01/20 20:02:38 UTC

Re: Data currently stored in Solr index. Should it be moved to HDFS?

Hello,

Reading large result sets out of Solr is not how we typically advise people to use Solr. It's not designed for that (nor is Lucene, the search library at its core).  There is some work being done right now on getting Solr better at retrieving large result sets, but my feeling is you'd be better off avoiding Solr and getting data to your MR jobs from files stored in HDFS.
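For what it's worth, one common way to consume files from HDFS in parallel is Hadoop Streaming, where the mapper and reducer are plain scripts reading stdin and writing stdout. A minimal sketch in Python (the column layout is made up for illustration, not anything from your data):

```python
def map_line(line):
    """Mapper: emit (key, 1) pairs from one tab-separated record.
    Hypothetical layout: column 0 = doc id, column 1 = the field
    we want to aggregate on."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) >= 2:
        yield parts[1], 1

def reduce_pairs(pairs):
    """Reducer: sum counts per key. In a real job, Hadoop sorts and
    groups the mapper output by key before it reaches the reducer."""
    totals = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    return totals

# In an actual streaming job, the mapper script would loop over
# sys.stdin and print "key\t1" lines; a companion reducer script
# would read the sorted lines back. Hadoop handles the shuffle
# in between.
```

You'd launch it with the streaming jar that ships in contrib/, something along the lines of `hadoop jar hadoop-*-streaming.jar -input /docs -output /out -mapper mapper.py -reducer reducer.py` (exact jar name and paths depend on your install).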

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: "Ranganathan, Sharmila" <sr...@library.rochester.edu>
> To: common-user@hadoop.apache.org
> Sent: Tue, January 19, 2010 5:15:36 PM
> Subject: Data currently stored in Solr index. Should it be moved to HDFS? 
> 
> Hi,
> 
> 
> 
> Our application stores GBs of data in a Lucene/Solr index. It reads from
> the Solr index, does some processing on the data, and stores the results
> back in Solr as an index, so that faceted search is possible.  The
> process of reading from Solr, processing the data, and writing back to
> the index is very slow, so we are looking at parallel programming
> frameworks. Hadoop MapReduce seems to take its input from files and
> write its output to files. Since our data is in a Solr index, should we
> read the data from the index, convert it to a file, and send that as
> input to Hadoop, then read Hadoop's output file and write the results
> back to the index? That read and write to the index will still be time
> consuming if not run in parallel. Or should we get rid of the Solr index
> and just store the data in HDFS?  Also, the index is stored in one
> folder, which means one disk; we do not use multiple disks. Is the use
> of multiple disks a must for Hadoop?
> 
> 
> 
> I am new to Hadoop and trying to figure out whether Hadoop is the
> solution for our application.
> 
> 
> 
> Thanks
> 
> SR
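If the data has to stay in Solr for faceting, the pipeline described in the quoted question (Solr -> file -> Hadoop -> Solr) usually starts by paging documents out of Solr with a client library and flattening each one to a tab-separated line before copying the file into HDFS. A sketch of just the flattening step (field names and the multi-value separator are illustrative, not part of any Solr API):

```python
def doc_to_tsv(doc, fields):
    """Flatten one Solr document (a dict of field -> value) to a
    tab-separated line, so that one document stays on one line.
    Multi-valued fields are joined with '|'; embedded tabs and
    newlines are squashed to spaces."""
    cells = []
    for field in fields:
        value = doc.get(field, "")
        if isinstance(value, list):  # multi-valued Solr field
            value = "|".join(str(v) for v in value)
        cells.append(str(value).replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)
```

After writing one line per document you'd push the file into HDFS with `hadoop fs -put docs.tsv /input/` and point the job at /input; the reverse step would parse the job's output lines and re-index them through your usual Solr update path.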


Re: Data currently stored in Solr index. Should it be moved to HDFS?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hm, yes.  See how few hits this shows:


  http://search-hadoop.com/?q=non-distributed&fc_project=Hadoop

You can set it up on one box, but that's really only useful for development.
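For development, the single-node ("pseudo-distributed") setup mostly comes down to pointing conf/core-site.xml at a local NameNode, something like:

```xml
<!-- conf/core-site.xml: single-node (pseudo-distributed) setup -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

plus the matching one-liners in hdfs-site.xml and mapred-site.xml described in the Hadoop quickstart docs.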
 
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: "Ranganathan, Sharmila" <sr...@library.rochester.edu>
> To: common-user@hadoop.apache.org
> Sent: Wed, January 20, 2010 3:23:34 PM
> Subject: RE: Data currently stored in Solr index. Should it be moved to HDFS?
> 
> Thanks for your reply. Is Hadoop only for distributed applications? 
> 
> 
> [earlier messages snipped; quoted in full above]


RE: Data currently stored in Solr index. Should it be moved to HDFS?

Posted by "Ranganathan, Sharmila" <sr...@library.rochester.edu>.
Thanks for your reply. Is Hadoop only for distributed applications? 


[earlier messages snipped; quoted in full above]