Posted to solr-user@lucene.apache.org by Robert Hume <rh...@gmail.com> on 2017/02/22 00:57:13 UTC

Question about best way to architect a Solr application with many data sources

To learn how to properly use Solr, I'm building a little experimental
project with it to search for used car listings.

Car listings appear in a variety of different places ... central sites like
Craigslist and also many, many individual used-car dealership websites.

I am wondering, should I:

(a) deploy a Solr search engine and build individual indexers for every
type of web site I want to find listings on?

or

(b) build my own database to store car listings, and then build services
that scrape data from different sites and feed entries into the database;
then point my Solr search to my database, one simple source of listings?

My concerns are:

With (a) ... I have to be smart enough to understand all those different
data sources and remove/update listings when they change; will this be
harder to do with custom Solr indexers than writing something from scratch?

With (b) ... I'm maintaining a huge database of all my listings, which seems
redundant; Google doesn't make a *copy* of everything on the internet, it
just knows it's there.  Is maintaining my own database a bad design?

Thanks for reading!

Re: Question about best way to architect a Solr application with many data sources

Posted by Walter Underwood <wu...@wunderwood.org>.
Awesome advice. flat=fast in Solr.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Question about best way to architect a Solr application with many data sources

Posted by Robert Hume <rh...@gmail.com>.
Thanks for that!  I was thinking (B) too, but wanted guidance that I'm
using the tool correctly.

Am still interested in hearing opinions from others, thanks!

rh


Re: Question about best way to architect a Solr application with many data sources

Posted by David Hastings <DH...@wshein.com>.
And not to sound redundant, but if you ever need help, database programmers are a dime a dozen; good luck finding Solr developers available freelance at a price you're willing to pay. If you can do the Solr side, anyone else who does web dev can do the SQL.


Re: Question about best way to architect a Solr application with many data sources

Posted by Joel Bernstein <jo...@gmail.com>.
Alfresco has spent ten+ years building a content management system that
follows this basic design:

1) Original bytes (PDF, Word Doc, image file) are stored in a filesystem
based content store.
2) Meta-data is stored in a relational database, normalized.
3) Content is transformed to text, and meta-data is de-normalized and
sent to Solr for indexing.
4) Solr keeps a copy of the de-normalized, pre-analyzed content on disk
next to the indexes for re-indexing and other purposes.
5) Solr analyzes and indexes the content.

This all happens automatically when the content is added to Alfresco. ACL
lists are also stored along with documents and passed to Solr to support
document level access control during the search.
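Step 3 of the pipeline above could be sketched roughly as follows. This is a minimal illustration, not Alfresco's actual code; the table and field names are hypothetical:

```python
# Hypothetical sketch: de-normalize rows from separate relational tables
# into one flat document per content item, the shape Solr indexes best.
def denormalize(item, author, tags):
    """Merge normalized rows (plain dicts here) into a single flat Solr doc."""
    return {
        "id": item["id"],
        "title": item["title"],
        "content_text": item["extracted_text"],  # text transformed from the original bytes
        "author_name": author["name"],           # copied in from the authors table
        "tags": [t["label"] for t in tags],      # multi-valued field
    }

doc = denormalize(
    {"id": "doc-1", "title": "Q3 report", "extracted_text": "..."},
    {"name": "J. Smith"},
    [{"label": "finance"}, {"label": "2017"}],
)
# `doc` is now the flat JSON shape you would send to Solr's /update handler.
```

The join happens once at index time, so queries never have to reassemble the relational structure.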




Joel Bernstein
http://joelsolr.blogspot.com/


Re: Question about best way to architect a Solr application with many data sources

Posted by Tim Casey <tc...@gmail.com>.
I would possibly extend this a bit further.  There is the source, then the
'normalized' version of the data, then the indexed version.
Sometimes you realize you missed something in the normalized view and you
have to go back to the actual source.

This becomes more likely as the number of data sources grows.  I would
expect the "DB" version of the data to be the normalized view.
It is also possible that the DB holds the raw bytes of the source, which are
then transformed into a normalized view.  Indexing always happens from
the normalized view.  In this scheme, there is frequently a way to mark
what failed normalization so you can go back and recapture the data for a
re-index.

Also, if you are dealing with timely data, being able to reindex helps
remove stale information from the search index.  In the pipeline of
captured source -> normalized -> analyzed -> information, where analyzed is
indexed here, what you do with the data over a year or more becomes part of
the thinking.
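The "mark what failed normalization" idea could look something like this. A hedged sketch only; the record fields and the failure rule are made up for illustration:

```python
# Hypothetical sketch: push raw scraped records through normalization,
# keeping a failure list so the stored source can be recaptured and
# normalized again before the next re-index.
def normalize(raw):
    """Turn one raw scraped record into the normalized view; raise on bad data."""
    if "price" not in raw:
        raise ValueError("missing price")
    return {"id": raw["id"], "price": int(raw["price"])}

def run_pipeline(raw_records):
    normalized, failed = [], []
    for raw in raw_records:
        try:
            normalized.append(normalize(raw))
        except (ValueError, KeyError) as exc:
            # Record the failure instead of silently dropping it, so we can
            # go back to the source and recapture it for a later re-index.
            failed.append({"id": raw.get("id"), "error": str(exc)})
    return normalized, failed

ok, bad = run_pipeline([
    {"id": "a1", "price": "4500"},
    {"id": "a2"},  # fails normalization: no price
])
```

Only `ok` goes to Solr; `bad` drives the recapture loop.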




Re: Question about best way to architect a Solr application with many data sources

Posted by Walter Underwood <wu...@wunderwood.org>.
Reindexing is exactly why you want the Single Source of Truth to be in a repository outside of Solr.

For our slowly-changing data sets, we have an intermediate JSONL batch. That is created from the source repositories and saved in Amazon S3. Then we load it into Solr nightly. That allows us to reload whenever we need to, like loading prod data in test or moving search to a different Amazon region.
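An intermediate JSONL batch like the one described is easy to sketch: one JSON document per line, written once from the source repository and re-loadable whenever needed. The field names below are hypothetical, and the actual HTTP POST to Solr is omitted:

```python
import json

# Hypothetical sketch of a JSONL batch: one JSON doc per line, so the
# batch can be streamed, stored (e.g. in S3), and reloaded at any time.
docs = [
    {"id": "car-1", "make": "Honda", "model": "Civic", "price": 4500},
    {"id": "car-2", "make": "Ford", "model": "Focus", "price": 3900},
]

with open("listings.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Nightly load: stream the file back and collect docs for Solr's /update
# handler (the POST itself is left out of this sketch).
with open("listings.jsonl") as f:
    batch = [json.loads(line) for line in f]
```

Because the batch is a plain file, "reload prod data in test" is just running the loader against a different Solr URL.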

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Question about best way to architect a Solr application with many data sources

Posted by Erick Erickson <er...@gmail.com>.
Dave:

Oh, I agree that a DB is a perfectly valid place to store the data and
you're absolutely right that it allows better interaction than flat
files; you can ask questions of an RDBMS that you can't easily ask the
disk ;). Storing to disk is an alternative if you're unwilling to deal
with a DB is all.

But the main point is you'll change your schema sometime and have to
re-index. Having the data you're indexing stored locally in whatever
form will allow much faster turn-around rather than re-crawling. Of
course it'll result in out of date data so you'll have to refresh
somehow sometime.

Erick


Re: Question about best way to architect a Solr application with many data sources

Posted by Dave <ha...@gmail.com>.
Ha, I think I went to one of your training seminars in NYC maybe 4 years ago, Erick. I'm going to have to respectfully disagree about the RDBMS.  It's such a well-known data format that you could hire a high school programmer to help with the DB end if you knew how to flatten it to Solr. Besides, it's easy to visualize and interact with the data before it goes to Solr. A JSON/NoSQL format would work just as well, but I really think a database has its place in a scenario like this.


Re: Question about best way to architect a Solr application with many data sources

Posted by Erick Erickson <er...@gmail.com>.
I'll add that I _guarantee_ you'll want to re-index the data as you
change your schema
and the like. You'll be able to do that much more quickly if the data
is stored locally somehow.

A RDBMS is not necessary however. You could simply store the data on
disk in some format
you could re-read and send to Solr.

Best,
Erick
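The store-on-disk alternative could be sketched like this. A minimal illustration assuming JSON files and hypothetical field names; the point is that a schema change only means re-reading local files, never re-crawling:

```python
import json
import pathlib

# Hypothetical sketch: keep every scraped record on disk so a schema
# change only requires re-reading local files, not re-crawling sources.
store = pathlib.Path("raw_store")
store.mkdir(exist_ok=True)

def save_raw(record):
    """Persist one raw record as a JSON file keyed by its id."""
    (store / f"{record['id']}.json").write_text(json.dumps(record))

def reindex(to_solr_doc):
    """Re-read every stored record and map it with the *current* schema."""
    return [to_solr_doc(json.loads(p.read_text()))
            for p in sorted(store.glob("*.json"))]

save_raw({"id": "car-1", "make": "Honda", "price": "4500"})

# After a schema change, just supply a new mapping function and resend
# the result to Solr:
docs = reindex(lambda r: {"id": r["id"], "make_s": r["make"],
                          "price_i": int(r["price"])})
```

Any durable format works (an RDBMS, flat files, an object store); the mapping function is the only piece that changes with the schema.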


Re: Question about best way to architect a Solr application with many data sources

Posted by Dave <ha...@gmail.com>.
B is a better option long term. Solr is meant for retrieving flat data, fast, not hierarchical. That's what a database is for, and trust me, you would rather have a real database at the end point.  Each tool has a purpose: Solr can never replace a relational database, and a relational database could not replace Solr. Start with the slow model (database) for control/display and enhance with the fast model (Solr) for retrieval/search.
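The "flat, not hierarchical" point can be made concrete with a small sketch. The listing/dealer tables and field names here are hypothetical, but the shape of the transformation is the key idea: the database keeps the relational structure, while each Solr document folds the join into one flat record:

```python
# Hypothetical sketch of flattening the relational model for Solr: the
# DB stores listings and dealers in separate tables; the Solr document
# copies the dealer fields in (de-normalizes) instead of joining at
# query time.
def flatten(listing, dealer):
    return {
        "id": listing["id"],
        "make": listing["make"],
        "model": listing["model"],
        "price": listing["price"],
        "dealer_name": dealer["name"],  # folded in from the dealers table
        "dealer_city": dealer["city"],
    }

flat = flatten(
    {"id": "car-1", "make": "Honda", "model": "Civic",
     "price": 4500, "dealer_id": 7},
    {"id": 7, "name": "Main St Motors", "city": "Albany"},
)
```

Updates flow one way: edit the row in the database (the slow model), then re-flatten and resend that document to Solr (the fast model).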


