Posted to users@solr.apache.org by Srijan <sh...@gmail.com> on 2022/04/04 11:52:54 UTC

Solr as a dedicated data store?

Hi All,

I am working on designing a Solr based enterprise search solution. One
requirement I have is to track crawled data from various different data
sources with metadata like crawled date, indexing status and so on. I am
looking into using Solr itself as my data store and not adding a separate
database to my stack. Has anyone used Solr as a dedicated data store? How
did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
DB - can someone here share some insight into how Fusion is using this
'DB'? My store will need to track millions of objects and be able to handle
parallel adds/updates. Do you think Solr is a good tool for this or am I
better off depending on a database service?

Thanks a bunch.

Re: Solr as a dedicated data store?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/7/2022 8:03 PM, Dave wrote:
> I seem to recall hearing that this was
> actually enforced by the code but I didn't find the check on a quick look
> through the code

Lucene began recording the version that writes a segment at some point
in 6.x; I have no idea which specific release.

I know that 8.x will refuse to open an index if the recorded version is
anything in 6.x, or if there is no version in the index at all. I haven't
heard about 7.x doing anything similar, so it was probably a 6.x release
after 6.0 that first started recording the version.

Upgrading more than one major release was never guaranteed before this 
change, but now it is enforced.
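
If you hit that wall, Lucene fails fast with an explicit exception rather
than silently corrupting anything. A minimal sketch of what that looks like
(assuming Lucene 8.x on the classpath; the index path is hypothetical):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexFormatTooOldException;
    import org.apache.lucene.store.FSDirectory;

    public class OpenOldIndex {
        public static void main(String[] args) throws Exception {
            // Point this at a core's data/index directory (path is hypothetical).
            try (FSDirectory dir = FSDirectory.open(
                     Paths.get("/var/solr/data/core1/data/index"));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                System.out.println("Opened index, " + reader.numDocs() + " docs");
            } catch (IndexFormatTooOldException e) {
                // Thrown when a segment was written by a release too old for the
                // running Lucene version, e.g. 6.x segments under 8.x.
                System.err.println("Index must be rebuilt: " + e.getMessage());
            }
        }
    }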

Thanks,
Shawn


Re: Solr as a dedicated data store?

Posted by dmitri maziuk <dm...@gmail.com>.
On 2022-04-08 7:36 PM, James Greene wrote:
> I think you are speaking to the point that the requirement to have all your
> data rebuildable from source isn't a hard requirement as there are ways to
> re-index without having access to the original source (you still need the
> full docs stored in solr just not indexed). By looking at solr from that
> pov it becomes more approachable as a primary data store.

I may have a different definition of primary data store, one in which 
it's a store for primary data.

Dima

Re: Solr as a dedicated data store?

Posted by David Hastings <ha...@gmail.com>.
As long as your documents are simple in structure (a key value or an array
for any given field), you’re good to go. Anything multi-level, you’re out of
luck. Not sure how relevant this link still is, but:
https://stackoverflow.com/questions/22192904/is-solr-support-complex-types-like-structure-for-multivalued-fields


It’s from 2017, but I believe it still holds true. However, there are
possibilities with nested documents:
https://solr.apache.org/guide/8_1/indexing-nested-documents.html

Admittedly I have not gotten too in depth myself with child documents for
more complex data structures. And yeah, you could just store the complex
data structure as JSON in a single large stored, non-indexed text field and
only index what you will be searching on.
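
For illustration, a minimal SolrJ sketch of that JSON-blob approach. It
assumes SolrJ 8.x and a schema where payload_json is declared with
stored=true and indexed=false; the field, id, and collection names are all
hypothetical:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class StoreJsonBlob {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("title", "Quarterly report");   // indexed, searchable
                // The complex structure, serialized once; stored for retrieval,
                // never analyzed or searched.
                doc.addField("payload_json",
                    "{\"sections\":[{\"name\":\"intro\",\"pages\":[1,2]}]}");
                client.add("mycollection", doc);
                client.commit("mycollection");
            }
        }
    }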

Another option I’ve experimented with is two completely different cores, or
even completely different Solr servers (I use standalone a lot): use one for
searching, and use the result to pull the raw data from the other "storage
server" by an identifier.  This is actually surprisingly fast.
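
A hedged sketch of that two-server pattern, with hypothetical URLs, core
names and fields; getById goes through real-time get, which is part of why
the second hop stays cheap:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SearchThenFetch {
        public static void main(String[] args) throws Exception {
            try (SolrClient search =
                     new Http2SolrClient.Builder("http://localhost:8983/solr").build();
                 SolrClient storage =
                     new Http2SolrClient.Builder("http://localhost:8984/solr").build()) {
                SolrQuery q = new SolrQuery("title:report");
                q.setFields("id");               // the search core only returns ids
                QueryResponse rsp = search.query("search_core", q);
                for (SolrDocument hit : rsp.getResults()) {
                    String id = (String) hit.getFieldValue("id");
                    // Pull the raw payload from the "storage server" by identifier.
                    SolrDocument full = storage.getById("storage_core", id);
                    System.out.println(full.getFieldValue("payload_json"));
                }
            }
        }
    }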

It’s a hack, and you’re using the wrong tool for the job, but it can be done
if you REALLY want to and get creative.

Good luck. Curious to hear what you come up with
-dave


On Fri, Apr 8, 2022 at 8:36 PM James Greene <ja...@jamesaustingreene.com>
wrote:

> I think you are speaking to the point that the requirement to have all your
> data rebuildable from source isn't a hard requirement as there are ways to
> re-index without having access to the original source (you still need the
> full docs stored in solr just not indexed). By looking at solr from that
> pov it becomes more approachable as a primary data store.
>
> On Fri, Apr 8, 2022, 1:53 PM dmitri maziuk <dm...@gmail.com>
> wrote:
>
> > On 2022-04-07 11:51 PM, Shawn Heisey wrote:
> > ...
> > > As I understand it, ES offers reindex capability by storing the entire
> > > input document into a field in the index.  Which means that the index
> > > will be a lot bigger than it needs to be, which is going to affect
> > > performance.  If the field is not indexed, then the performance impact
> > > may not be huge, but it will not be zero.  And it wouldn't really
> > > improve the speed of a full reindex, it just makes it possible to do a
> > > reindex without an external data source.
> > >
> > > The same thing can be done with Solr, and it is something I would
> > > definitely say needs to be part of any index design where Solr will be
> a
> > > primary data store.  That capability should be available in Solr, but I
> > > do not think it should be enabled by default.
> > >
> > What would be the advantage over dumping the documents into a text file
> > (xml, json) and doing a full re-import? In principle you could dump
> > everything Solr needs into the file and only check if it's all there
> > during the import; that plus the protocol overhead would be the only
> > downside. And deleting the existing index will take a little extra time.
> >
> > The upside is we can stick the files into git and have versions; it
> > should compress really well, we can clone it to off-site storage etc.
> etc.
> >
> > Dima
> >
>

Re: Solr as a dedicated data store?

Posted by James Greene <ja...@jamesaustingreene.com>.
I think you are speaking to the point that the requirement to have all your
data rebuildable from source isn't a hard requirement as there are ways to
re-index without having access to the original source (you still need the
full docs stored in solr just not indexed). By looking at solr from that
pov it becomes more approachable as a primary data store.

On Fri, Apr 8, 2022, 1:53 PM dmitri maziuk <dm...@gmail.com> wrote:

> On 2022-04-07 11:51 PM, Shawn Heisey wrote:
> ...
> > As I understand it, ES offers reindex capability by storing the entire
> > input document into a field in the index.  Which means that the index
> > will be a lot bigger than it needs to be, which is going to affect
> > performance.  If the field is not indexed, then the performance impact
> > may not be huge, but it will not be zero.  And it wouldn't really
> > improve the speed of a full reindex, it just makes it possible to do a
> > reindex without an external data source.
> >
> > The same thing can be done with Solr, and it is something I would
> > definitely say needs to be part of any index design where Solr will be a
> > primary data store.  That capability should be available in Solr, but I
> > do not think it should be enabled by default.
> >
> What would be the advantage over dumping the documents into a text file
> (xml, json) and doing a full re-import? In principle you could dump
> everything Solr needs into the file and only check if it's all there
> during the import; that plus the protocol overhead would be the only
> downside. And deleting the existing index will take a little extra time.
>
> The upside is we can stick the files into git and have versions; it
> should compress really well, we can clone it to off-site storage etc. etc.
>
> Dima
>

Re: Solr as a dedicated data store?

Posted by dmitri maziuk <dm...@gmail.com>.
On 2022-04-07 11:51 PM, Shawn Heisey wrote:
...
> As I understand it, ES offers reindex capability by storing the entire 
> input document into a field in the index.  Which means that the index 
> will be a lot bigger than it needs to be, which is going to affect
> performance.  If the field is not indexed, then the performance impact 
> may not be huge, but it will not be zero.  And it wouldn't really 
> improve the speed of a full reindex, it just makes it possible to do a 
> reindex without an external data source.
> 
> The same thing can be done with Solr, and it is something I would 
> definitely say needs to be part of any index design where Solr will be a 
> primary data store.  That capability should be available in Solr, but I 
> do not think it should be enabled by default.
>
What would be the advantage over dumping the documents into a text file 
(xml, json) and doing a full re-import? In principle you could dump 
everything Solr needs into the file and only check if it's all there 
during the import; that plus the protocol overhead would be the only 
downside. And deleting the existing index will take a little extra time.

The upside is we can stick the files into git and have versions; it
should compress really well, and we can clone it to off-site storage etc. etc.
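
For illustration, a rough SolrJ sketch of such a dump using cursor paging
(the collection name and output file are hypothetical):

    import java.io.BufferedWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;
    import org.apache.solr.common.util.Utils;

    public class DumpCollection {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new Http2SolrClient.Builder("http://localhost:8983/solr").build();
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("dump.json"))) {
                SolrQuery q = new SolrQuery("*:*");
                q.setRows(1000);
                q.setSort("id", SolrQuery.ORDER.asc);   // cursors need a unique sort
                String cursor = CursorMarkParams.CURSOR_MARK_START;
                while (true) {
                    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                    QueryResponse rsp = client.query("mycollection", q);
                    for (SolrDocument doc : rsp.getResults()) {
                        out.write(Utils.toJSONString(doc));  // serialize each doc
                    }
                    String next = rsp.getNextCursorMark();
                    if (cursor.equals(next)) break;   // cursor stopped moving: done
                    cursor = next;
                }
            }
        }
    }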

Dima

Re: Solr as a dedicated data store?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/7/2022 8:41 PM, James Greene wrote:
> This is actually why people abandon solr for elastic/opensearch. Solr's core
> contributors place little value on supporting migration paths and stability
> within, so it's always a heavy cost to users for upgrades.

At a fundamental level, because ES and Solr both use Lucene for the vast 
majority of their functionality, there are not a lot of differences 
between what each of them is capable of doing.  The main differences are 
in what each of them does out of the box. Solr is geared towards maximum 
capability and flexibility, and the sheer number of things that can be 
configured is overwhelming for a novice.  ES focuses more on a "typical 
user" sort of audience. If you want to really dive into things, they 
offer some of the same things that Solr does, but those advanced 
features are not staring you in the face when you take a look at a 
default config.

As I understand it, ES offers reindex capability by storing the entire 
input document into a field in the index.  Which means that the index 
will be a lot bigger than it needs to be, which is going to affect
performance.  If the field is not indexed, then the performance impact 
may not be huge, but it will not be zero.  And it wouldn't really 
improve the speed of a full reindex, it just makes it possible to do a 
reindex without an external data source.

The same thing can be done with Solr, and it is something I would 
definitely say needs to be part of any index design where Solr will be a 
primary data store.  That capability should be available in Solr, but I 
do not think it should be enabled by default.
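
A sketch of what baking that into the index design could look like via the
Schema API; the _src_ name follows the ref guide's convention for a stored
source field, and the rest is illustrative:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.client.solrj.request.schema.SchemaRequest;

    public class AddSourceField {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
                // A catch-all field for the raw input document: stored so it can
                // be read back for reindexing, never indexed or faceted.
                Map<String, Object> field = new LinkedHashMap<>();
                field.put("name", "_src_");
                field.put("type", "string");
                field.put("stored", true);
                field.put("indexed", false);
                field.put("docValues", false);
                new SchemaRequest.AddField(field).process(client, "mycollection");
            }
        }
    }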

I would love to explore the possibilities in making it possible to have 
a much more streamlined config system for Solr.  Something that would 
make it a lot easier for a novice to get started, without making things 
really difficult for an advanced user to create more complex configurations.

Thanks,
Shawn


Re: Solr as a dedicated data store?

Posted by James Greene <ja...@jamesaustingreene.com>.
I meant only to encourage a focus on stability between releases and offering
migration path options. I AM a fanboy of technology that offers an easier
path of adoption/maintainability than its competitors.

On Thu, Apr 7, 2022, 11:11 PM Gus Heck <gu...@gmail.com> wrote:

> It's not shocking that there are differences among products. If that
> feature is your favorite, use elastic. There are other features... and
> licensing which matters to some. Amazon's effort is interesting, but will
> it persist? When Oracle bought MySQL AB, a site named dorsal source dot org
> (don't go there; it's now inhabited by an attack site AFAICT, but you can
> see it on wayback machine ~2008) sprang up in response (a friend of mine
> was involved). Granted it was not backed by a big company, but it was
> useful for a while. Even big companies may change priorities over time and
> sunset things. Open source projects can be archived too, but Lucene and
> Solr are among the most active, so that is clearly not a near term risk.
> Your tone, however, sounds a bit fanboyish, and sounds a bit like you forget
> that the folks who maintain solr are all volunteers. If you see things that
> need fixing or improving, or want to argue for change without disparaging
> comments, we certainly welcome your input (and your code if you are so
> inclined).
>
> -Gus
>
> On Thu, Apr 7, 2022 at 10:41 PM James Greene <ja...@jamesaustingreene.com>
> wrote:
>
> > > so that we can be free to make improvements without having to carry an
> > ever growing weight of back compatibility
> >
> >
> > This is actually why people abandon solr for elastic/opensearch. Solr's
> > core contributors place little value on supporting migration paths and
> > stability within, so it's always a heavy cost to users for upgrades.
> >
> > Very few people think solr is stable between upgrades (anyone?
> Bueller....
> > anyone?). This means  you need to plan for the migration of data
> > (time/storage) between upgrades.  This doesn't mean you need to reindex
> > from source (you will be reindexing) it means you cannot get more/new
> data
> > from source that you didn't include in your original document when
> > indexing.  There are strategies for storing full "source documents"
> without
> > having them indexed that allow you to re-index from the stored document
> > (non-indexed fields) without requiring you to have a totally separate
> > persistence layer.
> >
> >
> >
> > On Thu, Apr 7, 2022, 10:03 PM Dave <ha...@gmail.com> wrote:
> >
> > > This is one of the most interesting and articulate emails I’ve read
> about
> > > the fundamentals in a long time. Saving this one :)
> > >
> > > > On Apr 7, 2022, at 9:32 PM, Gus Heck <gu...@gmail.com> wrote:
> > > >
> > > > Solr is not a "good" primary data store. Solr is built for finding
> > your
> > > > documents, not storing them. A good primary data store holds stuff
> > > > indefinitely without adding weight and without changing regularly,
> Solr
> > > > doesn't fit that description. One of the biggest reasons for this is
> > that
> > > > at some point you'll want to upgrade to the latest version, and we
> only
> > > > support a single interim upgrade. So from 6 to 7 or 7 to 8 etc...
> > > multiple
> > > > step upgrades 6 to 7 to 8 may fail. I seem to recall hearing that
> this
> > > was
> > > > actually enforced by the code but I didn't find the check on a quick
> > look
> > > > through the code (doesn't mean it isn't there, just I didn't find
> it).
> > In
> > > > any case, multi-version upgrades are not generally supported,
> > > intentionally
> > > > so that we can be free to make improvements without having to carry
> an
> > > ever
> > > > growing weight of back compatibility. Typically if new index features
> > are
> > > > developed and you want to use them (like when doc values were
> > introduced)
> > > > you will need to re-index to use the new feature. Search engines
> > > > precalculate and write typically denormalized or otherwise processed
> > > > information into the index prioritizing speed of retrieval over space
> > and
> > > > long term storage. As others have mentioned, there is also the ever
> > > > changing requirements problem. Typically someone in product
> management
> > or
> > > > if you are unlucky, the CEO hears of something cool someone did with
> > solr
> > > > and says: Hey, let's do that too! I bet it would really draw
> customers
> > > > in!... 9 times out of 10 the new thing involves changing the way
> > > something
> > > > is analyzed, or adding a new analysis of previously ingested data. If
> > you
> > > > can't reindex you have to be able to say "no, not on old data" and
> > > possibly
> > > > say "we'll need a separate collection for the new data and it will be
> > > > difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
> > > >
> > > > The ability to add fields to documents is much more for things like
> > > adding
> > > > searchable geo-located gps coordinates for documents that have a
> > location
> > > > or metadata like you mention to a document than for storing the
> > document
> > > > content itself. It is *possible* to have self re-indexing documents
> > that
> > > > contain all the data needed to repeat indexing but it takes a lot of
> > > space
> > > > and slows down your index. Furthermore it requires that all
> > > indexing
> > > > enrichment/cleaning/etc be baked inside solr using
> > > > updateProcessorFactories... which in turn makes all that indexing
> work
> > > > compete more heavily with search queries... or alternately requires
> > that
> > > > the data has to be queried out and inserted back in after external
> > > > processing, which is also going to compete with user queries (so
> maybe
> > > one
> > > > winds up fielding extra hardware or even 2 clusters - twice as many
> > > > machines - and swapping clusters back and forth periodically, now
> it's
> > > > complex, expensive and has a very high index latency instead of slow
> > > > queries... no free lunch there) Trying to store the original data
> just
> > > > complicates matters. Keeping it simple and using solr to find things
> > that
> > > > are then served from a primary source is really the best place to
> > start.
> > > >
> > > > So yeah, you *could* use it as a primary store with work and
> acceptance
> > > of
> > > > limitations, but you have to be aware of what you are doing, and
> have a
> > > > decently working crystal ball. I never advise clients to do this
> > because
> > > I
> > > > prefer happy clients that say nice things about me :) So my advice to
> > you
> > > > is don't do it unless there is an extremely compelling reason.
> > > >
> > > > Assuming you're not dealing with really massive amounts of data, just
> > > > indexing some internal intranet (and it's not something the size of
> > apple
> > > > or google), then for your use case, crawling pages, I'd have the
> > crawler
> > > > drop anything it finds and considers worthy of indexing to a
> filesystem
> > > > (maybe 2 files, the content and a file with metadata like the link
> > where
> > > it
> > > > was found), have a separate indexing process scan the filesystem
> > > > periodically, munge it for metadata or whatever other manipulations
> are
> > > > useful and then write the result to solr. If the crawl store is
> > designed
> > > so
> > > > the same document always lands in the same location and you don't
> have
> > to
> > > > worry about growth other than the growth of the site(s) you are
> > indexing.
> > > > There are ways to improve on things from there such as adding a kafka
> > > > instance for a topic that identifies newly fetched docs to prevent
> (or
> > > > augment) the periodic scanning. Also storing a hash of the content
> in a
> > > > database to let the indexer ignore when the crawler simply downloaded
> > the
> > > > same bytes, cause nothing's changed...
> > > >
> > > > And you'll want to decide if you want to remove references to pages
> > that
> > > > disappeared or detect moves/renames vs deletions which is a whole
> thing
> > > of
> > > > its own...
> > > >
> > > > My side project JesterJ.org <https://www.JesterJ.org> provides a
> good
> > > deal
> > > > of the indexer features I describe (but it still needs a kafka
> > connector,
> > > > contributions welcome :) ). Some folks have used it profitably, but
> > it's
> > > > admittedly still rough, and the current master is much better than
> the
> > > now
> > > > ancient, last released beta (which probably should have been an alpha
> > but
> > > > oh well :)
> > > >
> > > > -Gus
> > > >
> > > >> On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <
> > > dominique.bejean@eolya.fr>
> > > >> wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> A best practice for performances and ressources usage is to store
> > and/or
> > > >> index and/or docValues only data required for your search features.
> > > >> However, in order to implement or modify new or existing features in
> > an
> > > >> index you will need to reindex all the data in this index.
> > > >>
> > > >> I propose 2 solutions :
> > > >>
> > > >>   - The first one is to store the full original JSON data into the
> > _str_
> > > >>   fields of the index.
> > > >>
> > > >>
> > > >>
> > >
> >
> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
> > > >>
> > > >>
> > > >>   - The second and the best solution in my opinion is to store the
> > JSON
> > > >>   data into an intermediate feature neutral data store as a file
> > simple
> > > >> file
> > > >>   system or better a MongoDB database. This way will allow you to
> use
> > > your
> > > >>   data in several indexes (one index for search, one index for
> > > suggesters,
> > > >>   ...)  without duplicating data into _src_ fields in each index. A
> > uuid
> > > >> in
> > > >>   each index will allow you to get the full JSON object in MongoDB.
> > > >>
> > > >>
> > > >> Obviously a key point is the backup strategy of your data store
> > > according
> > > >> to the solution you choose : either Solr indexes or the file system
> or
> > > the
> > > >> MongoDB database.
> > > >>
> > > >> Dominique
> > > >>
> > > >>> On Mon, Apr 4, 2022 at 1:53 PM, Srijan <sh...@gmail.com> wrote:
> > > >>>
> > > >>> Hi All,
> > > >>>
> > > >>> I am working on designing a Solr based enterprise search solution.
> > One
> > > >>> requirement I have is to track crawled data from various different
> > data
> > > >>> sources with metadata like crawled date, indexing status and so
> on. I
> > > am
> > > >>> looking into using Solr itself as my data store and not adding a
> > > separate
> > > >>> database to my stack. Has anyone used Solr as a dedicated data
> store?
> > > How
> > > >>> did it compare to an RDBMS? I see Lucidworks Fusion has a notion of
> > > Crawl
> > > >>> DB - can someone here share some insight into how Fusion is using
> > this
> > > >>> 'DB'? My store will need to track millions of objects and be able
> to
> > > >> handle
> > > >>> parallel adds/updates. Do you think Solr is a good tool for this or
> > am
> > > I
> > > >>> better off depending on a database service?
> > > >>>
> > > >>> Thanks a bunch.
> > > >>>
> > > >>
> > > >
> > > >
> > > > --
> > > > http://www.needhamsoftware.com (work)
> > > > http://www.the111shift.com (play)
> > >
> >
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

Re: Solr as a dedicated data store?

Posted by Gus Heck <gu...@gmail.com>.
It's not shocking that there are differences among products. If that
feature is your favorite, use elastic. There are other features... and
licensing which matters to some. Amazon's effort is interesting, but will
it persist? When Oracle bought MySQL AB, a site named dorsal source dot org
(don't go there; it's now inhabited by an attack site AFAICT, but you can
see it on wayback machine ~2008) sprang up in response (a friend of mine
was involved). Granted it was not backed by a big company, but it was
useful for a while. Even big companies may change priorities over time and
sunset things. Open source projects can be archived too, but Lucene and
Solr are among the most active, so that is clearly not a near term risk.
Your tone, however, sounds a bit fanboyish, and sounds a bit like you forget
that the folks who maintain solr are all volunteers. If you see things that
need fixing or improving, or want to argue for change without disparaging
comments, we certainly welcome your input (and your code if you are so
inclined).

-Gus

On Thu, Apr 7, 2022 at 10:41 PM James Greene <ja...@jamesaustingreene.com>
wrote:

> > so that we can be free to make improvements without having to carry an
> ever growing weight of back compatibility
>
>
> This is actually why people abandon solr for elastic/opensearch. Solr's core
> contributors place little value on supporting migration paths and stability
> within, so it's always a heavy cost to users for upgrades.
>
> Very few people think solr is stable between upgrades (anyone? Bueller....
> anyone?). This means you need to plan for the migration of data
> (time/storage) between upgrades. This doesn't mean you need to reindex
> from source (you will be reindexing); it means you cannot get more/new data
> from source that you didn't include in your original document when
> indexing.  There are strategies for storing full "source documents" without
> having them indexed that allow you to re-index from the stored document
> (non-indexed fields) without requiring you to have a totally separate
> persistence layer.
>
>
>
> On Thu, Apr 7, 2022, 10:03 PM Dave <ha...@gmail.com> wrote:
>
> > This is one of the most interesting and articulate emails I’ve read about
> > the fundamentals in a long time. Saving this one :)
> >
> > > On Apr 7, 2022, at 9:32 PM, Gus Heck <gu...@gmail.com> wrote:
> > >
> > > Solr is not a "good" primary data store. Solr is built for finding
> your
> > > documents, not storing them. A good primary data store holds stuff
> > > indefinitely without adding weight and without changing regularly, Solr
> > > doesn't fit that description. One of the biggest reasons for this is
> that
> > > at some point you'll want to upgrade to the latest version, and we only
> > > support a single interim upgrade. So from 6 to 7 or 7 to 8 etc...
> > multiple
> > > step upgrades 6 to 7 to 8 may fail. I seem to recall hearing that this
> > was
> > > actually enforced by the code but I didn't find the check on a quick
> look
> > > through the code (doesn't mean it isn't there, just I didn't find it).
> In
> > > any case, multi-version upgrades are not generally supported,
> > intentionally
> > > so that we can be free to make improvements without having to carry an
> > ever
> > > growing weight of back compatibility. Typically if new index features
> are
> > > developed and you want to use them (like when doc values were
> introduced)
> > > you will need to re-index to use the new feature. Search engines
> > > precalculate and write typically denormalized or otherwise processed
> > > information into the index prioritizing speed of retrieval over space
> and
> > > long term storage. As others have mentioned, there is also the ever
> > > changing requirements problem. Typically someone in product management
> or
> > > if you are unlucky, the CEO hears of something cool someone did with
> solr
> > > and says: Hey, let's do that too! I bet it would really draw customers
> > > in!... 9 times out of 10 the new thing involves changing the way
> > something
> > > is analyzed, or adding a new analysis of previously ingested data. If
> you
> > > can't reindex you have to be able to say "no, not on old data" and
> > possibly
> > > say "we'll need a separate collection for the new data and it will be
> > > difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
> > >
> > > The ability to add fields to documents is much more for things like
> > adding
> > > searchable geo-located gps coordinates for documents that have a
> location
> > > or metadata like you mention to a document than for storing the
> document
> > > content itself. It is *possible* to have self re-indexing documents
> that
> > > contain all the data needed to repeat indexing but it takes a lot of
> > space
> > > and slows down your index. Furthermore it requires that all
> > indexing
> > > enrichment/cleaning/etc be baked inside solr using
> > > updateProcessorFactories... which in turn makes all that indexing work
> > > compete more heavily with search queries... or alternately requires
> that
> > > the data has to be queried out and inserted back in after external
> > > processing, which is also going to compete with user queries (so maybe
> > one
> > > winds up fielding extra hardware or even 2 clusters - twice as many
> > > machines - and swapping clusters back and forth periodically, now it's
> > > complex, expensive and has a very high index latency instead of slow
> > > queries... no free lunch there) Trying to store the original data just
> > > complicates matters. Keeping it simple and using solr to find things
> that
> > > are then served from a primary source is really the best place to
> start.
> > >
> > > So yeah, you *could* use it as a primary store with work and acceptance
> > of
> > > limitations, but you have to be aware of what you are doing, and have a
> > > decently working crystal ball. I never advise clients to do this
> because
> > I
> > > prefer happy clients that say nice things about me :) So my advice to
> you
> > > is don't do it unless there is an extremely compelling reason.
> > >
> > > Assuming you're not dealing with really massive amounts of data, just
> > > indexing some internal intranet (and it's not something the size of
> apple
> > > or google), then for your use case, crawling pages, I'd have the
> crawler
> > > drop anything it finds and considers worthy of indexing to a filesystem
> > > (maybe 2 files, the content and a file with metadata like the link
> where
> > it
> > > was found), have a separate indexing process scan the filesystem
> > > periodically, munge it for metadata or whatever other manipulations are
> > > useful and then write the result to solr. If the crawl store is
> designed
> > so
> > > the same document always lands in the same location, you don't have
> to
> > > worry about growth other than the growth of the site(s) you are
> indexing.
> > > There are ways to improve on things from there such as adding a kafka
> > > instance for a topic that identifies newly fetched docs to prevent (or
> > > augment) the periodic scanning. Also storing a hash of the content in a
> > > database to let the indexer ignore when the crawler simply downloaded
> the
> > > same bytes, cause nothing's changed...
> > >
> > > And you'll want to decide if you want to remove references to pages
> that
> > > disappeared or detect moves/renames vs deletions which is a whole thing
> > of
> > > its own...
> > >
> > > My side project JesterJ.org <https://www.JesterJ.org> provides a good
> > deal
> > > of the indexer features I describe (but it still needs a kafka
> connector,
> > > contributions welcome :) ). Some folks have used it profitably, but
> it's
> > > admittedly still rough, and the current master is much better than the
> > now
> > > ancient, last released beta (which probably should have been an alpha
> but
> > > oh well :)
> > >
> > > -Gus
> > >
> > >> On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <
> > dominique.bejean@eolya.fr>
> > >> wrote:
> > >>
> > >> Hi,
> > >>
> > >> A best practice for performances and ressources usage is to store
> and/or
> > >> index and/or docValues only data required for your search features.
> > >> However, in order to implement or modify new or existing features in
> an
> > >> index you will need to reindex all the data in this index.
> > >>
> > >> I propose 2 solutions :
> > >>
> > >>   - The first one is to store the full original JSON data into the
> _str_
> > >>   fields of the index.
> > >>
> > >>
> > >>
> >
> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
> > >>
> > >>
> > >>   - The second and the best solution in my opinion is to store the
> JSON
> > >>   data into an intermediate feature neutral data store as a file
> simple
> > >> file
> > >>   system or better a MongoDB database. This way will allow you to use
> > your
> > >>   data in several indexes (one index for search, one index for
> > suggesters,
> > >>   ...)  without duplicating data into _src_ fields in each index. A
> uuid
> > >> in
> > >>   each index will allow you to get the full JSON object in MongoDB.
> > >>
> > >>
> > >> Obviously a key point is the backup strategy of your data store
> > according
> > >> to the solution you choose : either Solr indexes or the file system or
> > the
> > >> MongoDB database.
> > >>
> > >> Dominique
> > >>
> > >>> On Mon, Apr 4, 2022 at 1:53 PM, Srijan <sh...@gmail.com> wrote:
> > >>>
> > >>> Hi All,
> > >>>
> > >>> I am working on designing a Solr based enterprise search solution.
> One
> > >>> requirement I have is to track crawled data from various different
> data
> > >>> sources with metadata like crawled date, indexing status and so on. I
> > am
> > >>> looking into using Solr itself as my data store and not adding a
> > separate
> > >>> database to my stack. Has anyone used Solr as a dedicated data store?
> > How
> > >>> did it compare to an RDBMS? I see Lucidworks Fusion has a notion of
> > Crawl
> > >>> DB - can someone here share some insight into how Fusion is using
> this
> > >>> 'DB'? My store will need to track millions of objects and be able to
> > >> handle
> > >>> parallel adds/updates. Do you think Solr is a good tool for this or
> am
> > I
> > >>> better off depending on a database service?
> > >>>
> > >>> Thanks a bunch.
> > >>>
> > >>
> > >
> > >
> > > --
> > > http://www.needhamsoftware.com (work)
> > > http://www.the111shift.com (play)
> >
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Solr as a dedicated data store?

Posted by James Greene <ja...@jamesaustingreene.com>.
> so that we can be free to make improvements without having to carry an
ever growing weight of back compatibility


This is actually why people abandon solr for elastic/opensearch. Solr's core
contributors place little value on supporting migration paths and stability
within, so it's always a heavy cost to users for upgrades.

Very few people think solr is stable between upgrades (anyone? Bueller....
anyone?). This means you need to plan for the migration of data
(time/storage) between upgrades. This doesn't mean you need to reindex
from source (you will be reindexing); it means you cannot get more/new data
from source that you didn't include in your original document when
indexing. There are strategies for storing full "source documents" without
having them indexed that allow you to re-index from the stored document
(non-indexed fields) without requiring you to have a totally separate
persistence layer.
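
A hedged sketch of one such strategy: read each stored document back and add
it to the new collection, so the new schema and analysis chains re-apply. It
assumes every field you care about is stored, the names are hypothetical,
and paging over the old collection works as usual (e.g. with cursorMark):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class ReindexFromStored {
        // Copy one stored document into the target collection.
        static void copyDoc(SolrClient client, SolrDocument doc) throws Exception {
            SolrInputDocument in = new SolrInputDocument();
            for (String f : doc.getFieldNames()) {
                // Skip internal bookkeeping; stored copyField targets may also
                // need skipping so they aren't doubled up on the way back in.
                if (!"_version_".equals(f)) {
                    in.addField(f, doc.getFieldValue(f));
                }
            }
            client.add("new_collection", in);
        }
    }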



On Thu, Apr 7, 2022, 10:03 PM Dave <ha...@gmail.com> wrote:

> This is one of the most interesting and articulate emails I’ve read about
> the fundamentals in a long time. Saving this one :)
>
> > On Apr 7, 2022, at 9:32 PM, Gus Heck <gu...@gmail.com> wrote:
> >
> > Solr is not a "good" primary data store. Solr is built for finding your
> > documents, not storing them. A good primary data store holds stuff
> > indefinitely without adding weight and without changing regularly, Solr
> > doesn't fit that description. One of the biggest reasons for this is that
> > at some point you'll want to upgrade to the latest version, and we only
> > support a single interim upgrade. So from 6 to 7 or 7 to 8 etc...
> multiple
> > step upgrades 6 to 7 to 8 may fail. I seem to recall hearing that this
> was
> > actually enforced by the code but I didn't find the check on a quick look
> > through the code (doesn't mean it isn't there, just I didn't find it). In
> > any case, multi-version upgrades are not generally supported,
> intentionally
> > so that we can be free to make improvements without having to carry an
> ever
> > growing weight of back compatibility. Typically if new index features are
> > developed and you want to use them (like when doc values were introduced)
> > you will need to re-index to use the new feature. Search engines
> > precalculate and write typically denormalized or otherwise processed
> > information into the index prioritizing speed of retrieval over space and
> > long term storage. As others have mentioned, there is also the ever
> > changing requirements problem. Typically someone in product management or
> > if you are unlucky, the CEO hears of something cool someone did with solr
> > and says: Hey, let's do that too! I bet it would really draw customers
> > in!... 9 times out of 10 the new thing involves changing the way
> something
> > is analyzed, or adding a new analysis of previously ingested data. If you
> > can't reindex you have to be able to say "no, not on old data" and
> possibly
> > say "we'll need a separate collection for the new data and it will be
> > difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
> >
> > The ability to add fields to documents is much more for things like
> adding
> > searchable geo-located gps coordinates for documents that have a location
> > or metadata like you mention to a document than for storing the document
> > content itself. It is *possible* to have self re-indexing documents that
> > contain all the data needed to repeat indexing but it takes a lot of
> space
> > and slows down your index. Furthermore it requires that all
> indexing
> > enrichment/cleaning/etc be baked inside solr using
> > updateProcessorFactories... which in turn makes all that indexing work
> > compete more heavily with search queries... or alternately requires that
> > the data has to be queried out and inserted back in after external
> > processing, which is also going to compete with user queries (so maybe
> one
> > winds up fielding extra hardware or even 2 clusters - twice as many
> > machines - and swapping clusters back and forth periodically, now it's
> > complex, expensive and has a very high index latency instead of slow
> > queries... no free lunch there) Trying to store the original data just
> > complicates matters. Keeping it simple and using solr to find things that
> > are then served from a primary source is really the best place to start.
> >
> > So yeah, you *could* use it as a primary store with work and acceptance
> of
> > limitations, but you have to be aware of what you are doing, and have a
> > decently working crystal ball. I never advise clients to do this because
> I
> > prefer happy clients that say nice things about me :) So my advice to you
> > is don't do it unless there is an extremely compelling reason.
> >
> > Assuming you're not dealing with really massive amounts of data, just
> > indexing some internal intranet (and it's not something the size of apple
> > or google), then for your use case, crawling pages, I'd have the crawler
> > drop anything it finds and considers worthy of indexing to a filesystem
> > (maybe 2 files, the content and a file with metadata like the link where
> it
> > was found), have a separate indexing process scan the filesystem
> > periodically, munge it for metadata or whatever other manipulations are
> > useful and then write the result to solr. If the crawl store is designed
> so
> > the same document always lands in the same location, you don't have to
> > worry about growth other than the growth of the site(s) you are indexing.
> > There are ways to improve on things from there such as adding a kafka
> > instance for a topic that identifies newly fetched docs to prevent (or
> > augment) the periodic scanning. Also storing a hash of the content in a
> > database to let the indexer ignore when the crawler simply downloaded the
> > same bytes, cause nothing's changed...
> >
> > And you'll want to decide if you want to remove references to pages that
> > disappeared or detect moves/renames vs deletions which is a whole thing
> of
> > its own...
> >
> > My side project JesterJ.org <https://www.JesterJ.org> provides a good
> deal
> > of the indexer features I describe (but it still needs a kafka connector,
> > contributions welcome :) ). Some folks have used it profitably, but it's
> > admittedly still rough, and the current master is much better than the
> now
> > ancient, last released beta (which probably should have been an alpha but
> > oh well :)
> >
> > -Gus
> >
> >> On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <
> dominique.bejean@eolya.fr>
> >> wrote:
> >>
> >> Hi,
> >>
> >> A best practice for performances and ressources usage is to store and/or
> >> index and/or docValues only data required for your search features.
> >> However, in order to implement or modify new or existing features in an
> >> index you will need to reindex all the data in this index.
> >>
> >> I propose 2 solutions :
> >>
> >>   - The first one is to store the full original JSON data into the _str_
> >>   fields of the index.
> >>
> >>
> >>
> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
> >>
> >>
> >>   - The second and the best solution in my opinion is to store the JSON
> >>   data into an intermediate feature neutral data store as a file simple
> >> file
> >>   system or better a MongoDB database. This way will allow you to use
> your
> >>   data in several indexes (one index for search, one index for
> suggesters,
> >>   ...)  without duplicating data into _src_ fields in each index. A uuid
> >> in
> >>   each index will allow you to get the full JSON object in MongoDB.
> >>
> >>
> >> Obviously a key point is the backup strategy of your data store
> according
> >> to the solution you choose : either Solr indexes or the file system or
> the
> >> MongoDB database.
> >>
> >> Dominique
> >>
> >>> On Mon, Apr 4, 2022 at 1:53 PM, Srijan <sh...@gmail.com> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I am working on designing a Solr based enterprise search solution. One
> >>> requirement I have is to track crawled data from various different data
> >>> sources with metadata like crawled date, indexing status and so on. I
> am
> >>> looking into using Solr itself as my data store and not adding a
> separate
> >>> database to my stack. Has anyone used Solr as a dedicated data store?
> How
> >>> did it compare to an RDBMS? I see Lucidworks Fusion has a notion of
> Crawl
> >>> DB - can someone here share some insight into how Fusion is using this
> >>> 'DB'? My store will need to track millions of objects and be able to
> >> handle
> >>> parallel adds/updates. Do you think Solr is a good tool for this or am
> I
> >>> better off depending on a database service?
> >>>
> >>> Thanks a bunch.
> >>>
> >>
> >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > http://www.the111shift.com (play)
>

Re: Solr as a dedicated data store?

Posted by Dave <ha...@gmail.com>.
This is one of the most interesting and articulate emails I’ve read about the fundamentals in a long time. Saving this one :)

> On Apr 7, 2022, at 9:32 PM, Gus Heck <gu...@gmail.com> wrote:
> 
> Solr is not a "good" primary data store. Solr is built for finding your
> documents, not storing them. A good primary data store holds stuff
> indefinitely without adding weight and without changing regularly, Solr
> doesn't fit that description. One of the biggest reasons for this is that
> at some point you'll want to upgrade to the latest version, and we only
> support a single interim upgrade. So from 6 to 7 or 7 to 8 etc... multiple
> step upgrades 6 to 7 to 8 may fail. I seem to recall hearing that this was
> actually enforced by the code but I didn't find the check on a quick look
> through the code (doesn't mean it isn't there, just I didn't find it). In
> any case, multi-version upgrades are not generally supported, intentionally
> so that we can be free to make improvements without having to carry an ever
> growing weight of back compatibility. Typically if new index features are
> developed and you want to use them (like when doc values were introduced)
> you will need to re-index to use the new feature. Search engines
> precalculate and write typically denormalized or otherwise processed
> information into the index prioritizing speed of retrieval over space and
> long term storage. As others have mentioned, there is also the ever
> changing requirements problem. Typically someone in product management or
> if you are unlucky, the CEO hears of something cool someone did with solr
> and says: Hey, let's do that too! I bet it would really draw customers
> in!... 9 times out of 10 the new thing involves changing the way something
> is analyzed, or adding a new analysis of previously ingested data. If you
> can't reindex you have to be able to say "no, not on old data" and possibly
> say "we'll need a separate collection for the new data and it will be
> difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
> 
> The ability to add fields to documents is much more for things like adding
> searchable geo-located gps coordinates for documents that have a location
> or metadata like you mention to a document than for storing the document
> content itself. It is *possible* to have self re-indexing documents that
> contain all the data needed to repeat indexing but it takes a lot of space
> and slows down your index. Furthermore it requires that all indexing
> enrichment/cleaning/etc be baked inside solr using
> updateProcessorFactories... which in turn makes all that indexing work
> compete more heavily with search queries... or alternately requires that
> the data has to be queried out and inserted back in after external
> processing, which is also going to compete with user queries (so maybe one
> winds up fielding extra hardware or even 2 clusters - twice as many
> machines - and swapping clusters back and forth periodically, now it's
> complex, expensive and has a very high index latency instead of slow
> queries... no free lunch there) Trying to store the original data just
> complicates matters. Keeping it simple and using solr to find things that
> are then served from a primary source is really the best place to start.
> 
> So yeah, you *could* use it as a primary store with work and acceptance of
> limitations, but you have to be aware of what you are doing, and have a
> decently working crystal ball. I never advise clients to do this because I
> prefer happy clients that say nice things about me :) So my advice to you
> is don't do it unless there is an extremely compelling reason.
> 
> Assuming you're not dealing with really massive amounts of data, just
> indexing some internal intranet (and it's not something the size of apple
> or google), then for your use case, crawling pages, I'd have the crawler
> drop anything it finds and considers worthy of indexing to a filesystem
> (maybe 2 files, the content and a file with metadata like the link where it
> was found), have a separate indexing process scan the filesystem
> periodically, munge it for metadata or whatever other manipulations are
> useful and then write the result to solr. If the crawl store is designed so
> the same document always lands in the same location, you don't have to
> worry about growth other than the growth of the site(s) you are indexing.
> There are ways to improve on things from there such as adding a kafka
> instance for a topic that identifies newly fetched docs to prevent (or
> augment) the periodic scanning. Also storing a hash of the content in a
> database to let the indexer ignore when the crawler simply downloaded the
> same bytes, cause nothing's changed...
> 
> And you'll want to decide if you want to remove references to pages that
> disappeared or detect moves/renames vs deletions which is a whole thing of
> its own...
> 
> My side project JesterJ.org <https://www.JesterJ.org> provides a good deal
> of the indexer features I describe (but it still needs a kafka connector,
> contributions welcome :) ). Some folks have used it profitably, but it's
> admittedly still rough, and the current master is much better than the now
> ancient, last released beta (which probably should have been an alpha but
> oh well :)
> 
> -Gus
> 
>> On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <do...@eolya.fr>
>> wrote:
>> 
>> Hi,
>> 
>> A best practice for performances and ressources usage is to store and/or
>> index and/or docValues only data required for your search features.
>> However, in order to implement or modify new or existing features in an
>> index you will need to reindex all the data in this index.
>> 
>> I propose 2 solutions :
>> 
>>   - The first one is to store the full original JSON data into the _str_
>>   fields of the index.
>> 
>> 
>> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
>> 
>> 
>>   - The second and the best solution in my opinion is to store the JSON
>>   data into an intermediate feature neutral data store as a file simple
>> file
>>   system or better a MongoDB database. This way will allow you to use your
>>   data in several indexes (one index for search, one index for suggesters,
>>   ...)  without duplicating data into _src_ fields in each index. A uuid
>> in
>>   each index will allow you to get the full JSON object in MongoDB.
>> 
>> 
>> Obviously a key point is the backup strategy of your data store according
>> to the solution you choose : either Solr indexes or the file system or the
>> MongoDB database.
>> 
>> Dominique
>> 
>>> On Mon, Apr 4, 2022 at 1:53 PM, Srijan <sh...@gmail.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> I am working on designing a Solr based enterprise search solution. One
>>> requirement I have is to track crawled data from various different data
>>> sources with metadata like crawled date, indexing status and so on. I am
>>> looking into using Solr itself as my data store and not adding a separate
>>> database to my stack. Has anyone used Solr as a dedicated data store? How
>>> did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
>>> DB - can someone here share some insight into how Fusion is using this
>>> 'DB'? My store will need to track millions of objects and be able to
>> handle
>>> parallel adds/updates. Do you think Solr is a good tool for this or am I
>>> better off depending on a database service?
>>> 
>>> Thanks a bunch.
>>> 
>> 
> 
> 
> -- 
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)

Re: Solr as a dedicated data store?

Posted by Gus Heck <gu...@gmail.com>.
Solr is not a "good" primary data store. Solr is built for finding your
documents, not storing them. A good primary data store holds stuff
indefinitely without adding weight and without changing regularly; Solr
doesn't fit that description. One of the biggest reasons for this is that
at some point you'll want to upgrade to the latest version, and we only
support a single interim upgrade. So from 6 to 7 or 7 to 8 etc... multiple
step upgrades 6 to 7 to 8 may fail. I seem to recall hearing that this was
actually enforced by the code but I didn't find the check on a quick look
through the code (doesn't mean it isn't there, just I didn't find it). In
any case, multi-version upgrades are not generally supported, intentionally
so that we can be free to make improvements without having to carry an ever
growing weight of back compatibility. Typically if new index features are
developed and you want to use them (like when doc values were introduced)
you will need to re-index to use the new feature. Search engines
precalculate and write typically denormalized or otherwise processed
information into the index prioritizing speed of retrieval over space and
long term storage. As others have mentioned, there is also the ever
changing requirements problem. Typically someone in product management or
if you are unlucky, the CEO hears of something cool someone did with solr
and says: Hey, let's do that too! I bet it would really draw customers
in!... 9 times out of 10 the new thing involves changing the way something
is analyzed, or adding a new analysis of previously ingested data. If you
can't reindex you have to be able to say "no, not on old data" and possibly
say "we'll need a separate collection for the new data and it will be
difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.

The ability to add fields to documents is much more for things like adding
searchable geo-located gps coordinates for documents that have a location
or metadata like you mention to a document than for storing the document
content itself. It is *possible* to have self re-indexing documents that
contain all the data needed to repeat indexing but it takes a lot of space
and slows down your index. Furthermore it requires that all indexing
enrichment/cleaning/etc be baked inside solr using
updateProcessorFactories... which in turn makes all that indexing work
compete more heavily with search queries... or alternately requires that
the data has to be queried out and inserted back in after external
processing, which is also going to compete with user queries (so maybe one
winds up fielding extra hardware or even 2 clusters - twice as many
machines - and swapping clusters back and forth periodically, now it's
complex, expensive and has a very high index latency instead of slow
queries... no free lunch there) Trying to store the original data just
complicates matters. Keeping it simple and using solr to find things that
are then served from a primary source is really the best place to start.

So yeah, you *could* use it as a primary store with work and acceptance of
limitations, but you have to be aware of what you are doing, and have a
decently working crystal ball. I never advise clients to do this because I
prefer happy clients that say nice things about me :) So my advice to you
is don't do it unless there is an extremely compelling reason.

Assuming you're not dealing with really massive amounts of data, just
indexing some internal intranet (and it's not something the size of apple
or google), then for your use case, crawling pages, I'd have the crawler
drop anything it finds and considers worthy of indexing to a filesystem
(maybe 2 files, the content and a file with metadata like the link where it
was found), have a separate indexing process scan the filesystem
periodically, munge it for metadata or whatever other manipulations are
useful and then write the result to solr. If the crawl store is designed so
the same document always lands in the same location, you don't have to
worry about growth other than the growth of the site(s) you are indexing.
There are ways to improve on things from there such as adding a kafka
instance for a topic that identifies newly fetched docs to prevent (or
augment) the periodic scanning. Also storing a hash of the content in a
database to let the indexer ignore when the crawler simply downloaded the
same bytes, cause nothing's changed...
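
A small sketch of that hash check; the in-memory map stands in for the
database table, and everything here is illustrative:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class CrawlDedup {
        // url -> SHA-256 of the last content indexed (stand-in for a DB table)
        private final Map<String, String> lastHash = new HashMap<>();

        // Returns true when the crawler fetched the same bytes as last time,
        // so the indexer can skip the document entirely.
        public boolean unchanged(String url, byte[] content) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            String hash = new BigInteger(1, md.digest(content)).toString(16);
            String previous = lastHash.put(url, hash);
            return hash.equals(previous);
        }
    }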

And you'll want to decide if you want to remove references to pages that
disappeared or detect moves/renames vs deletions which is a whole thing of
its own...

My side project JesterJ.org <https://www.JesterJ.org> provides a good deal
of the indexer features I describe (but it still needs a kafka connector,
contributions welcome :) ). Some folks have used it profitably, but it's
admittedly still rough, and the current master is much better than the now
ancient, last released beta (which probably should have been an alpha but
oh well :)

-Gus

On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <do...@eolya.fr>
wrote:

> Hi,
>
> A best practice for performances and ressources usage is to store and/or
> index and/or docValues only data required for your search features.
> However, in order to implement or modify new or existing features in an
> index you will need to reindex all the data in this index.
>
> I propose 2 solutions :
>
>    - The first one is to store the full original JSON data into the _str_
>    fields of the index.
>
>
> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
>
>
>    - The second and the best solution in my opinion is to store the JSON
>    data into an intermediate feature neutral data store as a file simple
> file
>    system or better a MongoDB database. This way will allow you to use your
>    data in several indexes (one index for search, one index for suggesters,
>    ...)  without duplicating data into _src_ fields in each index. A uuid
> in
>    each index will allow you to get the full JSON object in MongoDB.
>
>
> Obviously a key point is the backup strategy of your data store according
> to the solution you choose : either Solr indexes or the file system or the
> MongoDB database.
>
> Dominique
>
> On Mon, Apr 4, 2022 at 1:53 PM, Srijan <sh...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am working on designing a Solr based enterprise search solution. One
> > requirement I have is to track crawled data from various different data
> > sources with metadata like crawled date, indexing status and so on. I am
> > looking into using Solr itself as my data store and not adding a separate
> > database to my stack. Has anyone used Solr as a dedicated data store? How
> > did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
> > DB - can someone here share some insight into how Fusion is using this
> > 'DB'? My store will need to track millions of objects and be able to
> handle
> > parallel adds/updates. Do you think Solr is a good tool for this or am I
> > better off depending on a database service?
> >
> > Thanks a bunch.
> >
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Solr as a dedicated data store?

Posted by Dominique Bejean <do...@eolya.fr>.
Hi,

A best practice for performance and resource usage is to store and/or
index and/or docValues only the data required by your search features.
However, in order to implement or modify new or existing features in an
index, you will need to reindex all the data in this index.

I propose 2 solutions:

   - The first one is to store the full original JSON data into the _src_
   field of the index (see the sketch after this list).

   https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default


   - The second, and in my opinion the best, solution is to store the JSON
   data in an intermediate, feature-neutral data store such as a simple file
   system or, better, a MongoDB database. This will allow you to use your
   data in several indexes (one index for search, one index for suggesters,
   ...) without duplicating data into _src_ fields in each index. A uuid in
   each index will allow you to get the full JSON object back from MongoDB.
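
A minimal sketch of the first option, assuming a local Solr and a collection
named my_collection (both invented here): post the document through the
/update/json/docs handler with the srcField parameter set, so Solr keeps the
verbatim JSON alongside the indexed fields.

import requests

url = "http://localhost:8983/solr/my_collection/update/json/docs"
# srcField stores the raw JSON; if I read the reference guide correctly,
# it only works together with split=/
params = {"split": "/", "srcField": "_src_", "commit": "true"}
doc = {"id": "doc-1", "title": "hello", "crawled_date": "2022-04-04T00:00:00Z"}
requests.post(url, params=params, json=doc).raise_for_status()
# the original JSON now sits verbatim in _src_, ready to be read back and
# replayed against a new schema when you need to reindex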


Obviously a key point is the backup strategy of your data store, according
to the solution you choose: either the Solr indexes, the file system, or
the MongoDB database.

Dominique

On Mon, Apr 4, 2022 at 13:53, Srijan <sh...@gmail.com> wrote:

> Hi All,
>
> I am working on designing a Solr based enterprise search solution. One
> requirement I have is to track crawled data from various different data
> sources with metadata like crawled date, indexing status and so on. I am
> looking into using Solr itself as my data store and not adding a separate
> database to my stack. Has anyone used Solr as a dedicated data store? How
> did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
> DB - can someone here share some insight into how Fusion is using this
> 'DB'? My store will need to track millions of objects and be able to handle
> parallel adds/updates. Do you think Solr is a good tool for this or am I
> better off depending on a database service?
>
> Thanks a bunch.
>

Re: Solr as a dedicated data store?

Posted by Markus Jelsma <ma...@openindex.io>.
> The 'no' response is traditional and a bit dated.

Agreed, we have been using Solr as a main data store for many years for
some use cases. But we only store either logs or data that we can reproduce
or regenerate.

The original message asked about storing a CrawlDB; in that case storing it
in Solr is fine, since the data is easy to reproduce in case of disaster.



On Tue, Apr 5, 2022 at 15:26, James Greene <james@jamesaustingreene.com> wrote:

> The 'no' response is traditional and a bit dated.  If you have proper
> backup/snapshots happening, it is totally plausible to use Solr (Lucene)
> as a primary data store. If you need field/config changes, you can rebuild
> a collection from an existing collection, doing the field transforms on
> the fly.
>
> There are a growing number of products built on Lucene/Elastic that act as
> a primary datastore. There is no reason Solr can't be used the same way,
> apart from the core devs' slow response to bugs/documentation, but that's
> a topic for questioning using Solr at all.
>
> Like all software solutions, your system should be designed with
> redundancy and resiliency.
>
> Good Luck!
>
> On Tue, Apr 5, 2022, 12:44 AM Tim Casey <tc...@gmail.com> wrote:
>
> > Srijan,
> >
> > Comments off the top of my head, so buyer beware.
> >
> > Almost always you want to be able to reindex your data from a 'source'.
> > This makes indexes a poor choice as a data store, or as a source of
> > truth.  The reasons for this vary.  Indexes age out data because there
> > is frequently a weight towards more recent items, indexes need to be
> > reindexed when there is new info to index or after issues during
> > indexing/processing, and the list goes on.
> >
> > I built an index-backed POJO store in Lucene a *long* time ago.  It is
> > doable to hydrate a stored object into a language-level object, such as
> > a Java object instance.  It is fairly straightforward to map a 'common'
> > data model onto an index as a data model.  But the query expectations
> > and so on are not quite the same.  It is not that far off, but again,
> > this is not the primary focus of an inverted index.  The primary focus
> > is to take unstructured language data and return results in a hopefully
> > well-ordered list.
> >
> > So, the first thing you might do is treat the different sources of data
> > as different clusters with a different topology.  You might stripe the
> > data less and use more nodes than you otherwise would, because you will
> > do less indexing with it than you would with a normal index.  Once you
> > make a decision to separate out the data, you have to contend with two
> > different indexes holding references to the same 'documents', with some
> > id to tie them together, and you would lose the ability to do any form
> > of in-index join using document ids.  If you keep all the data in the
> > same index, then you might end up in a situation where the common answer
> > is to reindex, and you would not know what to do about the "metadata".
> >
> > I strongly suspect what you want is a way to maintain the metadata
> > within the index and use it simply as you would, along with the
> > documents.  As you spider, keep the info about the document with the
> > document contents.  I cannot think of a reason to keep all of the data
> > in a kinda weird separate space.  If you want to be more sophisticated,
> > then you can build an ETL which takes documents and forms indexable
> > units, and store the indexable units for reindexing.  This is usually
> > pretty quick and separates out the crawling, ETL, and indexing/query
> > pieces, for all that means.  This is more complicated, but would be a
> > bit more standard in how people think about it.
> >
> > tim
> >
> >
> >
> > On Mon, Apr 4, 2022 at 7:32 PM Shawn Heisey <ap...@elyograg.org> wrote:
> >
> > > On 4/4/2022 5:52 AM, Srijan wrote:
> > > > I am working on designing a Solr based enterprise search solution.
> One
> > > > requirement I have is to track crawled data from various different
> data
> > > > sources with metadata like crawled date, indexing status and so on. I
> > am
> > > > looking into using Solr itself as my data store and not adding a
> > separate
> > > > database to my stack. Has anyone used Solr as a dedicated data store?
> > How
> > > > did it compare to an RDBMS?
> > >
> > > As you've been told, Solr is NOT a database.  It is most definitely not
> > > equivalent in any way to an RDBMS.  If you want the kinds of things an
> > > RDBMS is good for, you should use an RDBMS, not Solr.
> > >
> > > Handling ever-changing search requirements in Solr is typically going
> to
> > > require the kinds of schema changes that need a full reindex.  So you
> > > probably wouldn't be able to use the same Solr index for your data
> > > storage as you do for searching anyway.
> > >
> > > If you're going to need to set up two Solr installs to handle your
> > > needs, you should probably NOT use Solr for the storage role.  Use
> > > something that has been tested and hardened against data loss. Solr
> does
> > > do its best to never lose data, but guaranteed data durability is not
> > > one of its design goals.  The changes that would be required to make
> > > that guarantee would most likely have an extremely adverse effect on
> > > search performance.
> > >
> > > Solr's core functionality has always been search.  Search is what it's
> > > good at, and that's what will be optimized in future versions ... not
> > > any kind of database functionality.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>

Re: Solr as a dedicated data store?

Posted by James Greene <ja...@jamesaustingreene.com>.
The 'no' response is traditional and a bit dated.  If you have proper
backup/snapshots happening, it is totally plausible to use Solr (Lucene) as
a primary data store. If you need field/config changes, you can rebuild a
collection from an existing collection, doing the field transforms on the
fly.
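
If memory serves, Solr 8.4 added a REINDEXCOLLECTION action to the
collections API for exactly this kind of rebuild. A rough sketch of the
call (the collection and config names are made up):

import requests

params = {
    "action": "REINDEXCOLLECTION",
    "name": "products",           # existing source collection
    "target": "products_v2",      # destination built with the new schema
    "configName": "products_v2_conf",
}
r = requests.get("http://localhost:8983/solr/admin/collections", params=params)
r.raise_for_status()
print(r.json())  # the response reports the reindexing status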

There are a growing number of products built on Lucene/Elastic that act as
a primary datastore. There is no reason Solr can't be used the same way,
apart from the core devs' slow response to bugs/documentation, but that's a
topic for questioning using Solr at all.

Like all software solutions, your system should be designed with redundancy
and resiliency.

Good Luck!

On Tue, Apr 5, 2022, 12:44 AM Tim Casey <tc...@gmail.com> wrote:

> Srijan,
>
> Comments off the top of my head, so buyer beware.
>
> Almost always you want to be able to reindex your data from a 'source'.
> This makes indexes a poor choice as a data store, or as a source of
> truth.  The reasons for this vary.  Indexes age out data because there is
> frequently a weight towards more recent items, indexes need to be
> reindexed when there is new info to index or after issues during
> indexing/processing, and the list goes on.
>
> I built an index-backed POJO store in Lucene a *long* time ago.  It is
> doable to hydrate a stored object into a language-level object, such as a
> Java object instance.  It is fairly straightforward to map a 'common' data
> model onto an index as a data model.  But the query expectations and so on
> are not quite the same.  It is not that far off, but again, this is not
> the primary focus of an inverted index.  The primary focus is to take
> unstructured language data and return results in a hopefully well-ordered
> list.
>
> So, the first thing you might do is treat the different sources of data as
> different clusters with a different topology.  You might stripe the data
> less and use more nodes than you otherwise would, because you will do less
> indexing with it than you would with a normal index.  Once you make a
> decision to separate out the data, you have to contend with two different
> indexes holding references to the same 'documents', with some id to tie
> them together, and you would lose the ability to do any form of in-index
> join using document ids.  If you keep all the data in the same index, then
> you might end up in a situation where the common answer is to reindex, and
> you would not know what to do about the "metadata".
>
> I strongly suspect what you want is a way to maintain the metadata within
> the index and use it simply as you would, along with the documents.  As
> you spider, keep the info about the document with the document contents.
> I cannot think of a reason to keep all of the data in a kinda weird
> separate space.  If you want to be more sophisticated, then you can build
> an ETL which takes documents and forms indexable units, and store the
> indexable units for reindexing.  This is usually pretty quick and
> separates out the crawling, ETL, and indexing/query pieces, for all that
> means.  This is more complicated, but would be a bit more standard in how
> people think about it.
>
> tim
>
>
>
> On Mon, Apr 4, 2022 at 7:32 PM Shawn Heisey <ap...@elyograg.org> wrote:
>
> > On 4/4/2022 5:52 AM, Srijan wrote:
> > > I am working on designing a Solr based enterprise search solution. One
> > > requirement I have is to track crawled data from various different data
> > > sources with metadata like crawled date, indexing status and so on. I
> am
> > > looking into using Solr itself as my data store and not adding a
> separate
> > > database to my stack. Has anyone used Solr as a dedicated data store?
> How
> > > did it compare to an RDBMS?
> >
> > As you've been told, Solr is NOT a database.  It is most definitely not
> > equivalent in any way to an RDBMS.  If you want the kinds of things an
> > RDBMS is good for, you should use an RDBMS, not Solr.
> >
> > Handling ever-changing search requirements in Solr is typically going to
> > require the kinds of schema changes that need a full reindex.  So you
> > probably wouldn't be able to use the same Solr index for your data
> > storage as you do for searching anyway.
> >
> > If you're going to need to set up two Solr installs to handle your
> > needs, you should probably NOT use Solr for the storage role.  Use
> > something that has been tested and hardened against data loss. Solr does
> > do its best to never lose data, but guaranteed data durability is not
> > one of its design goals.  The changes that would be required to make
> > that guarantee would most likely have an extremely adverse effect on
> > search performance.
> >
> > Solr's core functionality has always been search.  Search is what it's
> > good at, and that's what will be optimized in future versions ... not
> > any kind of database functionality.
> >
> > Thanks,
> > Shawn
> >
> >
>

Re: Solr as a dedicated data store?

Posted by Tim Casey <tc...@gmail.com>.
Srijan,

Comments off the top of my head, so buyer beware.

Almost always you want to be able to reindex your data from a 'source'.
This makes indexes a poor choice as a data store, or as a source of
truth.  The reasons for this vary.  Indexes age out data because there is
frequently a weight towards more recent items, indexes need to be reindexed
when there is new info to index or after issues during indexing/processing,
and the list goes on.

I built an index-backed POJO store in Lucene a *long* time ago.  It is
doable to hydrate a stored object into a language-level object, such as a
Java object instance.  It is fairly straightforward to map a 'common' data
model onto an index as a data model.  But the query expectations and so on
are not quite the same.  It is not that far off, but again, this is not the
primary focus of an inverted index.  The primary focus is to take
unstructured language data and return results in a hopefully well-ordered
list.

So, the first thing you might do is treat the different sources of data as
different clusters with a different topology.  You might stripe the data
less and use more nodes than you otherwise would, because you will do less
indexing with it than you would with a normal index.  Once you make a
decision to separate out the data, you have to contend with two different
indexes holding references to the same 'documents', with some id to tie
them together, and you would lose the ability to do any form of in-index
join using document ids.  If you keep all the data in the same index, then
you might end up in a situation where the common answer is to reindex, and
you would not know what to do about the "metadata".

I strongly suspect what you want is a way to maintain the metadata within
the index and use it simply as you would, along with the documents.  As you
spider, keep the info about the document with the document contents.  I
cannot think of a reason to keep all of the data in a kinda weird separate
space.  If you want to be more sophisticated, then you can build an ETL
which takes documents and forms indexable units, and store the indexable
units for reindexing.  This is usually pretty quick and separates out the
crawling, ETL, and indexing/query pieces, for all that means.  This is more
complicated, but would be a bit more standard in how people think about it.
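
Concretely, the "store the indexable units" piece can be as simple as a
JSON-lines archive that the indexer replays. A rough sketch, with the file
name, field names and collection all invented:

import json

import requests

UNITS = "indexable_units.jsonl"  # archive of fully-formed, schema-ready docs
SOLR = "http://localhost:8983/solr/docs/update"  # assumed collection

def write_unit(fh, doc_id, body, crawled_at):
    # the ETL emits one indexable unit per line
    fh.write(json.dumps({"id": doc_id, "body": body,
                         "crawled_at": crawled_at}) + "\n")

def replay():
    # reindexing is just reading the archive back and posting it to Solr
    with open(UNITS) as fh:
        batch = [json.loads(line) for line in fh]
    requests.post(SOLR, json=batch, params={"commit": "true"}).raise_for_status()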

tim



On Mon, Apr 4, 2022 at 7:32 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 4/4/2022 5:52 AM, Srijan wrote:
> > I am working on designing a Solr based enterprise search solution. One
> > requirement I have is to track crawled data from various different data
> > sources with metadata like crawled date, indexing status and so on. I am
> > looking into using Solr itself as my data store and not adding a separate
> > database to my stack. Has anyone used Solr as a dedicated data store? How
> > did it compare to an RDBMS?
>
> As you've been told, Solr is NOT a database.  It is most definitely not
> equivalent in any way to an RDBMS.  If you want the kinds of things an
> RDBMS is good for, you should use an RDBMS, not Solr.
>
> Handling ever-changing search requirements in Solr is typically going to
> require the kinds of schema changes that need a full reindex.  So you
> probably wouldn't be able to use the same Solr index for your data
> storage as you do for searching anyway.
>
> If you're going to need to set up two Solr installs to handle your
> needs, you should probably NOT use Solr for the storage role.  Use
> something that has been tested and hardened against data loss. Solr does
> do its best to never lose data, but guaranteed data durability is not
> one of its design goals.  The changes that would be required to make
> that guarantee would most likely have an extremely adverse effect on
> search performance.
>
> Solr's core functionality has always been search.  Search is what it's
> good at, and that's what will be optimized in future versions ... not
> any kind of database functionality.
>
> Thanks,
> Shawn
>
>

Re: Solr as a dedicated data store?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/4/2022 5:52 AM, Srijan wrote:
> I am working on designing a Solr based enterprise search solution. One
> requirement I have is to track crawled data from various different data
> sources with metadata like crawled date, indexing status and so on. I am
> looking into using Solr itself as my data store and not adding a separate
> database to my stack. Has anyone used Solr as a dedicated data store? How
> did it compare to an RDBMS?

As you've been told, Solr is NOT a database.  It is most definitely not 
equivalent in any way to an RDBMS.  If you want the kinds of things an 
RDBMS is good for, you should use an RDBMS, not Solr.

Handling ever-changing search requirements in Solr is typically going to 
require the kinds of schema changes that need a full reindex.  So you 
probably wouldn't be able to use the same Solr index for your data 
storage as you do for searching anyway.

If you're going to need to set up two Solr installs to handle your 
needs, you should probably NOT use Solr for the storage role.  Use 
something that has been tested and hardened against data loss. Solr does 
do its best to never lose data, but guaranteed data durability is not 
one of its design goals.  The changes that would be required to make 
that guarantee would most likely have an extremely adverse effect on 
search performance.

Solr's core functionality has always been search.  Search is what it's 
good at, and that's what will be optimized in future versions ... not 
any kind of database functionality.

Thanks,
Shawn


Re: Solr as a dedicated data store?

Posted by matthew sporleder <ms...@gmail.com>.
Agreed. We get messages on this list pretty regularly about data locked in old versions of solr with no good way out. 

Even if reindexing takes a week on a big cluster, is hard to do, and means un-glaciering stuff from S3, etc., make sure you can do it!

> On Apr 4, 2022, at 7:57 AM, Dave <ha...@gmail.com> wrote:
> 
> NO. I know it’s tempting, but Solr is a search engine, not a database. You should at any point be able to destroy the search index and rebuild it from the database.   Most any RDBMS can do what you want, or go the NoSQL Mongo route, which is becoming popular, but never use a search engine this way. You could use it as an intermediate data store for queries and speed, but that’s not its purpose.
> 
>> On Apr 4, 2022, at 7:53 AM, Srijan <sh...@gmail.com> wrote:
>> 
>> Hi All,
>> 
>> I am working on designing a Solr based enterprise search solution. One
>> requirement I have is to track crawled data from various different data
>> sources with metadata like crawled date, indexing status and so on. I am
>> looking into using Solr itself as my data store and not adding a separate
>> database to my stack. Has anyone used Solr as a dedicated data store? How
>> did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
>> DB - can someone here share some insight into how Fusion is using this
>> 'DB'? My store will need to track millions of objects and be able to handle
>> parallel adds/updates. Do you think Solr is a good tool for this or am I
>> better off depending on a database service?
>> 
>> Thanks a bunch.

Re: Solr as a dedicated data store?

Posted by Dave <ha...@gmail.com>.
NO. I know it’s tempting, but Solr is a search engine, not a database. You should at any point be able to destroy the search index and rebuild it from the database.   Most any RDBMS can do what you want, or go the NoSQL Mongo route, which is becoming popular, but never use a search engine this way. You could use it as an intermediate data store for queries and speed, but that’s not its purpose.
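
A toy sketch of what "rebuild it from the database" looks like in practice.
The table, field and collection names are invented, and sqlite stands in
for whatever RDBMS you actually use:

import sqlite3

import requests

SOLR = "http://localhost:8983/solr/search/update"  # assumed collection

def rebuild_index():
    conn = sqlite3.connect("app.db")  # the database stays the source of truth
    rows = conn.execute("SELECT id, title, body FROM documents")
    docs = [{"id": r[0], "title": r[1], "body": r[2]} for r in rows]
    # wipe the disposable index, then repopulate it entirely from the database
    requests.post(SOLR, json={"delete": {"query": "*:*"}}).raise_for_status()
    requests.post(SOLR, json=docs, params={"commit": "true"}).raise_for_status()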

> On Apr 4, 2022, at 7:53 AM, Srijan <sh...@gmail.com> wrote:
> 
> Hi All,
> 
> I am working on designing a Solr based enterprise search solution. One
> requirement I have is to track crawled data from various different data
> sources with metadata like crawled date, indexing status and so on. I am
> looking into using Solr itself as my data store and not adding a separate
> database to my stack. Has anyone used Solr as a dedicated data store? How
> did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
> DB - can someone here share some insight into how Fusion is using this
> 'DB'? My store will need to track millions of objects and be able to handle
> parallel adds/updates. Do you think Solr is a good tool for this or am I
> better off depending on a database service?
> 
> Thanks a bunch.