Posted to users@solr.apache.org by Dan Rosher <ro...@gmail.com> on 2021/09/06 13:32:50 UTC

Email alerts with streaming expressions

Hi,

I was wondering if anyone had tried email alerts with streaming
expressions, and what their experience was if attempting this with say 12
million emails / day? Traditionally this might have been done with a
database cursor iterator daily.

I was thinking of something like the following pseudocode expression, with
'kafka' as a custom push expression:

daemon(id="alertId",
       runInterval="1000",
       kafka(
         kafka_topic,
         alertId,
         topic(email_alerts,
               doc_collection,
               q="email query",
               fl="id, title, abstract",
               id="alertId",
               initialCheckpoint=0)
       )
)

If you have done something like this, where would you typically run the
daemon: on replicas separate from those serving web queries?

Many thanks in advance for any advice / suggestions,

Dan

Re: Email alerts with streaming expressions

Posted by Dan Rosher <ro...@gmail.com>.
Ah yes, brilliant, thanks Joel! I think this is exactly what I was
looking for; I wasn't aware of the executor decorator.

Thanks all again for the suggestions and the interesting possibilities
available.

Kind regards,
Dan


Re: Email alerts with streaming expressions

Posted by Eric Pugh <ep...@opensourceconnections.com>.
Also, I think this is something you could easily trial: just take out the Kafka step, replace it with, say, an insert into a Solr collection, and see what happens.

Monitoring the daemon process is easy too  ;-)
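
A minimal sketch of the trial suggested above, assuming Solr's built-in update() decorator stands in for the custom kafka() step. The destination collection alert_matches, the helper names, and the localhost URL are hypothetical; the remaining names come from the original pseudocode. Python is used only to assemble the expression and post it to the /stream handler.

```python
import urllib.parse
import urllib.request


def trial_expression(alert_id, query, checkpoints="email_alerts",
                     docs="doc_collection", sink="alert_matches"):
    """Build the daemon expression with update() in place of kafka().

    `sink` is a hypothetical collection that receives the matching documents.
    The query is inserted as-is, so it must not contain double quotes.
    """
    return (
        f'daemon(id="{alert_id}", runInterval="1000", '
        f'update({sink}, batchSize=100, '
        f'topic({checkpoints}, {docs}, q="{query}", '
        f'fl="id, title, abstract", id="{alert_id}", initialCheckpoint=0)))'
    )


def submit(expr, url="http://localhost:8983/solr/doc_collection/stream"):
    """POST the expression to a /stream handler (the URL is an assumption)."""
    data = urllib.parse.urlencode({"expr": expr}).encode()
    with urllib.request.urlopen(url, data=data) as resp:
        return resp.read().decode()


expr = trial_expression("alert42", "title:solr")
print(expr)
```

Calling submit(expr) against a running cluster would start the daemon; the matches then accumulate in the sink collection, where they can be inspected with ordinary queries.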



_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.


Re: Email alerts with streaming expressions

Posted by Joel Bernstein <jo...@gmail.com>.
A design for large-scale alerting was implemented in Streaming Expressions
and is described here:

https://joelsolr.blogspot.com/2017/01/deploying-solrs-new-parallel-executor.html

In this design you would store each alert in Solr as a topic expression.
Then a single daemon can run all the topics or it can be parallelized.
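
A rough sketch of how this design might look, not a reproduction of the linked post. It assumes each alert is stored as a Solr document whose expr_s field holds a topic expression, and that a single daemon feeds those documents to the executor decorator. The collection stored_alerts, the daemon id alertRunner, and the choice of search() with /export as the outer source are assumptions made for illustration; the other names come from the original question.

```python
def alert_document(alert_id, query):
    """One stored alert: a document whose expr_s field is a topic expression."""
    topic = (
        f'topic(email_alerts, doc_collection, q="{query}", '
        f'fl="id, title, abstract", id="{alert_id}", initialCheckpoint=0)'
    )
    return {"id": alert_id, "expr_s": topic}


def runner_expression(threads=10):
    """A single daemon that re-reads the stored alerts on every run and hands
    them to executor(), which runs the expression found in expr_s."""
    return (
        'daemon(id="alertRunner", runInterval="1000", '
        f'executor(threads={threads}, '
        'search(stored_alerts, q="*:*", fl="id, expr_s", '
        'sort="id asc", qt="/export")))'
    )


doc = alert_document("alert42", "title:solr")
print(doc["expr_s"])
print(runner_expression(threads=4))
```

In practice each stored expression would probably wrap its topic in something that delivers the matches (an update() or a custom push decorator), since a bare topic only emits tuples.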



Joel Bernstein
http://joelsolr.blogspot.com/



Re: Email alerts with streaming expressions

Posted by Charlie Hull <ch...@opensourceconnections.com>.
Hi Dan,

Yuval's suggestion and mine both rely on the same underlying code (Luwak,
now called Lucene Monitor). This lets you store a set of Lucene queries
and run them against every new document.

The Lucene Monitor allows for very high-performance matching (I know of 
situations with around 1m stored queries, monitoring 1m new documents a 
day running on a few tens of nodes) and it does this with some clever 
optimisations: effectively it builds an index of your stored queries, 
and turns each new document into a query across this index (I know it 
sounds confusing!). It's a 'reverse search'. Check out the original 
Luwak project as it's got links to several presentations and blogs 
showing how others have implemented these systems.
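
The "reverse search" idea can be illustrated with a toy sketch. This is not the Lucene Monitor API: it assumes whitespace-separated terms and queries that require all of their terms, purely to show the two steps described above (index the stored queries, then use each new document as a query over that index).

```python
from collections import defaultdict


class TinyMonitor:
    """Toy reverse search: index stored queries, match each new document."""

    def __init__(self):
        self.queries = {}                    # query id -> set of required terms
        self.term_index = defaultdict(set)   # term -> ids of queries using it

    def register(self, query_id, text):
        terms = set(text.lower().split())
        self.queries[query_id] = terms
        for term in terms:
            self.term_index[term].add(query_id)

    def match(self, document):
        doc_terms = set(document.lower().split())
        # Step 1: the document acts as a query over the index of stored
        # queries, selecting only queries that share a term with it.
        candidates = set()
        for term in doc_terms:
            candidates |= self.term_index.get(term, set())
        # Step 2: verify each candidate (every query term must be present).
        return sorted(q for q in candidates if self.queries[q] <= doc_terms)


monitor = TinyMonitor()
monitor.register("alert-1", "solr streaming")
monitor.register("alert-2", "kafka connector")
monitor.register("alert-3", "solr")
print(monitor.match("Notes on Solr streaming expressions"))
```

The point of step 1 is that most stored queries are never evaluated for a given document; only the candidates that share a term with it are checked in full.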

The bit you'll have to build is the Solr layer and then the code that
uses this to generate alerts. Solcolator and
https://github.com/o19s/solr-monitor are two examples of how to do the
first part, which you can build on. Reverse search is not built into
Solr yet, unlike Elasticsearch, which has its Percolator.

Best

Charlie


-- 
Charlie Hull - Managing Consultant at OpenSource Connections Limited 
<www.o19s.com>
Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828

OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
Amtsgericht Charlottenburg | HRB 230712 B
Geschäftsführer: John M. Woodell | David E. Pugh
Finanzamt: Berlin Finanzamt für Körperschaften II

Re: Email alerts with streaming expressions

Posted by Dan Rosher <ro...@gmail.com>.
Thanks Eric, Charlie and Yuval for all the feedback and suggestions.

Eric: Yes, I thought the monitoring might be a bit of a pain, especially with
millions of them. I'll have to check out the topic code, but I wondered if
I can look at the checkpoint collections for uniqueIds that haven't been
updated for a 'while', which might suggest the daemon had stopped/died,
rather than checking each daemon individually?
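
The staleness check described above could be sketched roughly as follows. It shows only the filtering logic and assumes a last-updated time is available for each checkpoint id; how that time would be read from the checkpoint collection is not covered in the thread, so the input mapping here is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_topics(last_updated, now, max_age):
    """Return ids whose checkpoint has not been updated within max_age.

    `last_updated` maps topic id -> time the checkpoint was last written
    (a hypothetical input; the thread does not say where it comes from).
    """
    return sorted(t for t, seen in last_updated.items() if now - seen > max_age)


now = datetime(2021, 9, 7, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "alert-1": now - timedelta(minutes=2),
    "alert-2": now - timedelta(hours=3),
    "alert-3": now - timedelta(days=1),
}
print(stale_topics(last_updated, now, timedelta(hours=1)))
```

A checkpoint may also stand still simply because no new documents matched, so a result from this check is a hint to investigate rather than proof that a daemon died.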

I was also wondering whether it's possible, or a useful enhancement, to look
at the replica index version (as opposed to _version_) for the topic
streaming expression, to skip queries where the replica index version is the
same as what we might store in the checkpoint collection? For collections that
update infrequently I think this might be useful.

Charlie: It was for email alerts, so a user stores a query for collection
docs to match against, and then the system emails matches to the user. Do
you think solr-monitor can be used for this purpose?

Yuval: I like the idea of using the UpdateProcessor, at least there's no
need for daemons or monitoring of them, but would this scale to millions
of email queries?

Many thanks again to all.

Kind regards,
Dan




On Mon, 6 Sept 2021 at 18:47, Yuval Paz <yu...@mail.huji.ac.il> wrote:

> My team and I are building upon Solcolator:
> https://github.com/SOLR4189/solcolator
>
> Currently the processor is built for Solr 6.5.1. We are working on updating
> our Solr, and I hope to release a complete version of our Solcolator as
> open source then (it will be for version 8.6.x).
>
> It is implemented as an update processor: either as the last element,
> replacing the usual processor that indexes the document, or as the
> next-to-last processor in the collection, which also allows monitoring of
> atomic updates (which is relatively costly).
>
> By making it an update processor we don't rely on the streaming daemon,
> which we found unsatisfying as we wish to allow users to define their own
> monitors over the index.
>
> On Mon, Sep 6, 2021, 8:25 PM Charlie Hull <chull@opensourceconnections.com>
> wrote:
>
> > Are you trying to monitor a stream of emails for certain patterns? In
> > which case you might look at the Lucene Monitor
> > https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
> > (https://issues.apache.org/jira/browse/LUCENE-8766), which was originally
> > Luwak. At my previous company Flax we helped build several large-scale
> > monitoring systems with this: https://github.com/flaxsearch/luwak . It's
> > not officially surfaced in Solr yet, although my colleague Scott Stults
> > has been working on some ideas: https://github.com/o19s/solr-monitor
> >
> > best
> > Charlie
> >
> > On 06/09/2021 14:32, Dan Rosher wrote:
> > > Hi,
> > >
> > > I was wondering if anyone had tried email alerts with streaming
> > > expressions, and what their experience was if attempting this with say
> 12
> > > million emails / day? Traditionally this might have been done with a
> > > database cursor iterator daily.
> > >
> > > I was thinking if something like the following pseudocode expression
> with
> > > 'kafka' as a custom push expression:
> > >
> > > daemon(id="alertId",
> > >         runInterval="1000",
> > >         kafka(
> > >          kafka_topic,
> > >          alertId,
> > >          topic(email_alerts,
> > >            doc_collection,
> > >            q="email query",
> > >            fl="id, title, abstract",
> > >            id="alertId",
> > >            initialCheckpoint=0)
> > >          )
> > >
> > > If you have done something like this 'where' would you typically run
> the
> > > daemon, on replicas away from replicas running web queries?
> > >
> > > Many thanks in advance for any advice / suggestions,
> > >
> > > Dan
> > >
> >
> > --
> > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > <www.o19s.com>
> > Founding member of The Search Network <https://thesearchnetwork.com/>
> > and co-author of Searching the Enterprise
> > <https://opensourceconnections.com/about-us/books-resources/>
> > tel/fax: +44 (0)8700 118334
> > mobile: +44 (0)7767 825828
> >
> > OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> > Amtsgericht Charlottenburg | HRB 230712 B
> > Geschäftsführer: John M. Woodell | David E. Pugh
> > Finanzamt: Berlin Finanzamt für Körperschaften II
> >
>

Re: Email alerts with streaming expressions

Posted by Yuval Paz <yu...@mail.huji.ac.il>.
My team and I are building upon this solcolator:
https://github.com/SOLR4189/solcolator

Currently the processor is built for Solr 6.5.1; we are working on updating
our Solr and I hope to release a complete version of our Solcolator as
open source then (it will be for version 8.6.x).

We make it an update processor, either by making it the last element
(replacing the usual processor that indexes the document) or by using it as
the second-to-last processor in the collection, which also allows monitoring
atomic updates (which is relatively costly).

By making it an update processor we don't rely on the streaming daemon,
which we found unsatisfying as we wish to allow users to define their own
monitors over the index.

On Mon, Sep 6, 2021, 8:25 PM Charlie Hull <ch...@opensourceconnections.com>
wrote:

> Are you trying to monitor a stream of emails for certain patterns? In
> which case you might look at the Lucene Monitor
>
> https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
> https://issues.apache.org/jira/browse/LUCENE-8766, which was originally
> Luwak - at my previous company Flax we helped build several large-scale
> monitoring systems with this https://github.com/flaxsearch/luwak . It's
> not officially surfaced in Solr yet although my colleague Scott Stults
> has been working on some ideas: https://github.com/o19s/solr-monitor
>
> best
> Charlie

Re: Email alerts with streaming expressions

Posted by Charlie Hull <ch...@opensourceconnections.com>.
Are you trying to monitor a stream of emails for certain patterns? In 
which case you might look at the Lucene Monitor 
https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html 
https://issues.apache.org/jira/browse/LUCENE-8766, which was originally 
Luwak - at my previous company Flax we helped build several large-scale 
monitoring systems with this https://github.com/flaxsearch/luwak . It's 
not officially surfaced in Solr yet although my colleague Scott Stults 
has been working on some ideas: https://github.com/o19s/solr-monitor
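
For readers unfamiliar with the monitoring ("reverse search") idea, the
following Python sketch illustrates the concept only: queries are stored up
front and each incoming document is matched against them. It is not the
Lucene Monitor or Luwak API. The QueryMonitor class is hypothetical, and
queries are simplified to sets of required terms. The term index used to
narrow down candidate queries before checking them is loosely in the spirit
of a pre-filtering step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class QueryMonitor:
    """Toy reverse search: stored queries are matched against documents."""

    def __init__(self) -> None:
        # query id -> set of terms that must all appear in a document
        self._queries: Dict[str, Set[str]] = {}
        # term -> ids of the stored queries that require that term
        self._by_term: Dict[str, Set[str]] = defaultdict(set)

    def register(self, query_id: str, terms: Iterable[str]) -> None:
        required = {t.lower() for t in terms}
        self._queries[query_id] = required
        for term in required:
            self._by_term[term].add(query_id)

    def match(self, text: str) -> List[str]:
        doc_terms = set(text.lower().split())
        # Pre-filter: only queries sharing at least one term are candidates.
        candidates: Set[str] = set()
        for term in doc_terms:
            candidates |= self._by_term.get(term, set())
        # Confirm each candidate: all of its terms must be in the document.
        return sorted(q for q in candidates if self._queries[q] <= doc_terms)
```

The point of the pre-filter is that a document is checked only against
queries that could possibly match, instead of against every stored query.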

best
Charlie

On 06/09/2021 14:32, Dan Rosher wrote:
> Hi,
>
> I was wondering if anyone had tried email alerts with streaming
> expressions, and what their experience was if attempting this with say 12
> million emails / day? Traditionally this might have been done with a
> database cursor iterator daily.
>
> I was thinking if something like the following pseudocode expression with
> 'kafka' as a custom push expression:
>
> daemon(id="alertId",
>         runInterval="1000",
>         kafka(
>          kafka_topic,
>          alertId,
>          topic(email_alerts,
>            doc_collection,
>            q="email query",
>            fl="id, title, abstract",
>            id="alertId",
>            initialCheckpoint=0)
>          )
>
> If you have done something like this 'where' would you typically run the
> daemon, on replicas away from replicas running web queries?
>
> Many thanks in advance for any advice / suggestions,
>
> Dan
>

-- 
Charlie Hull - Managing Consultant at OpenSource Connections Limited 
<www.o19s.com>
Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828

OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
Amtsgericht Charlottenburg | HRB 230712 B
Geschäftsführer: John M. Woodell | David E. Pugh
Finanzamt: Berlin Finanzamt für Körperschaften II

Re: Email alerts with streaming expressions

Posted by Eric Pugh <ep...@opensourceconnections.com>.
I don’t have a specific answer for you, but I do think this is a great use for streaming expressions!

I believe that with Solr you will need to monitor the process yourself; there isn’t any built-in support for restarting the daemon if it burps. I know that some commercial products take care of restarting the daemon for you.
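
As a rough illustration of what monitoring it yourself could look like, here is a small Python watchdog sketch. It assumes the /stream handler's action=list and action=start commands, that the list response has the usual "result-set"/"docs" shape, and that a stopped daemon reports a state of TERMINATED; these details should be checked against the Solr version in use. The function names and the base URL are hypothetical, and the requests must go to the core where the daemon was started.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Optional


def stopped_daemons(list_response: dict) -> List[str]:
    """Return ids of daemons whose reported state is TERMINATED (assumed)."""
    docs = list_response.get("result-set", {}).get("docs", [])
    return [d["id"] for d in docs
            if "id" in d and d.get("state") == "TERMINATED"]


def stream_action(base_url: str, collection: str, action: str,
                  daemon_id: Optional[str] = None) -> dict:
    """Call the stream handler with an action (list, start, ...)."""
    params = {"action": action}
    if daemon_id is not None:
        params["id"] = daemon_id
    url = f"{base_url}/{collection}/stream?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def restart_stopped(base_url: str, collection: str) -> List[str]:
    """List daemons once and restart any that have stopped."""
    listing = stream_action(base_url, collection, "list")
    restarted = []
    for daemon_id in stopped_daemons(listing):
        stream_action(base_url, collection, "start", daemon_id)
        restarted.append(daemon_id)
    return restarted
```

Something like restart_stopped("http://localhost:8983/solr", "email_alerts") could then be run from cron or another scheduler, so that one check covers every daemon on that core.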



> On Sep 6, 2021, at 9:32 AM, Dan Rosher <ro...@gmail.com> wrote:
> 
> Hi,
> 
> I was wondering if anyone had tried email alerts with streaming
> expressions, and what their experience was if attempting this with say 12
> million emails / day? Traditionally this might have been done with a
> database cursor iterator daily.
> 
> I was thinking if something like the following pseudocode expression with
> 'kafka' as a custom push expression:
> 
> daemon(id="alertId",
>       runInterval="1000",
>       kafka(
>        kafka_topic,
>        alertId,
>        topic(email_alerts,
>          doc_collection,
>          q="email query",
>          fl="id, title, abstract",
>          id="alertId",
>          initialCheckpoint=0)
>        )
> 
> If you have done something like this 'where' would you typically run the
> daemon, on replicas away from replicas running web queries?
> 
> Many thanks in advance for any advice / suggestions,
> 
> Dan

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	