Posted to issues@lucene.apache.org by "Varun Thacker (Jira)" <ji...@apache.org> on 2020/11/10 21:11:00 UTC

[jira] [Commented] (SOLR-14614) Add Simplified Aggregation Interface to Streaming Expression

    [ https://issues.apache.org/jira/browse/SOLR-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229525#comment-17229525 ] 

Varun Thacker commented on SOLR-14614:
--------------------------------------

nice!

> Add Simplified Aggregation Interface to Streaming Expression
> ------------------------------------------------------------
>
>                 Key: SOLR-14614
>                 URL: https://issues.apache.org/jira/browse/SOLR-14614
>             Project: Solr
>          Issue Type: Improvement
>          Components: query, query parsers, streaming expressions
>    Affects Versions: 7.7.2, 8.4.1
>            Reporter: Aroop
>            Assignee: Timothy Potter
>            Priority: Major
>
> For data analytics use cases, the standard workflow is:
>  # Find a pattern
>  # Aggregate by certain dimensions
>  # Compute metrics (like count, sum, avg)
>  # Sort by a dimension or metric
>  # Look at the top-n
> This functionality has been available through many different interfaces in Solr in the past, but only streaming expressions can deliver results in a scalable, performant and stable manner for systems that hold data at big-data scale.
> However, one barrier to entry is the query interface, which is not simple enough in streaming expressions.
> To give an example of how involved the corresponding streaming expression can get in order to work on large-scale systems: {color:#4c9aff} _find the top 10 cities where someone named Alex works, with the respective counts_{color}
> {code:java}
> qt=/stream&aggregationMode=facet&expr=
> select( top( rollup(sort(by%3D"city+asc",
>    +plist( 
>           select(facet(collection1,+q%3D"(*:*+AND+name:alex)",+buckets%3D"city",+bucketSizeLimit%3D"2010",+bucketSorts%3D"count(*)+desc",+count(*)),+city,+count(*)+as+Nj3bXa),
>           select(facet(collection2,+q%3D"(*:*+AND+name:alex)",+buckets%3D"city",+bucketSizeLimit%3D"2010",+bucketSorts%3D"count(*)+desc",+count(*)),+city,+count(*)+as+Nj3bXa)
>          )),
> 		+over%3D"city",+sum(Nj3bXa)),
> 	+n%3D"10",+sort%3D"sum(Nj3bXa)+desc"),
> +city,+sum(Nj3bXa)+as+Nj3bXa)
> {code}
> This is a query on an alias with 2 collections behind it, representing 2 data partitions, which is a requirement of sorts in big data systems. This is one of the only ways to get information from billions of records in a matter of seconds. This is awesome in terms of capability and performance.
> But one can see how involved this syntax is in the current scheme, and that is a barrier to entry for new adopters.
>  
> This Jira is to track the work of creating a simplified analytics endpoint augmenting streaming expressions.
> A starting proposal is for the endpoint to accept these query parameters:
> {code:java}
> /analytics?action=aggregate&q=*:*&fq=name:alex&dimensions=city&metrics=count&sort=count&sortOrder=desc&limit=10{code}
> This is equivalent to the SQL that an analyst would write:
> {code:java}
> select city, count(*) from collection where name = 'alex'
> group by city order by count(*) desc limit 10;{code}
> On the Solr side this would get translated to the best possible streaming expression using *rollups, top, sort, plist* etc., all done transparently to the user.
> Here's to making the power of streaming expressions simpler to use for all.
>  
>  
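
The translation described in the issue can be sketched roughly as below. This is only an illustration of the idea, not the implementation tracked by SOLR-14614: the function name build_expression, the alias "cnt" and the bucket_size_limit default are hypothetical, and the sketch handles only metrics=count. It mirrors the shape of the hand-written expression quoted above (one facet per collection inside plist, then sort, rollup, top and select).

```python
from urllib.parse import parse_qs


def build_expression(query_string, collections, bucket_size_limit=2010, alias="cnt"):
    """Turn the proposed simplified parameters into a streaming expression string.

    Hypothetical helper: it copies the structure of the expression quoted in
    the issue and is not taken from Solr itself.
    """
    # parse_qs returns a list per key; this sketch keeps the first value only.
    params = {key: values[0] for key, values in parse_qs(query_string).items()}
    if params.get("metrics") != "count":
        raise ValueError("this sketch only handles metrics=count")

    dim = params["dimensions"]
    order = params.get("sortOrder", "desc")
    limit = int(params.get("limit", "10"))

    # Fold the filter query into the main query, as the quoted expression does.
    query = params.get("q", "*:*")
    if "fq" in params:
        query = f'({query} AND {params["fq"]})'

    # One facet per collection (data partition) behind the alias.
    per_collection = [
        f'select(facet({name}, q="{query}", buckets="{dim}", '
        f'bucketSizeLimit="{bucket_size_limit}", '
        f'bucketSorts="count(*) {order}", count(*)), '
        f'{dim}, count(*) as {alias})'
        for name in collections
    ]

    # Run the facets in parallel, sort by the dimension so rollup can group
    # adjacent tuples, sum the per-partition counts, then keep the top n.
    merged = f'sort(by="{dim} asc", plist({", ".join(per_collection)}))'
    rolled = f'rollup({merged}, over="{dim}", sum({alias}))'
    top = f'top({rolled}, n="{limit}", sort="sum({alias}) {order}")'
    return f'select({top}, {dim}, sum({alias}) as {alias})'


if __name__ == "__main__":
    proposed = ("action=aggregate&q=*:*&fq=name:alex&dimensions=city"
                "&metrics=count&sort=count&sortOrder=desc&limit=10")
    print(build_expression(proposed, ["collection1", "collection2"]))
```

The resulting string would still need to be URL-encoded before being sent as the expr parameter to /stream, as in the quoted example.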



--
This message was sent by Atlassian Jira
(v8.3.4#803005)