Posted to dev@lucene.apache.org by "Dennis Gove (JIRA)" <ji...@apache.org> on 2015/12/18 05:56:46 UTC

[jira] [Comment Edited] (SOLR-7525) Add ComplementStream to the Streaming API and Streaming Expressions

    [ https://issues.apache.org/jira/browse/SOLR-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063456#comment-15063456 ] 

Dennis Gove edited comment on SOLR-7525 at 12/18/15 4:56 AM:
-------------------------------------------------------------

I'll rebase this off trunk so it is a little cleaner, but I think the use of ReducerStream still holds.

The purpose of Complement and Intersect is to return the tuples in A which do not (Complement) or do (Intersect) exist in B. The tuples in B aren't used for anything else and are dropped as soon as possible. They make use of the ReducerStream because B having 1 instance of some tuple found in A is the same as B having 100 instances of it. Whether it's 1 or 100, the tuple exists in B, so its twin in A can either be returned from A or not. For this reason the size of the ReducerStream can always just be 1: we only care about the first instance, and all others can be dropped from B. The fieldName (or fieldNames, because you can do an intersect on N fields) provided to the ReducerStream is the set of fields the Intersect or Complement stream is acting on.

Essentially, the goal is to take all the tuples in B and reduce them down to a unique list of tuples, where uniqueness is defined over the fields that the intersect or complement is being checked on. Given that B is then a set of unique tuples, it is much easier to know when to move on to the next tuple in B.
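
To make the idea concrete, here is a minimal sketch of the walk described above. It is not the code in the patch and does not use any Solr classes: the class and method names (SetOpSketch, uniqueSorted, filter) are made up for illustration, plain lists stand in for TupleStreams, and it assumes both inputs are already sorted on the compared fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: none of these names come from Solr. */
public class SetOpSketch {

    /** Collapse each run of equal keys in a sorted list to one entry (the "size 1" reduce step). */
    public static <T> List<T> uniqueSorted(List<T> sortedB, Comparator<T> comp) {
        List<T> out = new ArrayList<>();
        for (T t : sortedB) {
            if (out.isEmpty() || comp.compare(out.get(out.size() - 1), t) != 0) {
                out.add(t);
            }
        }
        return out;
    }

    /**
     * Walk sorted A against sorted, de-duplicated B.
     * keepMatches = true behaves like an intersect, false like a complement.
     */
    public static <T> List<T> filter(List<T> sortedA, List<T> sortedB,
                                     Comparator<T> comp, boolean keepMatches) {
        List<T> uniqueB = uniqueSorted(sortedB, comp);
        List<T> out = new ArrayList<>();
        int b = 0;
        for (T a : sortedA) {
            // B entries that sort before the current A tuple can never match a later A tuple.
            while (b < uniqueB.size() && comp.compare(uniqueB.get(b), a) < 0) {
                b++;
            }
            boolean inB = b < uniqueB.size() && comp.compare(uniqueB.get(b), a) == 0;
            if (inB == keepMatches) {
                out.add(a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 2, 4, 5);
        List<Integer> b = Arrays.asList(2, 2, 2, 5, 7);
        Comparator<Integer> comp = Comparator.<Integer>naturalOrder();
        System.out.println(filter(a, b, comp, true));   // prints [2, 2, 5]
        System.out.println(filter(a, b, comp, false));  // prints [1, 4]
    }
}
```

Because B is unique by the time the walk starts, the position in B only ever has to advance by one entry to get past a key, however many times that key appeared in the original B.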

I'll take a look at the GroupOperation, but I suspect that it can use a StreamEqualitor instead of a StreamComparator. A comparator defines an ordering, while an equalitor just checks whether two tuples are equal. There may be a reason it allows for ordering, though.
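
The difference between the two can be sketched as follows. The Equalitor interface here is a hypothetical stand-in written for this example, not Solr's StreamEqualitor, and its method name is an assumption. The point is only that any comparator can be narrowed to an equality check by discarding the ordering, while the reverse is not possible.

```java
import java.util.Comparator;

/** Illustrative only: Equalitor here is a hypothetical interface, not a Solr class. */
public class EqualitorSketch {

    /** Answers only "are these two equal?", with no notion of which one sorts first. */
    public interface Equalitor<T> {
        boolean test(T left, T right);
    }

    /** Narrow a comparator to an equalitor by keeping only the "compares as 0" case. */
    public static <T> Equalitor<T> fromComparator(Comparator<T> comp) {
        return (left, right) -> comp.compare(left, right) == 0;
    }

    public static void main(String[] args) {
        Equalitor<String> eq = fromComparator(String.CASE_INSENSITIVE_ORDER);
        System.out.println(eq.test("solr", "SOLR"));   // prints true
        System.out.println(eq.test("solr", "lucene")); // prints false
    }
}
```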


> Add ComplementStream to the Streaming API and Streaming Expressions
> -------------------------------------------------------------------
>
>                 Key: SOLR-7525
>                 URL: https://issues.apache.org/jira/browse/SOLR-7525
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrJ
>            Reporter: Joel Bernstein
>            Priority: Minor
>         Attachments: SOLR-7525.patch
>
>
> This ticket adds a ComplementStream to the Streaming API and Streaming Expression language.
> The ComplementStream will wrap two TupleStreams (StreamA, StreamB) and emit Tuples from StreamA that are not in StreamB.
> Streaming API Syntax:
> {code}
> ComplementStream cstream = new ComplementStream(streamA, streamB, comp);
> {code}
> Streaming Expression syntax:
> {code}
> complement(search(...), search(...), on(...))
> {code}
> Internal implementation will rely on the ReducerStream. The ComplementStream can be parallelized using the ParallelStream.


