Posted to user@drill.apache.org by Dobes Vandermeer <do...@gmail.com> on 2020/02/26 01:27:13 UTC

Mongo filter push-down limitations?

Hi,

I am trying to understand Drill's performance and how we can best use it for our project. We use Mongo as our primary "live" database, and I am looking at syncing data to Amazon S3 and using Drill to run reports off of that.

I was hoping that I could have Drill connect directly to mongo for some things.

For example: our software is used to collect responses from school classrooms. I thought that if I was running a report for students in a given class, I could build the list of students at a school using a query to MongoDB.

I wanted to verify that Drill would push down filters when doing a join, maybe first collecting a list of ids it is interested in and using that as a filter when it scans the next Mongo collection.

However, when I look at the physical plan I don't see any evidence that it would do this; the plan shows the filter as null in this case.

I also tried a query that filtered on createdAt > date_sub(current_timestamp, interval '1' day), and it was not applied as a push-down filter (according to the physical plan tab). I had hoped Drill would compute the resulting timestamp and apply it as a filter when scanning the collection.
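To illustrate what I was hoping for: a pushed-down version of that predicate would amount to constant-folding the date math and handing Mongo a native filter. A rough pymongo-style sketch (the field name is from my schema; this is not what Drill actually emits):

```python
from datetime import datetime, timedelta, timezone

# SQL predicate: createdAt > date_sub(current_timestamp, interval '1' day)
# A planner could fold date_sub(...) down to a constant at plan time...
cutoff = datetime.now(timezone.utc) - timedelta(days=1)

# ...and push the comparison down as a native Mongo query document:
mongo_filter = {"createdAt": {"$gt": cutoff}}
```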

Is there some rule I can use to predict when a filter will be propagated to the mongo query?

Re: Mongo filter push-down limitations?

Posted by Dobes Vandermeer <do...@gmail.com>.
Hi Paul,

Thanks for your replies.  It does seem like this part of Drill is not doing as much as I had hoped, and it's out of my depth/budget to address it.  I might still be able to use Drill strictly for querying Parquet files.


Re: Mongo filter push-down limitations?

Posted by Paul Rogers <pa...@yahoo.com.INVALID> on 2/26/2020 12:42:16 PM.
Hi Dobes,

Good points as always. The way open source projects like Drill improve is to understand use cases such as yours, then implement them.

We discussed some of Drill's join optimizations, which, if added to the Mongo plugin, would likely solve your join problem. The process you describe is typical of an RDBMS: the optimizer notices that the cheapest path is to do a row-key lookup per join key (using a B-tree in an RDBMS, say). This was implemented for MapRDB, and can be added for Mongo.

On the deletion issue, one handy trick is to not actually delete a question: just mark it as deleted. That way, you can always compute a score, but if someone asks "which questions are available", only those not marked as deleted appear. Else, you might find you have some tricky race conditions and non-repeatable queries.
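As a sketch of that trick, with a plain Python list standing in for the Mongo collection and hypothetical field names:

```python
# a pure-Python stand-in for a Mongo "questions" collection
questions = [
    {"_id": 1, "text": "2 + 2 = ?", "deleted": False},
    {"_id": 2, "text": "Capital of France?", "deleted": False},
]

def soft_delete(qid):
    # mark the question deleted instead of removing the document,
    # so historical answers can still be scored against it
    for q in questions:
        if q["_id"] == qid:
            q["deleted"] = True

def available():
    # "which questions are available?" sees only non-deleted ones
    return [q for q in questions if not q["deleted"]]

soft_delete(2)
```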

Thanks,
- Paul

 


Re: Mongo filter push-down limitations?

Posted by Dobes Vandermeer <do...@gmail.com> on 2/26/2020 11:37:22 AM PST.
Hi Paul,

In my case I was looking to union and join.  I was thinking of using a join to build up a sort of filter on the parquet data based on the user's query.

Example:

We have "tags" that can be applied to each question, and we want a report of each student's average score per tag for a given time period.  Questions can also be deleted and we have to verify that a question is not deleted before including it in the score.

So, we will scan the answers table, but filtering on whether a question is deleted, and grouping on the tags on each question, then take an average.

It seems that, the way Drill currently functions, if I wanted to get a question's tags and deleted status from MongoDB, Drill would load the entire collection of questions, which is too slow.

What I had hoped for is that Drill would be able to scan the answers, gather up a list of the question ids of interest, and query MongoDB for those questions only, with some kind of grouping.
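Roughly, in pure-Python terms (hypothetical field names; the find() call is only indicated in a comment):

```python
# rows already scanned from the answers table (e.g. the parquet side)
answers = [
    {"questionId": 10, "score": 1},
    {"questionId": 11, "score": 0},
    {"questionId": 10, "score": 1},
]

# gather the distinct question ids of interest...
ids = sorted({a["questionId"] for a in answers})

# ...and fetch only those documents, e.g. db.questions.find(lookup),
# instead of scanning the whole collection
lookup = {"_id": {"$in": ids}}
```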

As for the union, I was also hoping to pull the most recent answers from MongoDB and union those with the ones from the S3 parquet files. However, in a brief test querying answers from MongoDB via Drill, it tried to load the entire collection.

My feeling at the moment is that Drill is not very useful for combining MongoDB data with other data sources, because I will constantly hit cases where I accidentally pull down an entire collection, or where it raises an error because the data does not conform to a fixed tabular schema.




Re: Mongo filter push-down limitations?

Posted by Paul Rogers <pa...@yahoo.com.INVALID> on 2/26/2020 12:04:38 AM.
Hi Dobes,

Sounds like the Mongo filter push-down logic might need some TLC.

You describe the classic "lambda" architecture: historical data in system A, current data in system B. In this case, it would be more of a union than a join. Drill handles this well. But, the user has to know how to write a query that does the union.

At a prior job, we wrote a front end that rewrote queries so the user just asks for, say, "testScores", and the rewrite layer figures out that a time range more than a week old goes to long-term storage, while anything newer goes to current storage. If current storage is faster, then, of course, some customers want a longer retention period in current storage to get faster queries. This means that the cut-off point is not fixed: it differs per customer (or data type).
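A toy version of that rewrite layer, with the store names and the one-week default made up for illustration:

```python
from datetime import datetime, timedelta, timezone

def route(query_start, retention=timedelta(weeks=1), now=None):
    # pick a backing store for a time-range query such as "testScores";
    # the retention period varies per customer, so it is a parameter
    now = now or datetime.now(timezone.utc)
    if query_start < now - retention:
        return "long_term"   # e.g. S3 / parquet
    return "current"         # e.g. Mongo
```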

Would be cool to do this logic in Drill itself. Can probably even do it today with a cleverly written storage plugin that, during planning, rewrites itself out of the query in favor of the long-term and short-term data sources. (Calcite, Drill's query planner, is quite flexible.)


Once Drill has data, it can join it with any other data source. Drill comes from the "big data, scan the whole file" tradition, so the most basic join requires a scan of both tables. There is "partition filter push-down" for directories which works on each table individually. There is also a "join-predicate push-down" (JPPD) feature added a while back. A couple of years ago, Drill added the ability to push keys into queries (as would be done for an old-school DB with indexes.)

I believe the Mongo plugin was done before most of the above work was added, so some work may be needed to get Mongo onto these newer features.


Thanks,
- Paul

 


Re: Mongo filter push-down limitations?

Posted by Dobes Vandermeer <do...@gmail.com> on 2/25/2020 10:23:59 PM PST.
Hi Paul,

A simple filter I tried was: WHERE createdAt > TIMESTAMP '2020-02-25'

This wasn't pushed down.

I think I recall another query where it did send a filter to MongoDB, so I was curious what I could expect to be applied at the MongoDB level and what would not.

Would drill be able to do joins between queries where it pushes down filters for the elements that were found?  By the sounds of it, this may be quite far off, which does reduce Drill's appeal vs competitors to some degree.

I had hoped that Drill could intelligently merge historical data saved as parquet with the latest data in mongodb, giving a kind of hybrid reporting approach that gives current data without overloading mongodb to pull millions of historical records.  However, it sounds like this is not supported yet, and likely won't be for some time.
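What I had in mind is essentially a UNION ALL split on a cutoff timestamp; a toy sketch with made-up rows:

```python
cutoff = "2020-02-25"

# historical rows from parquet, recent rows from Mongo; each side is
# filtered against the cutoff so no row is counted twice
parquet_rows = [{"id": 1, "createdAt": "2020-02-20"},
                {"id": 2, "createdAt": "2020-02-26"}]
mongo_rows = [{"id": 2, "createdAt": "2020-02-26"},
              {"id": 3, "createdAt": "2020-02-27"}]

merged = ([r for r in parquet_rows if r["createdAt"] < cutoff] +
          [r for r in mongo_rows if r["createdAt"] >= cutoff])
```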
On 2/25/2020 8:19:19 PM, Paul Rogers <pa...@yahoo.com.invalid> wrote:

Re: Mongo filter push-down limitations?

Posted by Dobes Vandermeer <do...@gmail.com>.
Charles,

Yeah, makes sense to me, although I guess sorting would have to be applied as well in that case.  Does it push down sorting to the storage plugin?

On 2/25/2020 8:21:42 PM, Charles Givre <cg...@gmail.com> wrote:

Re: Mongo filter push-down limitations?

Posted by Charles Givre <cg...@gmail.com>.
Hey Paul, Dobes, 
I was thinking about this, and it would seem that a LIMIT pushdown would be an obvious candidate for most plugins as well.  I don't think that is in the Mongo plugin, but that would be a good addition.  
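As a quick hypothetical check (plugin and table names made up), one could compare the plan for a LIMIT query against the scan it produces:

```sql
EXPLAIN PLAN FOR
SELECT * FROM mongo.mydb.responses
LIMIT 10;
-- With LIMIT push-down, the scan spec in the JSON plan would carry the
-- limit itself; without it, you'd see a Limit operator trimming rows
-- above a scan that reads everything.
```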
--C 

> On Feb 25, 2020, at 11:19 PM, Paul Rogers <pa...@yahoo.com.INVALID> wrote:


Re: Mongo filter push-down limitations?

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Dobes,

Your use case is exactly the one we hope Drill can serve: integrate data from multiple sources. We may have to work on Drill a bit to get it there, however.

A quick check of Mongo shows that it does implement filter push-down: see the class MongoPushDownFilterForScan; the details appear to be in MongoFilterBuilder. This particular implementation appears to be rather limited: it seems to push either ALL filters or none. A more advanced implementation would push those it can handle, leaving the rest to Drill.


There may be limitations; it depends on what the plugin author implemented. What kind of query did you run where you saw no push-down? And how did you check the plan? Using an EXPLAIN PLAN FOR ... command? If filters are, in fact, pushed down, there has to be some trace of them in the JSON plan (in some Mongo-specific format).

Given the all-or-nothing limitation of the Mongo plugin implementation, maybe try the simplest possible query such as classID = 10.
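For instance (workspace and table names here are hypothetical), run something like this and look inside the Mongo group scan in the JSON plan:

```sql
EXPLAIN PLAN FOR
SELECT * FROM mongo.mydb.students
WHERE classID = 10;
-- If the push-down fired, the scan entry should embed a Mongo-style
-- predicate (roughly {"classID": 10}); if not, you'll see a separate
-- Filter operator above an unfiltered scan.
```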


Filter push-down is a common operation, but most implementations are currently (incomplete) copy/pastes of other (incomplete) implementations. We're working to fix that. We had a PR for the standard (col RELOP const) cases, but reviewers asked that it be made more complete. That PR does handle partial filter push-down. Perhaps, as we move forward, we can apply the same ideas to Mongo.

Thanks,
- Paul

 

    On Tuesday, February 25, 2020, 5:27:53 PM PST, Dobes Vandermeer <do...@gmail.com> wrote:  
 