Posted to dev@trafodion.apache.org by Eric Owhadi <er...@esgyn.com> on 2015/07/22 20:56:52 UTC

Parallel scanner?

Hi All,
I have been looking at how we currently use the scanner. It looks like it
should not be too difficult to inject a parallel scanner in place of the
default serial scanner, since in many use cases we don't care about the
ordering of the data retrieved.
Key question: do we sometimes take advantage of the ordering (to do stuff
like merges), or are merges that require sorted input always done at the
ESP level anyway?
The question is whether we should offer both a serial scanner and a
parallel scanner (one preserving sort order, the other not), or whether we
could always enable the parallel scanner.
On implementation details, we can use a sophisticated algorithm that
conserves thread resources and auto-scales the parallelism based on how
fast the code calling next() consumes rows, or we can simply always use as
many threads as there are regions to scan, accepting that some threads
will wait() if the client next() code is not consuming fast enough.
I can prototype the simple one, then move on to auto-scaling the threads
once that is done.
The reason I need to know whether we should keep the serial scanner path is
to decide whether I should create a whole new wiring for the parallel
scanner, or whether I can just replace the serial scanner with the parallel
one (enabling one or the other at config time, purely for benchmarking).
Is anybody working on this already, or should I give it a try?
Regards,
Eric
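
For illustration, the "simple" variant described above (one thread per
region, all feeding a single unordered stream) could be sketched roughly as
follows. This is only a sketch under stated assumptions: the class name
ParallelScanSketch is hypothetical, each region is stood in for by an
in-memory list of rows instead of a real HBase scanner, and a bounded
blocking queue plays the role of the wait() mentioned in the post, so that
producer threads stall when next() is not called fast enough.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: one producer thread per region, one unordered output stream. */
final class ParallelScanSketch {
    // Sentinel a producer enqueues once its region is fully scanned.
    private static final Object REGION_DONE = new Object();

    private final BlockingQueue<Object> queue;
    private int regionsRemaining;

    /** Each inner list stands in for the rows of one region (an assumption of this sketch). */
    ParallelScanSketch(List<List<String>> regions, int queueCapacity) {
        final BlockingQueue<Object> q = new ArrayBlockingQueue<>(queueCapacity);
        this.queue = q;
        this.regionsRemaining = regions.size();
        for (final List<String> region : regions) {
            Thread producer = new Thread(() -> {
                try {
                    for (String row : region) {
                        q.put(row); // blocks while the consumer is behind
                    }
                    q.put(REGION_DONE);
                } catch (InterruptedException e) {
                    // A real scanner would also need error and cancellation handling.
                    Thread.currentThread().interrupt();
                }
            });
            producer.setDaemon(true);
            producer.start();
        }
    }

    /**
     * Returns the next row in arrival order, or null once every region is
     * exhausted. Intended to be called from a single consumer thread.
     */
    String next() {
        try {
            while (regionsRemaining > 0) {
                Object item = queue.take();
                if (item == REGION_DONE) {
                    regionsRemaining--;
                } else {
                    return (String) item;
                }
            }
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for rows", e);
        }
    }
}
```

Rows from a single region still arrive in that region's order, but rows from
different regions are intermingled nondeterministically, which is why an
operator that needs sorted input could not use this path as-is.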

RE: Parallel scanner?

Posted by Eric Owhadi <er...@esgyn.com>.

I posted this on the JIRA I created, but since there are no watchers, people
are not notified:


Anoop, you are talking about "merging sorted streams". In what I was going
to implement, the ESP or master executor would not see multiple streams,
but a single stream of unsorted data (not random data, but rows from
multiple regions scanned in parallel, intermingled in a single stream).
So for operators that need a sorted stream, that parallel scanner would not
be appropriate.
I hope this is still useful? I guess it is, since you would get
multi-threading parallelism on top of ESP (multi-process) parallelism.

Eric

RE: Parallel scanner?

Posted by Anoop Sharma <an...@esgyn.com>.

If data is needed in sorted order for an ORDER BY clause or for a merge join,
then the optimizer chooses, or can potentially choose, a plan that will ensure
sorted order.

This could be done by reading data in key order if only one partition
is being read, by reading data from multiple partitions sequentially if data
order is preserved across those partitions, by merging multiple
streams/partitions where each partition returns data in sorted order, or by
doing an external sort on the data returned from each partition and then
merging the results, if needed.
The Trafodion optimizer may or may not be doing all of this at this point.

If an ESP is reading data from multiple partitions/regions, and parallel
asynchronous functionality is added at the ESP level (this would be similar to
the PAPA (parallel access partition access) node in the early implementation),
then we need to make sure that the optimizer is aware of this runtime
functionality and chooses an appropriate plan by merging sorted streams.

anoop
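
As an illustration of the "merge of multiple streams/partitions where each
partition is returning data in sorted order" option, a k-way merge can be
sketched with a priority queue holding the current head row of each stream.
This is a minimal sketch and not Trafodion's executor code: the class name
SortedStreamMerge is hypothetical, rows are plain strings compared by natural
order, and each partition is represented by an Iterator that is assumed to
already return its rows in sorted order.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: merge several individually sorted streams into one sorted stream. */
final class SortedStreamMerge {
    /** Current head row of one partition stream, plus the stream it came from. */
    private static final class Head {
        final String row;
        final Iterator<String> source;

        Head(String row, Iterator<String> source) {
            this.row = row;
            this.source = source;
        }
    }

    /** Assumes every iterator in 'partitions' yields rows in ascending order. */
    static List<String> merge(List<Iterator<String>> partitions) {
        // The heap is ordered by row, so poll() always yields the smallest head.
        PriorityQueue<Head> heap =
            new PriorityQueue<>((x, y) -> x.row.compareTo(y.row));
        for (Iterator<String> partition : partitions) {
            if (partition.hasNext()) {
                heap.add(new Head(partition.next(), partition));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            merged.add(smallest.row);
            // Refill the heap from the stream that just supplied a row.
            if (smallest.source.hasNext()) {
                heap.add(new Head(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }
}
```

The merge needs each input to stay a separate, ordered stream, so it cannot
be applied to a single stream in which rows from several regions have
already been intermingled.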
