Posted to dev@vxquery.apache.org by Preston Carman <pr...@apache.org> on 2015/09/10 19:35:59 UTC

How should we return the query result?

So this post may be a little rambling, but I hope it starts the discussion.

Apache VXQuery by default returns the result to the CLI and prints it to
the screen. In practice, I usually pipe the output to a file for review. Do
you think we should add an option to save the result to a file (local or
HDFS)? I think this will become an issue/speed concern as we start running
VXQuery in a YARN cluster [1]. Currently the CLI must be running for the
whole query to receive the result. It would be nice to decouple these
processes. However, that creates two issues: how do we know when the query
is complete, and how will we save the result?

Things to discuss:

alpha: Should we write the result to a file (local or HDFS)? Currently the
result is read and returned to the user through the CLI. The CLI could save
the result to a file instead. (sounds easy)

bravo: Can writing the result to a file be pushed into the Hyracks job? The
goal would be to allow the CLI to create and send the job while a separate
process reads the result once the job has finished. The client would be able
to disconnect from the server while the job is running and connect back
later to get the result (no more need for the CLI to be in a screen session).

charlie: What is the workflow we would like to see for running a query on a
YARN VXQuery cluster? See diagram [1].


[1]
https://docs.google.com/drawings/d/13_kP4Yt1ze_pgqQcbVLmlBOxE6aX0Pmjg3FT2q4XX2k/edit?usp=sharing

Re: How should we return the query result?

Posted by Till Westmann <ti...@apache.org>.

> On Sep 10, 2015, at 10:35 AM, Preston Carman <pr...@apache.org> wrote:
> 
> So this post may be a little rambling, but I hope it starts the discussion.
> 
> Apache VXQuery by default returns the result to the CLI and prints it to
> the screen. In practice, I usually pipe the output to a file for review. Do
> you think we should add an option to save the result to a file (local or
> HDFS)? I think this will become an issue/speed concern as we start running
> VXQuery in a YARN cluster [1]. Currently the CLI must be running for the
> whole query to receive the result. It would be nice to decouple these
> processes. However, that creates two issues: how do we know when the query
> is complete, and how will we save the result?
> 
> Things to discuss:
> 
> alpha: Should we write the result to a file (local or HDFS)? Currently the
> result is read and returned to the user through the CLI. The CLI could save
> the result to a file instead. (sounds easy)

Indeed, we could pass a command-line parameter that specifies where to put the result.
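
A minimal sketch of that idea, assuming a hypothetical "-result-file" flag (it is not an existing VXQuery option, and the class and method names are made up for illustration). It only covers a local file; writing to HDFS would go through the Hadoop FileSystem API instead and is not shown.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: "-result-file" is an illustrative flag, not a real VXQuery option.
final class ResultSink {
    private ResultSink() {}

    /** Returns the value following "-result-file", or null if the flag is absent. */
    static String resultFileArg(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-result-file".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /** Opens the named local file, or falls back to standard output if path is null. */
    static PrintStream open(String path) throws IOException {
        if (path == null) {
            return System.out;
        }
        // Default open options create the file or truncate an existing one.
        OutputStream out = Files.newOutputStream(Paths.get(path));
        return new PrintStream(out, true, "UTF-8");
    }
}
```

The CLI would then print the result to whatever ResultSink.open(ResultSink.resultFileArg(args)) returns. A caller should close the stream only when it is not System.out.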

> bravo: Can writing the result to a file be pushed into the Hyracks job? The
> goal would be to allow the CLI to create and send the job while a separate
> process reads the result once the job has finished. The client would be able
> to disconnect from the server while the job is running and connect back
> later to get the result (no more need for the CLI to be in a screen session).

I think that we have two different points here:
1) having the Hyracks job write the result (e.g. by putting a new operator on top of the plan that creates a tmp file on one of the NCs, writes the result to the file, and returns a reference to the file)
2) disconnecting from the CC before the job is done (in that case we’d need a way to communicate the state of the job)
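
To make point 2 concrete, here is a rough sketch of the kind of job-state record a client could query after reconnecting. All names are hypothetical and this is not the Hyracks client API; it only illustrates that the server side has to remember each job's state and, once the job is done, a reference to where the result was written (point 1).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical types for illustration only; not part of Hyracks or VXQuery.
enum JobState { RUNNING, FINISHED, FAILED }

final class JobRecord {
    final JobState state;
    final String resultPath; // reference to the result file; null unless FINISHED

    JobRecord(JobState state, String resultPath) {
        this.state = state;
        this.resultPath = resultPath;
    }
}

final class JobTracker {
    private final Map<String, JobRecord> jobs = new ConcurrentHashMap<>();

    /** Called when the job is submitted. */
    void started(String jobId) {
        jobs.put(jobId, new JobRecord(JobState.RUNNING, null));
    }

    /** Called when the job has written its result to resultPath. */
    void finished(String jobId, String resultPath) {
        jobs.put(jobId, new JobRecord(JobState.FINISHED, resultPath));
    }

    /** Called when the job has failed. */
    void failed(String jobId) {
        jobs.put(jobId, new JobRecord(JobState.FAILED, null));
    }

    /** Returns the record for a job, or null if the id is unknown. */
    JobRecord lookup(String jobId) {
        return jobs.get(jobId);
    }
}
```

A reconnecting client would call lookup with the job id it received at submission time and fetch the file only once the state is FINISHED.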

> charlie: What is the workflow we would like to see for running a query on a
> YARN VXQuery cluster? See diagram [1].

A question on the diagram: Do we need to run the CLI on the YARN cluster?
Generally it seems to me that we could just do option 1 for bravo above and write to HDFS instead of writing to a local file.

> [1]
> https://docs.google.com/drawings/d/13_kP4Yt1ze_pgqQcbVLmlBOxE6aX0Pmjg3FT2q4XX2k/edit?usp=sharing

Re: How should we return the query result?

Posted by Jochen Wiedmann <jo...@gmail.com>.

On Thu, Sep 10, 2015 at 7:35 PM, Preston Carman <pr...@apache.org> wrote:
> So this post may be a little rambling, but I hope it starts the discussion.
>
> Apache VXQuery by default returns the result to the CLI and prints it to
> the screen. In practice, I usually pipe the output to a file for review. Do
> you think we should add an option to save the result to a file (local or
> HDFS)? I think this will become an issue/speed concern as we start running
> VXQuery in a YARN cluster [1]. Currently the CLI must be running for the
> whole query to receive the result. It would be nice to decouple these
> processes. However, that creates two issues: how do we know when the query
> is complete, and how will we save the result?
>
> Things to discuss:
>
> alpha: Should we write the result to a file (local or HDFS)? Currently the
> result is read and returned to the user through the CLI. The CLI could save
> the result to a file instead. (sounds easy)

Based on my experiences with very large result sets, I strongly
recommend the following approach:

- First and foremost, have an internal API that specifies a kind of event
  listener, and have output always written to that event listener. (In the
  case of XML, the event listener would be a SAX ContentHandler.)
- The default event listener would simply serialize the output events into
  a stream, thereby implementing the functionality to write to standard
  output or a file. (In the case of XML, the default event listener would
  be a Transformer with a StreamResult.)
- Alternative, custom event listeners could (for example) filter and count
  events, discarding all data.
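
A minimal sketch of this listener approach using the standard JAXP/SAX classes. The class names ResultEvents and CountingHandler and the emit method are hypothetical stand-ins, not VXQuery code. The default listener here is a TransformerHandler (the ContentHandler form of an identity Transformer) writing to a StreamResult.

```java
import java.io.Writer;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Custom listener: counts elements and discards all data.
final class CountingHandler extends DefaultHandler {
    int elements;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
    }
}

// Hypothetical glue code for illustration.
final class ResultEvents {
    private ResultEvents() {}

    /** Default listener: serializes incoming SAX events to the given writer. */
    static ContentHandler serializer(Writer out) throws TransformerConfigurationException {
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler(); // identity transform
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    /** Stand-in for the query engine: sends the events of a tiny result to the listener. */
    static void emit(ContentHandler listener) throws SAXException {
        AttributesImpl noAttributes = new AttributesImpl();
        char[] text = "42".toCharArray();
        listener.startDocument();
        listener.startElement("", "result", "result", noAttributes);
        listener.startElement("", "item", "item", noAttributes);
        listener.characters(text, 0, text.length);
        listener.endElement("", "item", "item");
        listener.endElement("", "result", "result");
        listener.endDocument();
    }
}
```

Calling ResultEvents.emit(ResultEvents.serializer(writer)) writes the result as XML text, while passing a CountingHandler instead leaves its elements field at 2 without storing any data. The producer code is the same in both cases.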

Jochen


Re: How should we return the query result?

Posted by Michael Carey <mj...@ics.uci.edu>.

I think 'bravo' would be a good answer - we have such an option in
AsterixDB. (Not channeling the results back through a thin straw would
be a good idea - that way the system could be used to do large XML-based
ETL-type things.) Both local files and HDFS make sense as the target FS.
