Posted to dev@spark.apache.org by _na <ni...@seeq.com> on 2016/04/28 18:34:17 UTC

Using Spark when data definitions are unknowable at compile time

We are looking to incorporate Spark into a timeseries data investigation
application, but we are having a hard time transforming our workflow into
the required transformations-on-data model. The crux of the problem is that
we don’t know a priori which data will be required for our transformations.
 
For example, a common request might be `average($series2.within($ranges))`,
where in order to fetch the right sections of data from $series2, $ranges
will need to be computed first and then used to define data boundaries.
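
To make the dependency concrete, here is a rough sketch in Spark's Scala API of
the shape we keep running into (the column names, literal data, and the +/-25
window below are made up purely for illustration). The boundaries on $series2
cannot even be written down until $ranges has been materialized:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("dependent-query-sketch").getOrCreate()
import spark.implicits._

// Two illustrative series of (timestamp, value) samples
val series1 = Seq((0L, 1.0), (100L, 9.0), (200L, 2.0)).toDF("ts", "value")
val series2 = Seq((50L, 4.0), (150L, 6.0), (250L, 8.0)).toDF("ts", "value")

// Step 1: compute $ranges; an action has to run before step 2 can be expressed
val ranges: Array[(Long, Long)] = series1
  .filter($"value" > 5.0)
  .select($"ts")
  .collect()
  .map(r => (r.getLong(0) - 25L, r.getLong(0) + 25L))

// Step 2: only now do we know the boundaries for $series2
val inRanges = series2.filter(row =>
  ranges.exists { case (lo, hi) => row.getLong(0) >= lo && row.getLong(0) <= hi })

// average($series2.within($ranges))
val result = inRanges.agg(avg($"value")).first.getDouble(0)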
 
Is there a way to get around the need to define data first in Spark?



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Spark-when-data-definitions-are-unknowable-at-compile-time-tp17371.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Using Spark when data definitions are unknowable at compile time

Posted by Dean Wampler <de...@gmail.com>.
I would start with using DataFrames and the Row
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row>
API, because you can fetch fields by index. Presumably, you'll parse the
incoming data and determine which fields have which types. Or will someone
specify the schema dynamically somehow?

Either way, once you know the types and indices of the fields you need for
a given query, you can fetch them using the Row methods.
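
For example (just a sketch, with a made-up averageOfColumn helper and only a
couple of numeric types handled), once a column name arrives at query time you
can look up its index and type in the schema and go through the generic Row
accessors:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DoubleType, LongType}

def averageOfColumn(df: DataFrame, columnName: String): Double = {
  // Resolve the field's index and type at runtime from the DataFrame's schema
  val idx = df.schema.fieldIndex(columnName)
  val dataType = df.schema(idx).dataType

  // Pull values out through the generic Row accessors
  val values = df.rdd.map { row =>
    dataType match {
      case DoubleType => row.getDouble(idx)
      case LongType   => row.getLong(idx).toDouble
      case _          => row.get(idx).toString.toDouble  // illustrative fallback only
    }
  }
  values.mean()
}

Nothing about the schema has to be fixed at compile time; the query layer just
resolves names to indices and types when the request shows up.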

HTH,

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Lightbend <http://lightbend.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Apr 28, 2016 at 11:34 AM, _na <ni...@seeq.com> wrote:

> We are looking to incorporate Spark into a timeseries data investigation
> application, but we are having a hard time transforming our workflow into
> the required transformations-on-data model. The crux of the problem is that
> we don’t know a priori which data will be required for our transformations.
>
> For example, a common request might be `average($series2.within($ranges))`,
> where in order to fetch the right sections of data from $series2, $ranges
> will need to be computed first and then used to define data boundaries.
>
> Is there a way to get around the need to define data first in Spark?