
Posted to dev@spark.apache.org by "assaf.mendelson" <as...@rsa.com> on 2018/07/31 06:07:45 UTC

Data source V2

Hi all,
I am currently in the middle of developing a new data source (for an
internal tool) using data source V2.
I noticed that SPARK-24882
<https://issues.apache.org/jira/browse/SPARK-24882> is planned for 2.4 and
includes interface changes.

Are those changes planned in addition to the current interfaces, or are they
meant to replace them? I am asking specifically about the most basic read
path, as this is what I am using.

As a side note, is there any way to expose metrics from the data source? For
example, I would like to expose the number of rows read to the application.
Currently I add a per-partition index column and use a custom idempotent
accumulator that collects the maximum index for each partition.
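
To make the workaround above concrete, here is a minimal plain-Java sketch of
the idea. It does not use Spark at all: the class name MaxIndexAccumulator and
its methods are made up for illustration, and it assumes the index column is
zero-based and contiguous within each partition. The point it shows is that
keeping the maximum per partition, rather than a running sum, makes a retried
task harmless.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only: a plain-Java model of an idempotent row counter.
 * The add/merge shape loosely mirrors how an accumulator is updated on
 * executors and merged on the driver, but nothing here is a Spark API.
 */
final class MaxIndexAccumulator {
    // partition id -> highest zero-based row index observed so far
    private final Map<Integer, Long> maxIndexByPartition = new HashMap<>();

    /** Record that a row with the given zero-based index was read in a partition. */
    void add(int partitionId, long rowIndex) {
        maxIndexByPartition.merge(partitionId, rowIndex, Math::max);
    }

    /** Merge another accumulator; taking the max makes repeated merges harmless. */
    void merge(MaxIndexAccumulator other) {
        for (Map.Entry<Integer, Long> e : other.maxIndexByPartition.entrySet()) {
            add(e.getKey(), e.getValue());
        }
    }

    /** Total rows read: each partition contributes (max index + 1). */
    long totalRows() {
        long total = 0L;
        for (long maxIndex : maxIndexByPartition.values()) {
            total += maxIndex + 1;
        }
        return total;
    }
}

final class RowCountSketch {
    public static void main(String[] args) {
        MaxIndexAccumulator acc = new MaxIndexAccumulator();
        for (long i = 0; i < 5; i++) acc.add(0, i); // partition 0 reads 5 rows
        for (long i = 0; i < 3; i++) acc.add(1, i); // partition 1 reads 3 rows

        // Simulate a retried task that re-reads partition 1 from the start.
        MaxIndexAccumulator retry = new MaxIndexAccumulator();
        for (long i = 0; i < 3; i++) retry.add(1, i);
        acc.merge(retry);

        System.out.println(acc.totalRows()); // prints 8, not 11
    }
}
```

A summing counter would have reported 11 after the retry; the per-partition
maximum still yields 8. A partition that produces no rows contributes nothing
in this sketch.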

Thanks,
    Assaf.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Data source V2

Posted by Wenchen Fan <cl...@gmail.com>.

Hi assaf,

Thanks for trying data source v2! Data source v2 is still evolving (we
marked all the data source v2 interfaces as @Evolving), and we've already
made a lot of API changes in this release (some renaming, switching to
InternalRow, etc.). So I would not encourage people to use data source v2 in
long-term production until we mark it as stable (or experimental at
least). SPARK-24882 is also an API change, and I'd say people should wait to
implement a data source until it is either merged or rejected.

About metrics, it should be easy to add a mixin interface to report metrics.
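
As a rough illustration of what such a mixin could look like, here is a
self-contained Java sketch. Everything in it is hypothetical: the names
SupportsReportMetrics, SimpleRowReader, CountingReader and the metric key
"numRowsRead" are invented for this example and are not part of Spark's data
source v2 API. It only shows the opt-in pattern, where the engine checks
whether a reader implements the extra interface.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical mixin; NOT part of Spark's data source v2 API. */
interface SupportsReportMetrics {
    Map<String, Long> currentMetrics();
}

/** Hypothetical minimal reader contract, standing in for a v2 reader. */
interface SimpleRowReader {
    boolean next();
    String get();
}

/** A reader that opts in to metric reporting by also implementing the mixin. */
final class CountingReader implements SimpleRowReader, SupportsReportMetrics {
    private final String[] rows;
    private int position = -1;
    private long rowsRead = 0L;

    CountingReader(String... rows) {
        this.rows = rows.clone();
    }

    @Override
    public boolean next() {
        if (position + 1 >= rows.length) {
            return false;
        }
        position++;
        rowsRead++;
        return true;
    }

    @Override
    public String get() {
        return rows[position];
    }

    @Override
    public Map<String, Long> currentMetrics() {
        return Collections.singletonMap("numRowsRead", Long.valueOf(rowsRead));
    }
}

final class MetricsMixinSketch {
    /** Engine-side check: read metrics only if the reader opts in via the mixin. */
    static Map<String, Long> metricsOf(SimpleRowReader reader) {
        if (reader instanceof SupportsReportMetrics) {
            return ((SupportsReportMetrics) reader).currentMetrics();
        }
        return Collections.emptyMap();
    }

    public static void main(String[] args) {
        SimpleRowReader reader = new CountingReader("a", "b", "c");
        while (reader.next()) {
            reader.get(); // consume the row
        }
        System.out.println("numRowsRead = " + metricsOf(reader).get("numRowsRead"));
    }
}
```

Readers that do not implement the mixin simply report no metrics, so existing
sources would keep working unchanged under this kind of design.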

Thanks,
Wenchen

Re: Data source V2

Posted by vaclavkosar <ad...@vaclavkosar.com>.

For streaming, there is a StreamingQueryProgress event that provides the
number of input rows for each source. The number of output rows written is
currently not available in StreamingQueryProgress, but I submitted a PR for
that (https://github.com/apache/spark/pull/21919). If you are interested,
please vote on the corresponding issue.


