Posted to user@spark.apache.org by Marco Didonna <m....@gmail.com> on 2017/05/21 15:50:37 UTC

Sampling data on RDD vs sampling data on Dataframes

Hello,

My team and I have developed a fairly large big data application using
only the DataFrame API (Spark 1.6.3). Since our application uses machine
learning for prediction, we need to sample the training dataset so as not
to have skewed data.

To achieve this we use stratified sampling: as you probably all know,
DataFrameStatFunctions provides a useful sampleBy method that supposedly
carries out stratified sampling based on the fraction map passed as
input. A few questions have arisen:

- the sampleBy method seems to return variable results for the same
input data, so it looks more like an *approximate* stratified sampling.
Inspection of the Spark source code seems to confirm this hypothesis. The
documentation mentions neither this approximation nor a confidence
interval guaranteeing how good the approximation is supposed to be.
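For reference, this is roughly how we invoke it (the "label" column and the
fraction values are illustrative, not our real ones). As far as I can tell
from the source, each row is kept or dropped by an independent Bernoulli
trial with the stratum's probability, which is why the per-stratum counts
only approximate fraction * count:

```scala
import org.apache.spark.sql.DataFrame

// Illustrative sketch: keep ~50% of rows with label 0 and ~80% with label 1.
// Fixing the seed makes the selection reproducible on the same data, but the
// resulting per-stratum sizes are still only approximately fraction * count,
// because each row is sampled by an independent coin flip.
def stratifiedSample(df: DataFrame): DataFrame =
  df.stat.sampleBy("label", Map(0 -> 0.5, 1 -> 0.8), seed = 42L)
```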

- in the RDD world there is a sampleByKeyExact method which clearly states
that it will produce a sampled dataset with tight guarantees ... is there
anything like that in the DataFrame world?

Has anybody in the community worked around these shortcomings of the
DataFrame API? I'm well aware that I can get an RDD from a DataFrame,
perform sampleByKeyExact, and then convert the RDD back to a DataFrame.
I'd really like to avoid that conversion, if possible.
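For the record, the round-trip I'd like to avoid looks roughly like the
sketch below (again, the "label" stratum column is illustrative, and the
helper name is mine). sampleByKeyExact pays for its exact per-stratum sizes
with additional passes over the data, which is part of my hesitation:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical round-trip: DataFrame -> pair RDD keyed by the stratum
// column -> sampleByKeyExact (exact per-key sample sizes, at the cost of
// extra passes over the data) -> back to a DataFrame with the same schema.
def exactStratifiedSample(df: DataFrame, fractions: Map[Any, Double])
                         (implicit sqlContext: SQLContext): DataFrame = {
  val schema = df.schema
  val keyed = df.rdd.keyBy(row => row.getAs[Any]("label"))
  val sampled = keyed.sampleByKeyExact(withReplacement = false, fractions)
  sqlContext.createDataFrame(sampled.values, schema)
}
```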

Thank you for any help you people can give :)

Best,

Marco