Posted to user@spark.apache.org by Sunita Arvind <su...@gmail.com> on 2019/06/26 00:35:13 UTC

Challenges with Datasource V2 API

Hello Spark Experts,

I am having challenges using the DataSource V2 API in a custom data source I am building, and I created a mock project to reproduce the issue (linked below).

The input partitions seem to be created correctly. The below output
confirms that:

19/06/23 16:00:21 INFO root: createInputPartitions
19/06/23 16:00:21 INFO root: Create a partition for abc
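
For context, the partition planning in my source has roughly the shape below, assuming the Spark 2.4 interfaces in org.apache.spark.sql.sources.v2.reader. Class and column names here are placeholders, not my actual code (which is in the mock project linked below); the row reader itself is sketched further down.

```
import java.util.{ArrayList => JArrayList, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

class MockDataSourceReader extends DataSourceReader {

  override def readSchema(): StructType =
    StructType(Seq(StructField("id", StringType), StructField("usage", DoubleType)))

  // This is the step that produces the "createInputPartitions" and
  // "Create a partition for abc" log lines above: one InputPartition per logical key.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
    val partitions = new JArrayList[InputPartition[InternalRow]]()
    Seq("abc").foreach(key => partitions.add(new MockInputPartition(key)))
    partitions
  }
}

class MockInputPartition(key: String) extends InputPartition[InternalRow] {
  // Runs on the executor; hands a pre-fetched batch to the row reader sketched below.
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new MockPartitionReader(Seq((key, 1.2006451e7)))
}
```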

The InputPartitionReader seems to fetch the data correctly as well; however,
on the cluster it appears to loop indefinitely between the next() and get()
calls of the InputPartitionReader.
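
As I understand the contract, next() has to advance the cursor and eventually return false, otherwise Spark keeps calling next()/get() forever. A minimal reader along those lines (again with placeholder names, not my actual code) would look like this:

```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader
import org.apache.spark.unsafe.types.UTF8String

class MockPartitionReader(batch: Seq[(String, Double)])
    extends InputPartitionReader[InternalRow] {

  private var index = -1  // cursor sits before the first row

  override def next(): Boolean = {
    index += 1            // advance first, so get() never returns the same row twice
    index < batch.length  // must return false once the batch is exhausted,
                          // otherwise the next()/get() loop never terminates
  }

  override def get(): InternalRow = {
    val (id, usage) = batch(index)
    InternalRow(UTF8String.fromString(id), usage)  // string columns must be UTF8String
  }

  override def close(): Unit = ()  // release any connections/handles here
}
```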

I tried to mock this and here is the code for the mockup:
https://github.com/skopp002/SparkDatasourceV2.git

However, the looping issue does not surface in the mock project. One concern
that does show up is record duplication, which I had also noticed once in
production: there is only one record with a usage value of "1.2006451E7" in
mockdata.json, yet the load result contains multiple copies of it. Could this
duplication be what effectively produces unbounded data in production? In
production, even for a few KB of input, I hit the spilling below:
```
2019-06-23 16:07:29 INFO UnsafeExternalSorter:209 - Thread 47 spilling sort data of 1984.0 MB to disk (50 times so far)
2019-06-23 16:07:31 INFO UnsafeExternalSorter:209 - Thread 47 spilling sort data of 1984.0 MB to disk (51 times so far)
2019-06-23 16:07:33 INFO UnsafeExternalSorter:209 - Thread 47 spilling sort data of 1984.0 MB to disk (52 times so far)
```

But I could not reproduce that exact behavior in the mock project; probably
the mock data is too small to surface the problem.
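For reference, this is roughly how I am checking the duplication against the mock source in spark-shell (the format name here is a placeholder; the real short name/class is in the repo):

```
import spark.implicits._

// mockdata.json contains exactly one record with this usage value,
// so anything more than one row here means the reader is emitting duplicates.
val df = spark.read
  .format("com.example.mock")  // placeholder for the source registered in the repo
  .load()

df.filter($"usage" === 1.2006451e7).show(false)
```
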
Can someone review the code and tell me if I am doing something wrong?

regards
Sunita