Posted to dev@spark.apache.org by Nasrulla Khan Haris <Na...@microsoft.com.INVALID> on 2020/06/04 06:40:43 UTC

preferredlocations for hadoopfsrelations based baseRelations

Hi Spark developers,

I have created a new format by extending FileFormat. I see that getPreferredLocations can be overridden when a new custom RDD is created. Since FileFormat-based sources are read through FileScanRDD, which uses a readfile method to read each partitioned file, is there a way to add the desired preferred locations?

Appreciate your responses.

Thanks,
NKH

Re: preferredlocations for hadoopfsrelations based baseRelations

Posted by Steve Loughran <st...@cloudera.com.INVALID>.

Here's a class which lets you provide a function, on a row-by-row basis, to
declare each row's location:

https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala

It needs to be in the o.a.spark package because something it relies on is scoped to
the Spark packages only.

I used it for a PoC of a distcp replacement: each row was a filename, so
the location of each row was the server holding the first block of the file:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala#L137

It would be convenient if either the bits of the API I needed were public or
the extra RDD code just went in somewhere. It's nothing complicated.
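
For comparison, here is a minimal sketch of attaching per-row locations using only the public API. As far as I know, `SparkContext.makeRDD` accepts (element, hosts) pairs and creates one partition per element. Unlike the class linked above, the hosts have to be computed up front on the driver rather than by a function. The file names, host names, and the `LocalityHintSketch` helper are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalityHintSketch {
  // Builds an RDD with one partition per row and reports the hosts
  // Spark records as preferred for each partition.
  def hostsPerPartition(
      sc: SparkContext,
      rows: Seq[(String, Seq[String])]): Seq[Seq[String]] = {
    // The (element, hosts) overload of makeRDD keeps the hosts as location hints.
    val rdd = sc.makeRDD(rows)
    rdd.partitions.toSeq.map(p => rdd.preferredLocations(p))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("locality-hint-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical paths and hosts; a real job would look up block locations.
      val rows = Seq(
        ("/data/part-0000", Seq("host-a.example.org")),
        ("/data/part-0001", Seq("host-b.example.org", "host-c.example.org")))
      hostsPerPartition(sc, rows).zipWithIndex.foreach { case (hosts, i) =>
        println(s"partition $i -> ${hosts.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

These are only hints: the scheduler may still run a task elsewhere if the preferred hosts are busy or are not part of the cluster.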

On Thu, 4 Jun 2020 at 09:31, ZHANG Wei <we...@outlook.com> wrote:

> AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
> method, which orders hosts by data size, to get the partition's
> preferred locations. If there are other criteria to sort by, I'm wondering
> whether here[1] could be the place to add them. Alternatively, subclassing
> `FilePartition` and overriding `preferredLocations()` might also work.
>
> --
> Cheers,
> -z
> [1]
> https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: preferredlocations for hadoopfsrelations based baseRelations

Posted by ZHANG Wei <we...@outlook.com>.

AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
method, which orders hosts by data size, to get the partition's
preferred locations. If there are other criteria to sort by, I'm wondering
whether here[1] could be the place to add them. Alternatively, subclassing
`FilePartition` and overriding `preferredLocations()` might also work.
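
To make the "ordered by data size" idea concrete, here is a standalone sketch of that kind of ranking. It is not the Spark source. `FileSlice` is a hypothetical stand-in for the file entries held by a partition, and the default limit of three hosts and the tie-break on host name are my assumptions.

```scala
// Hypothetical stand-in for a file entry in a partition: only the fields used here.
final case class FileSlice(path: String, length: Long, hosts: Seq[String])

object PreferredHosts {
  // Ranks hosts by the total number of bytes they hold across all slices
  // and returns at most `limit` of them, largest first.
  def rank(slices: Seq[FileSlice], limit: Int = 3): Seq[String] = {
    val bytesPerHost: Map[String, Long] = slices
      .flatMap(s => s.hosts.map(h => h -> s.length))
      .groupBy(_._1)
      .map { case (host, pairs) => host -> pairs.map(_._2).sum }

    bytesPerHost.toSeq
      // Descending by bytes; ties broken by host name so the result is deterministic.
      .sortBy { case (host, bytes) => (-bytes, host) }
      .take(limit)
      .map(_._1)
  }
}
```

A different sort criterion would only change the `sortBy` key, which is roughly what an overridden `preferredLocations()` would need to do.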

-- 
Cheers,
-z
[1] https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43

