Posted to dev@spark.apache.org by Nasrulla Khan Haris <Na...@microsoft.com.INVALID> on 2020/06/04 06:40:43 UTC
preferredlocations for hadoopfsrelations based baseRelations
Hi Spark developers,
I have created a new format by extending `FileFormat`. I see that `getPreferredLocations` is available if a new custom RDD is created. Since `FileFormat` is based on `FileScanRDD`, which uses the readfile method to read a partitioned file, is there a way to add the desired preferred locations?
Appreciate your responses.
Thanks,
NKH
Re: preferredlocations for hadoopfsrelations based baseRelations
Posted by Steve Loughran <st...@cloudera.com.INVALID>.
Here's a class which lets you provide a function, on a row-by-row basis, to
declare the location:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala
It needs to be in o.a.spark, as something you need is scoped to the Spark
packages only.
I used it for a PoC of a distcp replacement: each row was a filename, so
the location of each row was the server with the first block of the file
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala#L137
It would be convenient if either the bits of the API I needed were public or
the extra RDD code just went in somewhere. It's nothing complicated.
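For comparison, here is a minimal sketch of per-element locality hints using only public API. It is not the linked `ParallelizedWithLocalityRDD` class; it relies on the `SparkContext.makeRDD` overload that takes `(element, hosts)` pairs and creates one partition per element. The object name, file paths, and host names are hypothetical, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object LocalityHintSketch {
  // Builds an RDD with one partition per element; each element carries its own host hints.
  // The explicit type argument selects the (element, hosts) overload of makeRDD.
  def withHints(sc: SparkContext, items: Seq[(String, Seq[String])]): RDD[String] =
    sc.makeRDD[String](items)

  // Collects the preferred hosts of every partition, in partition order.
  def hintsOf(rdd: RDD[String]): Seq[Seq[String]] =
    rdd.partitions.toSeq.map(p => rdd.preferredLocations(p))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("locality-hint-sketch")
      .getOrCreate()

    // Hypothetical input: each row is a filename plus the hosts assumed to hold its first block.
    val files = Seq(
      ("hdfs:///data/part-0000", Seq("host-a.example.org")),
      ("hdfs:///data/part-0001", Seq("host-b.example.org", "host-c.example.org"))
    )

    val rdd = withHints(spark.sparkContext, files)
    hintsOf(rdd).zipWithIndex.foreach { case (hosts, i) =>
      println(s"partition $i prefers ${hosts.mkString(", ")}")
    }
    spark.stop()
  }
}
```

The hints are preferences only: the scheduler may still run a task elsewhere if the preferred hosts are busy or unknown to the cluster.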
On Thu, 4 Jun 2020 at 09:31, ZHANG Wei <we...@outlook.com> wrote:
> AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
> method, which orders hosts by data size, to get the partition's
> preferred locations. If there are other criteria to sort by, I'm wondering
> if here[1] can be a place to add them. Or inheriting from class `FilePartition`
> with an overridden `preferredLocations()` might also work.
>
> --
> Cheers,
> -z
> [1]
> https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43
Re: preferredlocations for hadoopfsrelations based baseRelations
Posted by ZHANG Wei <we...@outlook.com>.
AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
method, which orders hosts by data size, to get the partition's
preferred locations. If there are other criteria to sort by, I'm wondering
if here[1] can be a place to add them. Or inheriting from class `FilePartition`
with an overridden `preferredLocations()` might also work.
--
Cheers,
-z
[1] https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43
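To make the "ordered by the data size" idea concrete, here is a minimal, Spark-free sketch of ranking hosts by the bytes they hold, plus one possible variant that puts caller-chosen hosts first. `FileSlice`, `PreferredLocationsSketch`, `withPinnedHosts`, and the default of three hosts are illustrative assumptions, not Spark's actual classes or guaranteed behaviour; the real logic lives at the link in [1].

```scala
// Simplified stand-in for a file split and the hosts that store it (not a Spark class).
final case class FileSlice(path: String, length: Long, hosts: Seq[String])

object PreferredLocationsSketch {
  // Ranks hosts by the total number of bytes they hold across all slices
  // and returns the top `maxHosts` (largest first, ties broken by host name).
  def byDataSize(files: Seq[FileSlice], maxHosts: Int = 3): Seq[String] = {
    val bytesPerHost = files
      .flatMap(f => f.hosts.map(h => h -> f.length))
      .groupBy(_._1)
      .map { case (host, pairs) => host -> pairs.map(_._2).sum }

    bytesPerHost.toSeq
      .sortBy { case (host, bytes) => (-bytes, host) }
      .take(maxHosts)
      .map(_._1)
  }

  // Hypothetical variant: hosts supplied by the caller come first,
  // followed by the size-ranked hosts, without duplicates.
  def withPinnedHosts(files: Seq[FileSlice], pinned: Seq[String], maxHosts: Int = 3): Seq[String] =
    (pinned ++ byDataSize(files, maxHosts)).distinct.take(maxHosts)
}
```

A subclass of `FilePartition` with an overridden `preferredLocations()` could, in principle, apply a ranking like `withPinnedHosts` in place of the purely size-based one; whether that subclass is actually picked up depends on where the partitions are constructed in the scan path.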