Posted to user@spark.apache.org by Sean Owen <sr...@gmail.com> on 2023/03/09 21:05:03 UTC

Re: How to share a dataset file across nodes

Put the file on HDFS, if you have a Hadoop cluster?

On Thu, Mar 9, 2023 at 3:02 PM sam smith <qu...@gmail.com> wrote:

> Hello,
>
> I submit my driver program to Hadoop in YARN client mode. The dataset I
> load is on the local file system, and when I invoke load("file://path")
> Spark complains that the CSV file cannot be found. I understand why: the
> dataset is not on any of the workers or the ApplicationMaster, only on
> the machine where the driver program resides.
> I tried to share the file using the configurations:
>
>> spark.yarn.dist.files or spark.files
>
> but neither works.
> My question is: how can I share the CSV dataset across the nodes at the
> specified path?
>
> Thanks.
>

Re: How to share a dataset file across nodes

Posted by Mich Talebzadeh <mi...@gmail.com>.

Try something like the following.

1) Put your CSV file, say cities.csv, in HDFS:
hdfs dfs -put cities.csv /data/stg/test
2) Read it into a DataFrame in PySpark:
csv_file = "hdfs://<HOST>:<PORT>/data/stg/test/cities.csv"
# read it in Spark with the built-in csv source (Spark 2.0+)
listing_df = (
    spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(csv_file)
)
listing_df.printSchema()
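
If it helps, the two steps above can be wrapped in small helpers. This is only a sketch under assumptions: the function names hdfs_uri, read_csv and main are made up for illustration, and "namenode" and 8020 are placeholder values, not details from this thread. Substitute your own NameNode host and port.

```python
def hdfs_uri(host: str, port: int, path: str) -> str:
    """Build an hdfs:// URI for an absolute path on the cluster."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /data/stg/test/cities.csv")
    return f"hdfs://{host}:{port}{path}"


def read_csv(spark, uri: str):
    """Read a CSV file that has a header row, letting Spark infer column types."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(uri)
    )


def main():
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-cities").getOrCreate()
    # "namenode" and 8020 are placeholders (assumptions), not values from the thread.
    df = read_csv(spark, hdfs_uri("namenode", 8020, "/data/stg/test/cities.csv"))
    df.printSchema()
```

Calling main() from the script you pass to spark-submit should work in YARN client mode, because the driver and every executor resolve the same hdfs:// path, unlike a file:// path that exists only on the driver machine.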


HTH

