Posted to issues@spark.apache.org by "Jim Huang (Jira)" <ji...@apache.org> on 2020/03/26 19:09:00 UTC

[jira] [Updated] (SPARK-31276) Contrived working example that works with multiple URI file storages for Spark cluster mode

     [ https://issues.apache.org/jira/browse/SPARK-31276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Huang updated SPARK-31276:
------------------------------
    Description: 
The Spark SQL Guide --> Data Sources --> Generic Load/Save Functions page

[https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]

describes only a very simple load of an example file from the local file system.

 

I am looking for an example that demonstrates a workflow that exercises different file systems.  For example,
 # Load an input file from the driver's local file system
 # Add a simple column using lit() and store that DataFrame to HDFS in cluster mode
 # Write that same final DataFrame back to the driver's local file system

 

The examples I found on the internet only use simple paths without explicit URI prefixes.

Without an explicit URI prefix, a "filepath" inherits the mode in which Spark was launched, local standalone vs. cluster mode.  So the same bare "filepath" is read from and written to the local file system in local mode but to HDFS in cluster mode.

There are situations where a Spark program needs to deal with both the local file system and cluster-mode storage (big data) in the same Spark application, such as producing a summary table stored on the driver's local file system at the end.
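
To make the request concrete, here is a minimal sketch of the kind of end-to-end example I have in mind, using explicit URIs at every step.  The paths, the CSV schema, and the assumption that the cluster's default file system is HDFS are hypothetical placeholders, not a verified or official example:

{code:scala}
// Hypothetical sketch only: paths, schema, and cluster layout are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import java.nio.file.{Files, Paths}

val spark = SparkSession.builder().appName("multi-uri-example").getOrCreate()

// 1. Load an input file through an explicit file:// URI.  A file:// path is resolved
//    on whichever node the task runs, so in cluster mode the file must be visible to
//    the executors (e.g. via a shared mount), not just to the driver.
val df = spark.read.option("header", "true").csv("file:///tmp/people.csv")

// 2. Add a simple literal column with lit() and write the DataFrame to HDFS
//    through an explicit hdfs:// URI (assumes HDFS is the default file system;
//    otherwise spell out hdfs://<namenode>:<port>/...).
val tagged = df.withColumn("source", lit("local_csv"))
tagged.write.mode("overwrite").parquet("hdfs:///user/example/people_tagged")

// 3. Land a small summary on the driver's local file system.  Writing with a file://
//    URI here would place the part files on the worker nodes' local disks, so the
//    small result is collected to the driver and written with plain JVM IO instead.
val summary = tagged.groupBy("source").count()
val lines = summary.collect().map(r => s"${r.getString(0)},${r.getLong(1)}")
Files.write(Paths.get("/tmp/people_summary.csv"),
  ("source,count\n" + lines.mkString("\n")).getBytes)
{code}

Even a short documented example along these lines, spelling out where file:// paths are resolved when tasks run on executors, would answer most of this request.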

If there is any existing alternative Spark documentation that provides examples of the different URIs, I am happy to accept that as well.  Thanks!

  was:
The Spark SQL Guide --> Data Sources --> Generic Load/Save Functions page

[https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]

describes only a very simple load of an example file from the local file system.

 

I am looking for an example that demonstrates a workflow that exercises different file systems.  For example,
 # Load an input file from the driver's local file system
 # Add a simple column using lit() and store that DataFrame to HDFS in cluster mode
 # Write that same final DataFrame back to the driver's local file system

 

The examples I found on the internet only use simple paths without explicit URI prefixes.

Without an explicit URI prefix, a "filepath" inherits the mode in which Spark was launched, local standalone vs. cluster mode.  So the same bare "filepath" is read from and written to the local file system in local mode but to HDFS in cluster mode.

There are situations where a Spark program needs to deal with both the local file system and cluster-mode storage (big data) in the same Spark application, such as producing a summary table stored on the driver's local file system at the end.

Thanks.


> Contrived working example that works with multiple URI file storages for Spark cluster mode
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31276
>                 URL: https://issues.apache.org/jira/browse/SPARK-31276
>             Project: Spark
>          Issue Type: Wish
>          Components: Examples
>    Affects Versions: 2.4.5
>            Reporter: Jim Huang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org