Posted to user@spark.apache.org by Dan Bikle <bi...@gmail.com> on 2016/09/25 08:27:25 UTC

How to use Spark-Scala to download a CSV file from the web?

hello spark-world,

How to use Spark-Scala to download a CSV file from the web and load the
file into a spark-csv DataFrame?

Currently I depend on curl in a shell command to get my CSV file.

Here is the syntax I want to enhance:


/* fb_csv.scala
This script should load FB prices from Yahoo.

Demo:
spark-shell -i fb_csv.scala
*/

// I should get prices:
import sys.process._
"/usr/bin/curl -o /tmp/fb.csv http://ichart.finance.yahoo.com/table.csv?s=FB"!

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val fb_df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("/tmp/fb.csv")

fb_df.head(9)

I want to enhance the above script so it is pure Scala with no shell syntax
inside.

Re: How to use Spark-Scala to download a CSV file from the web?

Posted by Marco Mistroni <mm...@gmail.com>.
Hi,
I am not sure that spark-csv supports loading data from an http:// URL on
the web. I just tried this and got an exception:

scala> val df = sqlContext.read.
     | format("com.databricks.spark.csv").
     | option("inferSchema", "true").
     | load("http://ichart.finance.yahoo.com/table.csv?s=FB")
16/09/25 10:08:09 WARN : Your hostname, MarcoLaptop resolves to a
loopback/non-reachable address: fe80:0:0:0:3c1f:e7b4:c7cc:d2bd%wlan3, but
we couldn't find any external IP address!
java.io.IOException: No FileSystem for scheme: http


However, it does support reading from a CSV file, so you could write a Spark
program that:
1. downloads your FB data from Yahoo (I have code that does exactly what
you are doing, and I use the com.github.tototoshi.csv package for
downloading CSV data from the web)
2. creates an RDD (or a DataFrame) out of that
3. does whatever processing you need

hth
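
The download in step 1 can probably be done without curl or an extra library. Below is a minimal sketch that uses only the JDK classes java.net.URL and java.nio.file.Files. The names CsvDownload and download are hypothetical, chosen for illustration. The file is written to the local filesystem of the machine running spark-shell, the same as with the curl version.

```scala
import java.net.URL
import java.nio.file.{Files, Path, Paths, StandardCopyOption}

// Hypothetical helper (not part of Spark or spark-csv).
object CsvDownload {
  /** Copies the bytes behind `url` to `dest`, replacing any existing file.
    * Returns the number of bytes written. */
  def download(url: String, dest: Path): Long = {
    val in = new URL(url).openStream()
    try Files.copy(in, dest, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
  }
}

// Usage inside spark-shell, reusing sqlContext from the original script:
// CsvDownload.download("http://ichart.finance.yahoo.com/table.csv?s=FB", Paths.get("/tmp/fb.csv"))
// val fb_df = sqlContext.read.format("com.databricks.spark.csv")
//   .option("header", "true").option("inferSchema", "true").load("/tmp/fb.csv")
```

Steps 2 and 3 then stay exactly as in the original script, since spark-csv only ever sees the local path.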

Re: How to use Spark-Scala to download a CSV file from the web?

Posted by Jörn Franke <jo...@gmail.com>.
Use a tool like Flume and/or Oozie to reliably download files over HTTP and to handle errors (e.g. requests that time out). Afterwards, process the data with Spark.
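
If a separate tool is more than the job needs, the timeout and error handling mentioned above can be approximated in plain Scala. This is only a sketch and not a substitute for Flume or Oozie. The names RetryingDownload, retry and fetch are hypothetical, the timeout and attempt count in the usage comment are arbitrary, and only the JDK's URLConnection and scala.util.Try are used.

```scala
import java.net.URL
import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import scala.util.{Failure, Success, Try}

// Hypothetical helper (not part of Spark, Flume or Oozie).
object RetryingDownload {
  /** Runs `op` up to `attempts` times; returns the first success or the last failure. */
  def retry[T](attempts: Int)(op: => T): Try[T] = Try(op) match {
    case s @ Success(_)             => s
    case Failure(_) if attempts > 1 => retry(attempts - 1)(op)
    case f                          => f
  }

  /** Downloads `url` to `dest` with connect and read timeouts in milliseconds.
    * Returns the number of bytes written; throws on timeout or I/O error. */
  def fetch(url: String, dest: Path, timeoutMs: Int): Long = {
    val conn = new URL(url).openConnection()
    conn.setConnectTimeout(timeoutMs)
    conn.setReadTimeout(timeoutMs)
    val in = conn.getInputStream
    try Files.copy(in, dest, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
  }
}

// Usage: try up to 3 times, 10 seconds per attempt.
// val result = RetryingDownload.retry(3) {
//   RetryingDownload.fetch("http://ichart.finance.yahoo.com/table.csv?s=FB", Paths.get("/tmp/fb.csv"), 10000)
// }
// result match {
//   case Success(bytes) => println(s"downloaded $bytes bytes")
//   case Failure(e)     => println(s"giving up: ${e.getMessage}")
// }
```

Returning a Try keeps the failure visible to the caller, so the Spark job can decide whether to stop or carry on with an older copy of the file.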

> On 25 Sep 2016, at 10:27, Dan Bikle <bi...@gmail.com> wrote:
> 
> hello spark-world,
> 
> How to use Spark-Scala to download a CSV file from the web and load the file into a spark-csv DataFrame?
> 
> Currently I depend on curl in a shell command to get my CSV file.
> 
> Here is the syntax I want to enhance:
> 
> /* fb_csv.scala
> This script should load FB prices from Yahoo.
> 
> Demo:
> spark-shell -i fb_csv.scala
> */
> 
> // I should get prices:
> import sys.process._
> "/usr/bin/curl -o /tmp/fb.csv http://ichart.finance.yahoo.com/table.csv?s=FB"!
> 
> import org.apache.spark.sql.SQLContext
> 
> val sqlContext = new SQLContext(sc)
> 
> val fb_df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("/tmp/fb.csv")
> 
> fb_df.head(9)
> 
> I want to enhance the above script so it is pure Scala with no shell syntax inside.
>