Posted to user@spark.apache.org by Henrik Baastrup <he...@netscout.com> on 2016/01/07 18:53:34 UTC

Problems with reading data from Parquet files in HDFS remotely

Hi All,

I have a small Hadoop cluster where I have stored a lot of data in Parquet files. I have installed a Spark master service on one of the nodes and would now like to query my Parquet files from a Spark client. When I run the following program from the spark-shell on the Spark Master node, everything works correctly:

# val sqlCont = new org.apache.spark.sql.SQLContext(sc)
# val reader = sqlCont.read
# val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
# dataFrame.registerTempTable("BICC")
# val recSet = sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC WHERE endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000")
# recSet.show()  

But when I run the Java program below, from my client, I get: 

Exception in thread "main" java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/user/hdfs/parquet-multi/BICC.

The exception occurs at the line: DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");

On the Master node I can see the client connect when the SparkContext is instantiated, as I get the following lines in the Spark log:

16/01/07 18:27:47 INFO Master: Registering app SparkTest
16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID app-20160107182747-00801

If I create a local directory with the given path, my program goes into an endless loop, with the following warning on the console:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

To me it seems that my SQLContext does not connect to the Spark Master, but tries to work locally on the client, where the requested files do not exist (a quick check of the client's default filesystem is sketched after the program below).

Java program:
	SparkConf conf = new SparkConf()
		.setAppName("SparkTest")
		.setMaster("spark://172.27.13.57:7077");
	JavaSparkContext sc = new JavaSparkContext(conf);
	SQLContext sqlContext = new SQLContext(sc);
	
	DataFrameReader reader = sqlContext.read();
	DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
	DataFrame filtered = df.filter("endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000");
	filtered.show();
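
The exception above mentions file:/user/hdfs/parquet-multi/BICC, so it looks as if the path is resolved against the local file system on the client. A quick way to see which default filesystem an unqualified path resolves to is the standard Hadoop Configuration/FileSystem API (just a sketch; it needs org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.FileSystem on the classpath):

	// If no core-site.xml is picked up, fs.defaultFS stays at its default (file:///),
	// so unqualified paths resolve against the local file system.
	Configuration hadoopConf = sc.hadoopConfiguration();
	System.out.println("fs.defaultFS = " + hadoopConf.get("fs.defaultFS"));
	System.out.println("default FS   = " + FileSystem.get(hadoopConf).getUri()); // FileSystem.get throws IOException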

Is there someone who can help me?

Henrik


Re: Problems with reading data from Parquet files in HDFS remotely

Posted by Henrik Baastrup <he...@netscout.com>.
I solved the problem. I needed to tell the SparkContext about my Hadoop setup, so now my program is as follows:

    SparkConf conf = new SparkConf()
        .setAppName("SparkTest")
        .setMaster("spark://172.27.13.57:7077")
        .set("spark.executor.memory", "2g")   // We assign 2 GB of RAM to our job on each worker
        .set("spark.driver.port", "51810");   // Fix the port the driver will listen on, good for firewalls!
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Tell Spark about our Hadoop environment
    File coreSite = new File("/etc/hadoop/conf/core-site.xml");
    File hdfsSite = new File("/etc/hadoop/conf/hdfs-site.xml");
    Configuration hConf = sc.hadoopConfiguration();
    hConf.addResource(new Path(coreSite.getAbsolutePath()));
    hConf.addResource(new Path(hdfsSite.getAbsolutePath()));

    SQLContext sqlContext = new SQLContext(sc);

    DataFrameReader reader = sqlContext.read();
    DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
    DataFrame filtered = df.filter("endTime>=1449421800000000 AND endTime<=1449422400000000 AND calling='6287870642893' AND p_endtime=1449422400000000");
    filtered.show();
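
Alternatively, if the Hadoop XML files are not available on the client machine, I believe the default filesystem can also be set directly on the Hadoop configuration. A minimal sketch (assuming a plain, non-HA HDFS setup; <namenode-host> and 8020 are placeholders that must match fs.defaultFS in core-site.xml — note that 7077 is the Spark master port, not an HDFS port):

    // Placeholder NameNode address; take the real value from fs.defaultFS in core-site.xml
    sc.hadoopConfiguration().set("fs.defaultFS", "hdfs://<namenode-host>:8020");
    DataFrame df = sqlContext.read().parquet("/user/hdfs/parquet-multi/BICC");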

Henrik


Re: Problems with reading data from Parquet files in HDFS remotely

Posted by Henrik Baastrup <he...@netscout.com>.
Hi Ewan,

Thank you for your answer.
I have already tried what you suggest.

If I use:
    "hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC"
I get the AssertionError exception:
    Exception in thread "main" java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC.
Note: the IP address of my Spark Master is 172.27.13.57.

If I do exactly as you suggest:
    "hdfs:///user/hdfs/parquet-multi/BICC"
I get an IOException:
    Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:///user/hdfs/parquet-multi/BICC

To me it seems that the Spark library tries to resolve the URI locally, and I suspect I am missing something in my configuration of the SparkContext, but I do not know what.
Or could it be that I am using the wrong port in the hdfs:// URI above?
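Thinking about it, 7077 is the port my Spark master listens on, not an HDFS port; the NameNode RPC port is normally something like 8020, so a fully qualified URI would presumably look more like the sketch below (the actual host and port have to come from fs.defaultFS in the cluster's core-site.xml, not from my guess):

    // Sketch only: <namenode-host> and 8020 are placeholders, not a verified setup;
    // the real values are whatever fs.defaultFS in the cluster's core-site.xml says.
    DataFrame df = reader.parquet("hdfs://<namenode-host>:8020/user/hdfs/parquet-multi/BICC");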

Henrik






Re: Problems with reading data from Parquet files in HDFS remotely

Posted by Ewan Leith <ew...@realitymine.com>.
Try the path


"hdfs:///user/hdfs/parquet-multi/BICC"

Thanks,

Ewan


Re: Problems with reading data from Parquet files in HDFS remotely

Posted by Prem Sure <pr...@gmail.com>.
You may need to add a createDataFrame (for Python, with inferSchema) call before registerTempTable.

Thanks,

Prem

