Posted to user@cassandra.apache.org by Rajesh Radhakrishnan <Ra...@phe.gov.uk> on 2016/05/24 10:23:13 UTC

UUID coming as int while using SPARK SQL

Hi,


I have a Cassandra keyspace, but reading the data (especially the UUID column) via Spark SQL from Python does not return the value I expect.

Cassandra:
--------------
My table 'sam' is described below:

CREATE TABLE ks.sam (id uuid, dept text, workflow text, type double, PRIMARY KEY (id, dept));

SELECT id, workflow FROM sam WHERE dept='blah';

The example CQL above returns the following:
 id                                   | workflow
--------------------------------------+----------
 9547v26c-f528-12e5-da8b-001a4q3dac10 |   testWK


Spark/Python:
------------------
from pyspark import SparkConf
from pyspark.sql import SQLContext
import pyspark_cassandra
from pyspark_cassandra import CassandraSparkContext

....
conf = SparkConf().set("spark.cassandra.connection.host",IP_ADDRESS).set("spark.cassandra.connection.native.port",PORT_NUMBER)
sparkContext = CassandraSparkContext(conf = conf)
sqlContext = SQLContext(sparkContext)

samTable = sparkContext.cassandraTable("ks", "sam").select('id', 'dept', 'workflow')
samTable.cache()

# samdf is the DataFrame built from samTable (that step is not shown here)
samdf.registerTempTable("samd")

sparkSQLl = "SELECT distinct id, dept, workflow FROM samd WHERE workflow='testWK'"
new_df = sqlContext.sql(sparkSQLl)
results = new_df.collect()
for row in results:
    print "dept=",row.dept
    print "wk=",row.workflow
    print "id=",row.id
...
The Python code above prints the following:
dept=Biology
wk=testWK
id=293946894141093607334963674332192894528


You can see that the id (uuid), whose value in Cassandra is '9547v26c-f528-12e5-da8b-001a4q3dac10', comes back via Spark as the int '293946894141093607334963674332192894528'.
What am I doing wrong here? How do I get the correct UUID value from Cassandra via Spark/Python? Please help me.

Thank you
Rajesh R

**************************************************************************
The information contained in the EMail and any attachments is confidential and intended solely and for the attention and use of the named addressee(s). It may not be disclosed to any other person without the express authority of Public Health England, or the intended recipient, or both. If you are not the intended recipient, you must not disclose, copy, distribute or retain this message or any part of it. This footnote also confirms that this EMail has been swept for computer viruses by Symantec.Cloud, but please re-sweep any attachments before opening or saving. http://www.gov.uk/PHE
**************************************************************************

RE: UUID coming as int while using SPARK SQL

Posted by Rajesh Radhakrishnan <Ra...@phe.gov.uk>.

Found it!
That is, how to convert or represent the C* uuid when using Spark SQL:

uuid.UUID(int=idval)

Putting it into context:

...
import uuid
...
sparkSQLl = "SELECT distinct id, dept, workflow FROM samd WHERE workflow='testWK'"
new_df = sqlContext.sql(sparkSQLl)
results = new_df.collect()
for row in results:
    print "dept=",row.dept
    print "wk=",row.workflow
    print "id=",row.id.int
    print "uuid=",uuid.UUID(int=row.id.int)
...
The Python code above prints the following:
dept=blah
wk=testWK
id=293946894141093607334963674332192894528
uuid= 9547v26c-f528-12e5-da8b-001a4q3dac10
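
A minimal, standalone sketch of the conversion above, written for Python 3 (the thread itself uses Python 2 print statements). The helper name uuid_from_int and the sample UUID are made up for illustration; the sample is not a value from the table in this thread.

```python
import uuid


def uuid_from_int(value):
    """Rebuild a uuid.UUID from its 128-bit integer form."""
    return uuid.UUID(int=value)


# Hypothetical sample value, used only to show the round trip.
sample = uuid.UUID("12345678-1234-5678-1234-567812345678")
as_int = sample.int               # the kind of integer Spark handed back
restored = uuid_from_int(as_int)  # back to a UUID object

print(restored)  # 12345678-1234-5678-1234-567812345678
```

str() on the resulting object gives the dashed text form that cqlsh displays.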

Re: UUID coming as int while using SPARK SQL

Posted by "Laing, Michael" <mi...@nytimes.com>.

Yes - a UUID is just a 128 bit value. You can view it using any base or
format.

If you are looking at the same row, you should see the same 128 bit value,
otherwise my theory is incorrect :)

Cheers,
ml
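
One way to check the "same row, same 128 bit value" idea is to compare the text form from cqlsh with the integer from Spark directly. The sketch below assumes Python 3; the function name same_uuid and the sample values are hypothetical. A stand-in UUID is used because the id as posted in this thread contains 'v' and 'q', which are not hex digits.

```python
import uuid


def same_uuid(cql_text, spark_int):
    """True if the dashed text and the integer encode the same 128-bit value."""
    return uuid.UUID(cql_text.strip()).int == spark_int


# Hypothetical stand-in values, not rows from the table in this thread.
sample_text = "12345678-1234-5678-1234-567812345678"
sample_int = 0x12345678123456781234567812345678

print(same_uuid(sample_text, sample_int))      # True
print(same_uuid(sample_text, sample_int + 1))  # False
```

If the comparison is False for values that are supposed to come from the same row, the two queries are probably not selecting the same row.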


RE: UUID coming as int while using SPARK SQL

Posted by Rajesh Radhakrishnan <Ra...@phe.gov.uk>.

Hi Michael,

Thank you for the quick reply.
So you are suggesting that I convert this int value (the UUID comes back as an int via Spark SQL) to hex?


And the selection is just an example to highlight the UUID conversion issue.
So in Cassandra it should be
SELECT id, workflow FROM sam WHERE dept='blah';

And in Spark with Python:
SELECT distinct id, dept, workflow FROM samd WHERE dept='blah';


Best,
Rajesh R



Re: UUID coming as int while using SPARK SQL

Posted by "Laing, Michael" <mi...@nytimes.com>.

Try converting that int from decimal to hex and inserting dashes in the
appropriate spots - or go the other way.

Also, you are looking at different rows, based upon your selection
criteria...

ml
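
The decimal-to-hex suggestion can be sketched by hand as follows, assuming Python 3. The function names are made up for illustration, and the integer is the one printed earlier in this thread. The dashes follow the usual 8-4-4-4-12 grouping of the 32 hex digits.

```python
def int_to_uuid_text(value):
    """Format a 128-bit integer as 8-4-4-4-12 UUID text."""
    if not 0 <= value < 2 ** 128:
        raise ValueError("value does not fit in 128 bits")
    digits = "%032x" % value  # decimal -> hex, zero-padded to 32 digits
    groups = (digits[0:8], digits[8:12], digits[12:16],
              digits[16:20], digits[20:32])
    return "-".join(groups)


def uuid_text_to_int(text):
    """Go the other way: drop the dashes and parse the hex digits."""
    return int(text.replace("-", ""), 16)


spark_value = 293946894141093607334963674332192894528  # integer from this thread
as_text = int_to_uuid_text(spark_value)

print(as_text)
print(uuid_text_to_int(as_text) == spark_value)  # True: the round trip is lossless
```

In practice the standard uuid module does the same job; the manual version only shows that nothing more than a change of base and some dashes is involved.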
