Posted to commits@cassandra.apache.org by "Stefania (JIRA)" <ji...@apache.org> on 2015/11/05 10:11:27 UTC

[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements

    [ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991279#comment-14991279 ] 

Stefania edited comment on CASSANDRA-9304 at 11/5/15 9:11 AM:
--------------------------------------------------------------

Thank you for your input. 

Regarding version support for Windows, 2.2+ is fine, but for completeness I'll point out that the only obstacle left in 2.1 is the name of the file (_cqlsh_ -> _cqlsh.py_).

Regarding the problem with pipes, I've replaced pipes with queues so we don't need to deal with the low-level, platform-specific details. Queues can also be safely used from the callback threads, which was not the case for pipes.
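To illustrate the point about callback threads, here is a minimal sketch, not the actual cqlsh code. The names {{on_page}} and {{collect_pages}} are hypothetical, and for brevity the producer threads and the reader share one process, whereas in the real setup the reader would be a different process.

```python
import multiprocessing as mp
import threading


def on_page(rows, out_queue):
    """Hypothetical driver callback; assumed to run on a non-main thread."""
    # put() may be called from any thread: the queue does its own locking.
    out_queue.put(rows)


def collect_pages(pages):
    """Simulate one callback thread per page and gather the results."""
    out_queue = mp.Queue()
    threads = [threading.Thread(target=on_page, args=(page, out_queue))
               for page in pages]
    for t in threads:
        t.start()
    # Arrival order is not guaranteed, so callers should not rely on it.
    results = [out_queue.get(timeout=5) for _ in pages]
    for t in threads:
        t.join()
    return results
```

The queue hides the platform-specific handle management that raw pipes would expose, which is the reason given above for the switch.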

Regarding the problem with the driver, -I haven't tested in 2.2 but I don't think it matters which version since- I verified that the problem applies to 2.2 as well. Yesterday I was using the latest cassandra-test driver version; today I used 2.7.2. The column type is the same, {{cassandra.cqltypes.BytesType}}, and the method called from {{recv_result_rows()}} is the same, {{<bound method CassandraTypeType.from_binary of <class 'cassandra.cqltypes.BytesType'>>}}, but {{cls.deserialize}} in {{from_binary}} is a lambda in the case that works and the default implementation, {{CassandraType.deserialize}}, in the case that does not work. I don't know where the lambda comes from, but I noticed there is a Cython deserializer for {{BytesType}} in deserializers.pyx. I don't know how Cython works, but if this is what gets picked up in the normal case, then the problem is again with the way multiprocessing imports modules.

The problem can be solved by adding a {{deserialize}} implementation to {{BytesType}}, as is done for other types:

{code}
Stefi@Lila MINGW64 ~/git/cstar/python-driver ((2.7.2))
$ git diff
diff --git a/cassandra/cqltypes.py b/cassandra/cqltypes.py
index f39d28b..eb8d3b6 100644
--- a/cassandra/cqltypes.py
+++ b/cassandra/cqltypes.py
@@ -350,6 +350,10 @@ class BytesType(_CassandraType):
     def serialize(val, protocol_version):
         return six.binary_type(val)

+    @staticmethod
+    def deserialize(byts, protocol_version):
+        return bytearray(byts)
+

 class DecimalType(_CassandraType):
     typename = 'decimal'
{code}
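The effect of the patch can be sketched in isolation. The classes below are simplified stand-ins, not the driver's real classes: {{BaseType}}, {{UnpatchedBytesType}} and {{PatchedBytesType}} are hypothetical names, and the default {{deserialize}} returning its input unchanged is an assumption consistent with the blob arriving as a string.

```python
class BaseType(object):
    """Simplified stand-in for the driver's base type (hypothetical)."""

    @staticmethod
    def deserialize(byts, protocol_version):
        # Assumed default: hand the raw bytes back unchanged.
        return byts

    @classmethod
    def from_binary(cls, byts, protocol_version):
        # Dispatches to whichever deserialize the concrete class provides.
        if byts is None:
            return None
        return cls.deserialize(byts, protocol_version)


class UnpatchedBytesType(BaseType):
    """No override: from_binary falls back to BaseType.deserialize."""


class PatchedBytesType(BaseType):
    """Mirrors the override added by the diff above."""

    @staticmethod
    def deserialize(byts, protocol_version):
        return bytearray(byts)


raw = b"\x00\xff"
unpatched = UnpatchedBytesType.from_binary(raw, 3)  # stays a byte string
patched = PatchedBytesType.from_binary(raw, 3)      # becomes a bytearray
```

With the override in place, the result no longer depends on whether an optimized deserializer happened to be loaded in the worker process.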

If this is not enough and you want to debug some more, [~aholmber], you can use the attached 2.1 patch; I'm still working on the 2.2 merge. You need to generate a table with a blob column (I used cassandra-stress), then run {{COPY <anytable> TO 'anyfile';}} from cqlsh. This should result in a Unicode decode error on Windows because the blob is received as a string. If you prefer me to test things for you, that works too.



was (Author: stefania):
Thank you for your input. 

Regarding version support for Windows, fine for 2.2+ but for completeness I'll point out that the only obstacle left in 2.1 is the name of the file (_cqlsh_ -> _cqlsh.py_).

Regarding the problem with pipes, I've replaced pipes with queues so we don't need to deal with the low level platform specific details. Queues can also be safely used from the callback threads, which was not the case for pipes.

Regarding the problem with the driver, I haven't tested in 2.2 but I don't think it matters which version since yesterday I was using the latest cassandra-test driver version. Today I used 2.7.2. The column type is the same, {{cassandra.cqltypes.BytesType}}, the method called from {{recv_result_rows()}} is the same, {{<bound method CassandraTypeType.from_binary of <class 'cassandra.cqltypes.BytesType'>>}} but {{cls.serialize}} in {{from_binary}} is a lambda for the case that works and the default implementation {{CassandraType.deserialize}} for  the case that does not work. I don't know where the lambda comes from but I noticed there is a cython deserialize for {{BytesType}} in deserializers.pyx. I don't know how cython works but if this is picked up in the normal case then the problem is again with the way multiprocessing imports modules. 

The problem can be solved by adding a deserialize implementation to BytesType, like it's done for other types:

{code}
Stefi@Lila MINGW64 ~/git/cstar/python-driver ((2.7.2))
$ git diff
diff --git a/cassandra/cqltypes.py b/cassandra/cqltypes.py
index f39d28b..eb8d3b6 100644
--- a/cassandra/cqltypes.py
+++ b/cassandra/cqltypes.py
@@ -350,6 +350,10 @@ class BytesType(_CassandraType):
     def serialize(val, protocol_version):
         return six.binary_type(val)

+    @staticmethod
+    def deserialize(byts, protocol_version):
+        return bytearray(byts)
+

 class DecimalType(_CassandraType):
     typename = 'decimal'
{code}

If this is not enough and you want to debug some more [~aholmber], you can use the 2.1 patch attached. I'm still working on the 2.2. merge. You need to generate a table with a blob, I used cassandra-stress. Then run {{COPY <anytable> TO 'anyfile';}} from cqlsh and this should result in a Unicode decode error on Windows because the blob is received as a string. If you prefer me to test things for you, that works too.


> COPY TO improvements
> --------------------
>
>                 Key: CASSANDRA-9304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Minor
>              Labels: cqlsh
>             Fix For: 3.x, 2.1.x, 2.2.x
>
>
> COPY FROM has gotten a lot of love.  COPY TO not so much.  One obvious improvement could be to parallelize reading and writing (write one page of data while fetching the next).


