Posted to issues@spark.apache.org by "Maciej Bryński (JIRA)" <ji...@apache.org> on 2016/07/15 10:21:20 UTC
[jira] [Updated] (SPARK-16569) Use Cython in Pyspark internals
[ https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maciej Bryński updated SPARK-16569:
-----------------------------------
Description:
CC: [~davies]
Many operations I do are like:
{code}
dataframe.rdd.map(some_function)
{code}
In PySpark this means creating a Row object for every record, which is slow.
IDEA:
Use Cython to speed up PySpark internals.
What do you think?
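For illustration, a minimal pure-Python sketch of the per-record wrapping described above (simplified, hypothetical stand-ins for pyspark.sql.Row and types.py's _create_row, not PySpark's actual code):

```python
# Simplified sketch of what happens per record in dataframe.rdd.map(...):
# each internal tuple is wrapped in a Row object before reaching user code,
# paying one tuple allocation plus attribute bookkeeping per record.

class Row(tuple):
    """Hypothetical, simplified stand-in for pyspark.sql.Row."""
    def __new__(cls, fields, values):
        row = tuple.__new__(cls, values)  # one allocation per record
        row.__fields__ = fields           # plus per-instance field bookkeeping
        return row

    def __getattr__(self, name):
        # Field access resolves names through the stored field list.
        try:
            return self[self.__fields__.index(name)]
        except ValueError:
            raise AttributeError(name)

def _create_row(fields, values):
    return Row(fields, values)

# Every record pays this cost; at ~200M records (see the profile below,
# _create_row / __new__ / __setattr__) the wrapping dominates the runtime.
records = [(i, i * 2) for i in range(3)]
rows = [_create_row(("id", "value"), r) for r in records]
```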
Sample profile:
{code}
============================================================
Profile of RDD<id=9>
============================================================
2000373036 function calls (2000312850 primitive calls) in 2045.307 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
14948 427.117 0.029 1811.622 0.121 {built-in method loads}
199920000 402.086 0.000 937.045 0.000 types.py:1162(_create_row)
199920000 262.708 0.000 262.708 0.000 {built-in method __new__ of type object at 0x9d1c40}
199920000 190.908 0.000 1219.794 0.000 types.py:558(fromInternal)
199920000 153.611 0.000 153.611 0.000 types.py:1280(__setattr__)
199920197 145.022 0.000 2024.126 0.000 rdd.py:1004(<genexpr>)
199920000 118.640 0.000 381.348 0.000 types.py:1194(__new__)
199920000 101.272 0.000 1321.067 0.000 types.py:1159(<lambda>)
200189064 91.928 0.000 91.928 0.000 {built-in method isinstance}
199920000 61.608 0.000 61.608 0.000 types.py:1158(_create_row_inbound_converter)
{code}
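As a rough illustration of the available headroom (pure Python, not actual Cython; FastRow and convert are hypothetical names), precomputing a schema-specific record type avoids the per-record __setattr__ and field-list bookkeeping that dominate the profile above; compiling such hot paths with Cython could shrink the remaining per-record cost further:

```python
from collections import namedtuple

# Hypothetical faster conversion: build one record type per schema up front,
# so each record is a single C-level tuple construction with no Python-level
# __setattr__ call and no field list stored on every instance.
FastRow = namedtuple("FastRow", ["id", "value"])

def convert(records):
    make = FastRow._make  # resolve the bound constructor once, not per record
    return [make(r) for r in records]

records = [(i, i * 2) for i in range(1000)]
rows = convert(records)
```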
was:
CC: [~davies]
Many operations I do are like:
{code}
dataframe.rdd.map(some_function)
{code}
In PySpark this means creating a Row object for every record, which is slow.
IDEA:
Use Cython to speed up PySpark internals.
Sample profile:
{code}
============================================================
Profile of RDD<id=9>
============================================================
2000373036 function calls (2000312850 primitive calls) in 2045.307 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
14948 427.117 0.029 1811.622 0.121 {built-in method loads}
199920000 402.086 0.000 937.045 0.000 types.py:1162(_create_row)
199920000 262.708 0.000 262.708 0.000 {built-in method __new__ of type object at 0x9d1c40}
199920000 190.908 0.000 1219.794 0.000 types.py:558(fromInternal)
199920000 153.611 0.000 153.611 0.000 types.py:1280(__setattr__)
199920197 145.022 0.000 2024.126 0.000 rdd.py:1004(<genexpr>)
199920000 118.640 0.000 381.348 0.000 types.py:1194(__new__)
199920000 101.272 0.000 1321.067 0.000 types.py:1159(<lambda>)
200189064 91.928 0.000 91.928 0.000 {built-in method isinstance}
199920000 61.608 0.000 61.608 0.000 types.py:1158(_create_row_inbound_converter)
{code}
> Use Cython in Pyspark internals
> -------------------------------
>
> Key: SPARK-16569
> URL: https://issues.apache.org/jira/browse/SPARK-16569
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 1.6.2, 2.0.0
> Reporter: Maciej Bryński
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)