Posted to issues@spark.apache.org by "Jan-Willem van der Sijp (JIRA)" <ji...@apache.org> on 2018/08/09 12:07:00 UTC

[jira] [Created] (SPARK-25072) PySpark custom Row class can be given extra parameters

Jan-Willem van der Sijp created SPARK-25072:
-----------------------------------------------

             Summary: PySpark custom Row class can be given extra parameters
                 Key: SPARK-25072
                 URL: https://issues.apache.org/jira/browse/SPARK-25072
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.0
         Environment: {noformat}
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 3.4.5 (default, Dec 11 2017, 16:57:19)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/08/01 04:49:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/01 04:49:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/08/01 04:49:27 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 3.4.5 (default, Dec 11 2017 16:57:19)
SparkSession available as 'spark'.
{noformat}

{noformat}
CentOS release 6.9 (Final)
Linux sandbox-hdp.hortonworks.com 4.14.0-1.el7.elrepo.x86_64 #1 SMP Sun Nov 12 20:21:04 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
{noformat}
            Reporter: Jan-Willem van der Sijp


When a custom Row class is created in PySpark, its constructor can be given more parameters than the class has columns. These extra parameters affect the value of the Row (for example, in equality comparisons), but they do not appear in the {{repr}} or {{str}} output, making errors caused by these "invisible" values hard to debug. The hidden values can, however, still be accessed through integer-based indexing.

Some examples:

{code:python}
In [69]: RowClass = Row("column1", "column2")

In [70]: RowClass(1, 2) == RowClass(1, 2)
Out[70]: True

In [71]: RowClass(1, 2) == RowClass(1, 2, 3)
Out[71]: False

In [75]: RowClass(1, 2, 3)
Out[75]: Row(column1=1, column2=2)

In [76]: RowClass(1, 2)
Out[76]: Row(column1=1, column2=2)

In [77]: RowClass(1, 2, 3).asDict()
Out[77]: {'column1': 1, 'column2': 2}

In [78]: RowClass(1, 2, 3)[2]
Out[78]: 3

In [79]: repr(RowClass(1, 2, 3))
Out[79]: 'Row(column1=1, column2=2)'

In [80]: str(RowClass(1, 2, 3))
Out[80]: 'Row(column1=1, column2=2)'
{code}
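The behaviour above is consistent with a Row being a plain tuple whose {{repr}} and {{asDict}} merely zip the field names with the values, so any value past the last field name is silently dropped from the output while still taking part in tuple equality and indexing. The following pure-Python sketch is hypothetical, for illustration only (it is not the actual {{pyspark.sql.Row}} implementation; {{MiniRow}} and {{checked}} are made-up names), and also shows a caller-side guard that would surface the mistake instead of hiding it:

{code:python}
class MiniRow(tuple):
    # Hypothetical sketch of the suspected mechanism: the row is a
    # tuple of all values passed in, but only __fields__ names are
    # shown when printing, so extra values become invisible.
    def __new__(cls, fields, *values):
        row = tuple.__new__(cls, values)
        row.__fields__ = list(fields)
        return row

    def __repr__(self):
        # zip() stops at the shorter sequence, dropping extra values.
        return "Row(%s)" % ", ".join(
            "%s=%r" % (k, v) for k, v in zip(self.__fields__, self)
        )

    def asDict(self):
        return dict(zip(self.__fields__, self))


def checked(fields, *values):
    # Caller-side guard (sketch): fail fast instead of hiding values.
    if len(values) > len(fields):
        raise ValueError(
            "got %d values for %d fields" % (len(values), len(fields))
        )
    return MiniRow(fields, *values)


a = MiniRow(["column1", "column2"], 1, 2)
b = MiniRow(["column1", "column2"], 1, 2, 3)
print(repr(a))  # Row(column1=1, column2=2)
print(repr(b))  # Row(column1=1, column2=2) -- the extra 3 is hidden
print(a == b)   # False: tuple equality still sees the third value
print(b[2])     # 3
{code}

With a length check like the one in {{checked}} applied inside the Row constructor, the call {{RowClass(1, 2, 3)}} above would raise immediately rather than produce a row that prints identically to {{RowClass(1, 2)}} but compares unequal to it.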



