Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/26 20:01:52 UTC

[GitHub] [spark] qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.

URL: https://github.com/apache/spark/pull/26118#issuecomment-546634920
 
 
   
    @zero323 Thanks for looking into this!
   
   > Personally I'd prefer to wait a moment and see where the discussion on SPARK-22232 goes. If the resolution is introduction of legacy mode, then the scope of this particular change could be conditioned on it and Python version.
   
    I agree that an update to `Row` is required and I'm happy to see the discussion going in the right direction. However, that will be part of Spark 3.x, and for the many users running Spark in production it will be a long wait until it arrives.
   
   Do you think it makes sense to apply this change for Spark 2.x?
   
   > If not I'd like to see some memory profiling data (especially memory - timings might be actually better for now, as we skip all the nasty `obj[n]`, but that's not very meaningful*) first.
   
    I can try to do that. Do you have an example of how to do proper timings with Spark?
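    
    For a first micro-level measurement I was thinking of something like the sketch below: just `timeit` on `StructType.toInternal` with a wide all-string schema, so only the per-field lookup cost is measured (column names and sizes are made up). Or did you have an end-to-end measurement in mind?
    
    ```
    # Rough micro-benchmark sketch: compare StructType.toInternal on a Row
    # created from kwargs (which goes through the obj[n] lookups on the current
    # 2.x code path) with a plain tuple (which goes through tuple(obj)).
    import timeit
    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, StringType

    n_cols = 200
    schema = StructType([StructField("c%03d" % i, StringType()) for i in range(n_cols)])
    row = Row(**{"c%03d" % i: "v" for i in range(n_cols)})
    tup = tuple(row)

    row_t = timeit.timeit(lambda: schema.toInternal(row), number=10000)
    tup_t = timeit.timeit(lambda: schema.toInternal(tup), number=10000)
    print("Row:   %.3fs" % row_t)
    print("tuple: %.3fs" % tup_t)
    ```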
   
   > It is also worth mentioning that SPARK-24915 is really an edge case. If user wants to provide schema then it is hard to justify using `Rows` (plain `tuple` or different flavors of `dict` are much better choice in such case), so making everyone worse (if I am right about performance impact) to support it doesn't make much sense.
   
    In my case it was about reproducing input for a test case: I just wanted to create a dataframe containing the same rows as the problematic dataframe. In the end I chose a different way to produce the data, but it feels strange not to be able to simply create `Row`s the way you see them in an existing dataframe.
   
   > * Is there any reason why we do this:
   > 
   > https://github.com/apache/spark/blob/2115bf61465b504bc21e37465cb34878039b5cb8/python/pyspark/sql/types.py#L615
   > 
   > instead of just `tuple(obj)`? That's huge performance bottleneck with wide schemas. Depending on the resolution of this one, that's something to fix, don't you think?
   
    This was a workaround introduced by #14469. `tuple(obj)` relies on the order of the fields, which for `Row` is alphabetical. If that order doesn't match the schema, the fields get mixed up.
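    
    To illustrate (a small standalone sketch, assuming the pre-3.0 behaviour where `Row(**kwargs)` sorts its fields; names are made up):
    
    ```
    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    # A Row created from kwargs sorts its fields alphabetically (pre-3.0):
    row = Row(zebra=1, apple="x")
    print(row.__fields__)   # ['apple', 'zebra'] -- not the call order
    print(tuple(row))       # ('x', 1)

    # ... so tuple(row) no longer lines up with a schema declared as (zebra, apple):
    schema = StructType([
        StructField("zebra", IntegerType()),
        StructField("apple", StringType()),
    ])
    print(schema.names)     # ['zebra', 'apple']
    ```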
   
    My change here just applies the same workaround to schemas with fields that need serialization.
   
    I'm not sure whether the proposed change has any negative effect. The code path is only taken in cases that previously either failed (as in [SPARK-24915](https://issues.apache.org/jira/browse/SPARK-24915)) or might have silently mixed up the `.toInternal()` calls of different types.
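    
    To make the "silently mixed up" case concrete, here is a standalone sketch against the current 2.x code, if I read it correctly (no SparkSession needed, `toInternal` is pure Python):
    
    ```
    import datetime
    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, DateType

    # Schema order is (start, end); Row(**kwargs) sorts alphabetically to (end, start).
    schema = StructType([
        StructField("start", DateType()),
        StructField("end", DateType()),
    ])
    row = Row(start=datetime.date(2019, 1, 1), end=datetime.date(2019, 12, 31))

    # DateType needs conversion, so the tuple/list branch of StructType.toInternal
    # zips the values positionally -- start and end should come back swapped,
    # without any error, because both fields convert cleanly:
    print(schema.fromInternal(schema.toInternal(row)))
    ```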
   
   Even with sub-optimal performance, this would only improve the situation for users.
   
   Would you prefer to replace
   
   https://github.com/qudade/spark/blob/a52de2e4b258e7fecad4143e00f01df4b096a513/python/pyspark/sql/types.py#L603
   
   ```
   return self.toInternal(obj.asDict())
   ```
   
   by
   
   ```
    return tuple(f.toInternal(obj[n]) if c else obj[n]
                 for n, f, c in zip(self.names, self.fields, self._needConversion))
   ```
   
    to remove one level of indirection?
