Posted to reviews@spark.apache.org by "bjornjorgensen (via GitHub)" <gi...@apache.org> on 2023/04/23 12:01:57 UTC
[GitHub] [spark] bjornjorgensen commented on pull request #40913: [SPARK-43239][PS] Remove `null_counts` from info()
bjornjorgensen commented on PR #40913:
URL: https://github.com/apache/spark/pull/40913#issuecomment-1519049541
To me it seems like we can just add `show_counts` to this function. We already have the max-rows limit to calculate on.
Or we can implement something like this:
```
from collections import Counter

from pyspark.sql.functions import col, count, when


def spark_info(df):
    # Print basic DataFrame information
    print(f"<class '{df.__class__.__module__}.{df.__class__.__name__}'>")
    print(f"Number of rows: {df.count()}")
    print(f"Number of columns: {len(df.columns)}")

    # Print column header for the detailed DataFrame information
    print("\nColumn" + " " * 110 + "Non-Null Count" + " " + "Dtype")
    print("-" * 6, " " * 108, "-" * 14, "-" * 5)

    # Calculate non-null counts for every column in a single aggregation.
    # Backticks keep column names that contain dots or spaces resolvable.
    non_null_counts = df.agg(
        *[
            count(when(col(f"`{c}`").isNotNull(), f"`{c}`")).alias(c)
            for c in df.columns
        ]
    ).collect()[0]

    # Initialize a counter to store data type counts
    dtype_counter = Counter()

    # Iterate through the schema fields and print detailed column information
    for field in df.schema.fields:
        non_null_count = non_null_counts[field.name]
        dtype = field.dataType.simpleString()
        print(f"{field.name:<90} {non_null_count:>30} non-null {dtype}")
        # Update the data type counter
        dtype_counter[dtype] += 1

    # Print data type summary (`n` avoids shadowing the imported `count`)
    dtypes_summary = ", ".join(f"{dtype}({n})" for dtype, n in dtype_counter.items())
    print(f"\ndtypes: {dtypes_summary}")
```
![image](https://user-images.githubusercontent.com/47577197/233838325-b1b7b5ef-b358-4c41-a20c-f841f3484d2c.png)
(...)
![image](https://user-images.githubusercontent.com/47577197/233838368-5599bfe9-2a05-44d6-b583-cd2bbb444127.png)
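The snippet above pads with fixed widths (90 and 30 characters), so very long column names could push the table out of alignment. As a rough sketch only, the widths could instead be derived from the data. The helper name `format_info_table` and the sample rows are hypothetical, and it assumes the non-null counts have already been collected into plain Python tuples:

```python
def format_info_table(rows):
    """Format (column_name, non_null_count, dtype) tuples as aligned lines.

    Hypothetical helper: the widths are computed from the data rather than
    hard-coded, so long column names do not break the alignment.
    """
    count_cells = [f"{n} non-null" for _, n, _ in rows]
    name_w = max([len("Column")] + [len(name) for name, _, _ in rows])
    count_w = max([len("Non-Null Count")] + [len(cell) for cell in count_cells])

    lines = [
        "Column".ljust(name_w) + "  " + "Non-Null Count".ljust(count_w) + "  Dtype",
        "-" * name_w + "  " + "-" * count_w + "  " + "-" * 5,
    ]
    for (name, _, dtype), cell in zip(rows, count_cells):
        lines.append(name.ljust(name_w) + "  " + cell.ljust(count_w) + "  " + dtype)
    return lines


# Made-up sample rows, standing in for values collected from a DataFrame
rows = [("id", 3, "bigint"), ("a_rather_long_column_name", 2, "string")]
lines = format_info_table(rows)
print("\n".join(lines))
```

Inside `spark_info`, the rows could be built from `df.schema.fields` and the collected counts, then printed in one go.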