Posted to reviews@spark.apache.org by "bjornjorgensen (via GitHub)" <gi...@apache.org> on 2023/04/23 12:01:57 UTC

[GitHub] [spark] bjornjorgensen commented on pull request #40913: [SPARK-43239][PS] Remove `null_counts` from info()

bjornjorgensen commented on PR #40913:
URL: https://github.com/apache/spark/pull/40913#issuecomment-1519049541

   To me it seems like we can just add `show_counts` to this function. We already have this max-rows setting to base the calculation on.
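
One way to read the `show_counts` suggestion is to follow pandas, where `DataFrame.info(show_counts=None)` shows non-null counts by default only when the frame is below the configured row and column limits. A minimal sketch of that decision in plain Python follows. The helper name `resolve_show_counts` and its parameters are hypothetical and are not part of pandas or pyspark.pandas, and the limit of 100 used below is an arbitrary value for illustration.

```python
from typing import Optional


def resolve_show_counts(show_counts: Optional[bool], n_rows: int, max_info_rows: int) -> bool:
    """Decide whether info() should compute non-null counts (hypothetical helper)."""
    if show_counts is not None:
        # An explicit True/False from the caller always wins.
        return show_counts
    # Default: only pay for the counting job when the frame is below the row limit.
    return n_rows < max_info_rows


# Explicit request overrides the limit; the default depends on the row count.
print(resolve_show_counts(True, 1_000_000, 100))
print(resolve_show_counts(None, 10, 100))
print(resolve_show_counts(None, 1_000_000, 100))
```

With this shape, a caller who passes nothing gets counts only on small frames, while `show_counts=True` forces the extra aggregation regardless of size.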
   
   Or we could implement something like this:
   
   ```
   from collections import Counter
   from pyspark.sql.functions import col, count, when
   
   def spark_info(df):
       # Print basic DataFrame information
       print(f"<class '{df.__class__.__module__}.{df.__class__.__name__}'>")
       print(f"Number of rows: {df.count()}")
       print(f"Number of columns: {len(df.columns)}")
   
       # Print column header for the detailed DataFrame information
       print("\nColumn" + " " * 110 + "Non-Null Count" + " " + "Dtype")
       print("-" * 6, " " * 108, "-" * 14, "-" * 5)
   
       # Calculate non-null counts for each column
       non_null_counts = df.agg(*[count(when(col(f"`{c}`").isNotNull(), f"`{c}`")).alias(c) for c in df.columns]).collect()[0]
   
       # Initialize a counter to store data type counts
       dtype_counter = Counter()
   
       # Iterate through the schema fields and print detailed column information
       for field in df.schema.fields:
           non_null_count = non_null_counts[field.name]
           dtype = field.dataType.simpleString()
           print(f"{field.name:<90} {non_null_count:>30} non-null {dtype}")
   
           # Update the data type counter
           dtype_counter[dtype] += 1
   
       # Print data type summary
       dtypes_summary = ", ".join(f"{dtype}({n})" for dtype, n in dtype_counter.items())
       print(f"\ndtypes: {dtypes_summary}")
   ```
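
The fixed padding in the function above (90 and 30 characters, plus a 110-space header gap) could instead be derived from the data, so that the header and the rows always line up. The following is a minimal sketch in plain Python with no Spark dependency. The helper name `format_info_table` and its input shape, a list of `(name, non_null_count, dtype)` tuples, are assumptions made for illustration and are not part of the PR.

```python
from collections import Counter
from typing import List, Tuple


def format_info_table(columns: List[Tuple[str, int, str]]) -> str:
    """Render (name, non_null_count, dtype) tuples as an aligned text table."""
    name_header, count_header, dtype_header = "Column", "Non-Null Count", "Dtype"
    count_cells = [f"{n} non-null" for _, n, _ in columns]

    # Each column is as wide as its longest cell, including the header.
    name_width = max([len(name_header)] + [len(name) for name, _, _ in columns])
    count_width = max([len(count_header)] + [len(cell) for cell in count_cells])

    lines = [
        f"{name_header:<{name_width}}  {count_header:<{count_width}}  {dtype_header}",
        f"{'-' * name_width}  {'-' * count_width}  {'-' * len(dtype_header)}",
    ]
    for (name, _, dtype), cell in zip(columns, count_cells):
        lines.append(f"{name:<{name_width}}  {cell:<{count_width}}  {dtype}")

    # Summarize how many columns share each data type.
    dtype_counter = Counter(dtype for _, _, dtype in columns)
    summary = ", ".join(f"{dtype}({n})" for dtype, n in dtype_counter.items())
    lines.append(f"dtypes: {summary}")
    return "\n".join(lines)


print(format_info_table([("id", 3, "bigint"), ("name", 2, "string"), ("score", 3, "double")]))
```

In a Spark version, the tuples could be built from `df.schema.fields` and the collected non-null counts, leaving only the formatting to this helper.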
   
   ![image](https://user-images.githubusercontent.com/47577197/233838325-b1b7b5ef-b358-4c41-a20c-f841f3484d2c.png)
   (...)
   
   ![image](https://user-images.githubusercontent.com/47577197/233838368-5599bfe9-2a05-44d6-b583-cd2bbb444127.png)
   

