Posted to user@spark.apache.org by Oliver Ruebenacker <ol...@broadinstitute.org> on 2022/12/19 15:27:07 UTC

[PySpark] Getting the best row from each group

     Hello,

  How can I retain from each group only the row for which one value is the
maximum of the group? For example, imagine a DataFrame containing all major
cities in the world, with three columns: (1) City name (2) Country (3)
population. How would I get a DataFrame that only contains the largest city
in each country? Thanks!

     Best, Oliver

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Posted by Bjørn Jørgensen <bj...@gmail.com>.
https://github.com/apache/spark/pull/39134

On Tue, Dec 20, 2022 at 22:42, Oliver Ruebenacker <
oliverr@broadinstitute.org> wrote:

> Thank you for the suggestion. This would, however, involve converting my
> DataFrame to an RDD (and back later), which incurs additional cost.
>
> On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh <
> raghavendra.g@gmail.com> wrote:
>
>> You can groupBy(country) and use the mapPartitions method, in which you
>> iterate over all rows keeping two variables for maxPopulationSoFar and the
>> corresponding city, then return the city with the max population.
>> I think, as others suggested, it may be possible to use bucketing; it
>> would give a more SQL-ish way of doing it but would not be the best in
>> performance, as it needs to order/sort.
>> --
>> Raghavendra
>>
>>
>> On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
>> oliverr@broadinstitute.org> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>>      Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>

Re: [PySpark] Getting the best row from each group

Posted by Oliver Ruebenacker <ol...@broadinstitute.org>.
Thank you for the suggestion. This would, however, involve converting my
DataFrame to an RDD (and back later), which incurs additional cost.

On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh <ra...@gmail.com>
wrote:

> You can groupBy(country) and use the mapPartitions method, in which you
> iterate over all rows keeping two variables for maxPopulationSoFar and the
> corresponding city, then return the city with the max population.
> I think, as others suggested, it may be possible to use bucketing; it
> would give a more SQL-ish way of doing it but would not be the best in
> performance, as it needs to order/sort.
> --
> Raghavendra
>
>
> On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
> oliverr@broadinstitute.org> wrote:
>
>>
>>      Hello,
>>
>>   How can I retain from each group only the row for which one value is
>> the maximum of the group? For example, imagine a DataFrame containing all
>> major cities in the world, with three columns: (1) City name (2) Country
>> (3) population. How would I get a DataFrame that only contains the largest
>> city in each country? Thanks!
>>
>>      Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>> Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Posted by Raghavendra Ganesh <ra...@gmail.com>.
You can groupBy(country) and use the mapPartitions method, in which you
iterate over all rows keeping two variables for maxPopulationSoFar and the
corresponding city, then return the city with the max population.
I think, as others suggested, it may be possible to use bucketing; it
would give a more SQL-ish way of doing it but would not be the best in
performance, as it needs to order/sort.
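
A rough PySpark sketch of this per-group reduce, with hypothetical sample
data (note the DataFrame-to-RDD round trip, which is the cost mentioned
elsewhere in the thread):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Germany", "Berlin", 3520031),
     ("Germany", "Hamburg", 1787408),
     ("Turkey", "Istanbul", 14025646),
     ("Turkey", "Ankara", 4587558)],
    ["country", "city", "population"])

largest = (
    df.rdd
    # key by country, carry (population, city) so max() compares population first
    .map(lambda r: (r["country"], (r["population"], r["city"])))
    .reduceByKey(max)
    .map(lambda kv: Row(country=kv[0], city=kv[1][1], population=kv[1][0]))
    .toDF())
largest.show(truncate=False)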
--
Raghavendra


On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
oliverr@broadinstitute.org> wrote:

>
>      Hello,
>
>   How can I retain from each group only the row for which one value is the
> maximum of the group? For example, imagine a DataFrame containing all major
> cities in the world, with three columns: (1) City name (2) Country (3)
> population. How would I get a DataFrame that only contains the largest city
> in each country? Thanks!
>
>      Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>

Re: [PySpark] Getting the best row from each group

Posted by Oliver Ruebenacker <ol...@broadinstitute.org>.
Wow, thank you so much!

On Wed, Dec 21, 2022 at 10:27 AM Mich Talebzadeh <mi...@gmail.com>
wrote:

> OK let us try this
>
> 1) We have a csv file as below, called cities.csv
>
> country,city,population
> Germany,Berlin,3520031
> Germany,Hamburg,1787408
> Germany,Munich,1450381
> Turkey,Ankara,4587558
> Turkey,Istanbul,14025646
> Turkey,Izmir,2847691
> United States,Chicago IL,2670406
> United States,Los Angeles CA,4085014
> United States,New York City NY,8622357
>
> 2) Put this in HDFS as below
>
> hdfs dfs -put cities.csv /data/stg/test
>
> Read it into a DataFrame in PySpark as below
>
> csv_file = "hdfs://rhes75:9000/data/stg/test/cities.csv"
> # read the csv file from HDFS into a Spark DataFrame
> listing_df = (spark.read.format("com.databricks.spark.csv")
>               .option("inferSchema", "true")
>               .option("header", "true")
>               .load(csv_file))
> listing_df.printSchema()
> print(f"""\n Reading from csv file {csv_file}\n""")
> listing_df.show(100, False)
>
> 3) Create a Spark temp table from the DataFrame. I call it temp
>
>  print(f"""\n Reading from temp table temp created on listing_df\n""")
>  listing_df.createOrReplaceTempView("temp")
>
> 4) Use standard SQL with windowing to get the result out
>
>      sqltext = """
>     SELECT country, city
>     FROM
>     (
>     SELECT
>               country AS country
>             ,  city AS city
>             , DENSE_RANK() OVER (PARTITION BY country ORDER BY population)
> AS RANK
>             , max(population) OVER (PARTITION by country ORDER BY country,
> city) AS population
>     FROM temp
>     GROUP BY country, city, population
>     )
>     WHERE RANK in (3)
>     ORDER by population DESC
>     """
>     spark.sql(sqltext).show()
>
> 5) Let us test it
>
> root
>  |-- country: string (nullable = true)
>  |-- city: string (nullable = true)
>  |-- population: double (nullable = true)
>
>
>  Reading from csv file hdfs://rhes75:9000/data/stg/test/cities.csv
>
> +-------------+----------------+-----------+
> |country      |city            |population |
> +-------------+----------------+-----------+
> |Germany      |Berlin          |3520031.0  |
> |Germany      |Hamburg         |1787408.0  |
> |Germany      |Munich          |1450381.0  |
> |Turkey       |Ankara          |4587558.0  |
> |Turkey       |Istanbul        |1.4025646E7|
> |Turkey       |Izmir           |2847691.0  |
> |United States|Chicago IL      |2670406.0  |
> |United States|Los Angeles CA  |4085014.0  |
> |United States|New York City NY|8622357.0  |
> +-------------+----------------+-----------+
>
>
>  Reading from temp table temp created on listing_df
>
> +-------------+----------------+
> |      country|            city|
> +-------------+----------------+
> |       Turkey|        Istanbul|
> |United States|New York City NY|
> |      Germany|          Berlin|
> +-------------+----------------+
>
> The code is attached.
>
> I am sure it can be improved.
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Dec 2022 at 20:35, Oliver Ruebenacker <
> oliverr@broadinstitute.org> wrote:
>
>>
>>      Hello,
>>
>>   Let's say the data is like this:
>>
>> +---------------+-------------------+------------+
>> | country       | city              | population |
>> +---------------+-------------------+------------+
>> | Germany       | Berlin            | 3520031    |
>> | Germany       | Hamburg           | 1787408    |
>> | Germany       | Munich            | 1450381    |
>> | Turkey        | Ankara            | 4587558    |
>> | Turkey        | Istanbul          | 14025646   |
>> | Turkey        | Izmir             | 2847691    |
>> | United States | Chicago, IL       | 2670406    |
>> | United States | Los Angeles, CA   | 4085014    |
>> | United States | New York City, NY | 8622357    |
>> +---------------+-------------------+------------+
>>
>> I want to get the largest city in each country:
>>
>> +---------------+-------------------+
>> | country       | city              |
>> +---------------+-------------------+
>> | Germany       | Berlin            |
>> | Turkey        | Istanbul          |
>> | United States | New York City, NY |
>> +---------------+-------------------+
>>
>> Thanks!
>>
>>      Best, Oliver
>>
>> On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Windowing functions were invented to avoid doing lengthy group by etc.
>>>
>>> As usual there is a lot of heat but little light
>>>
>>> Please provide:
>>>
>>>
>>>    1. Sample input. I gather this data is stored in some csv, tsv,
>>>    table format
>>>    2. The output that you would like to see.
>>>
>>>
>>> Have a look at this article of mine  Technical Analysis of the latest
>>> UK House Price Index, Deploying Modern tools
>>> <https://www.linkedin.com/pulse/technical-analysis-latest-uk-house-price-index-modern-mich/>
>>>
>>>
>>> The PySpark code and windowing functions are here
>>> <https://github.com/michTalebzadeh/DataScience/blob/datascience/src/workoutYearlyAveragePricesAll.py>
>>>
>>>
>>> HTH
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker <
>>> oliverr@broadinstitute.org> wrote:
>>>
>>>>
>>>>      Hello,
>>>>
>>>>   Thank you for the response!
>>>>
>>>>   I can think of two ways to get the largest city by country, but both
>>>> seem to be inefficient:
>>>>
>>>>   (1) I could group by country, sort each group by population, add the
>>>> row number within each group, and then retain only cities with a row number
>>>> equal to 1. But it seems wasteful to sort everything when I only want the
>>>> largest of each country
>>>>
>>>>   (2) I could group by country, get the maximum city population for
>>>> each country, join that with the original data frame, and then retain only
>>>> cities with population equal to the maximum population in the country. But
>>>> that seems also expensive because I need to join.
>>>>
>>>>   Am I missing something?
>>>>
>>>>   Thanks!
>>>>
>>>>      Best, Oliver
>>>>
>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> In spark you can use windowing function
>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>>> achieve this
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>>> oliverr@broadinstitute.org> wrote:
>>>>>
>>>>>>
>>>>>>      Hello,
>>>>>>
>>>>>>   How can I retain from each group only the row for which one value
>>>>>> is the maximum of the group? For example, imagine a DataFrame containing
>>>>>> all major cities in the world, with three columns: (1) City name (2)
>>>>>> Country (3) population. How would I get a DataFrame that only contains the
>>>>>> largest city in each country? Thanks!
>>>>>>
>>>>>>      Best, Oliver
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>> Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK let us try this

1) We have a csv file as below, called cities.csv

country,city,population
Germany,Berlin,3520031
Germany,Hamburg,1787408
Germany,Munich,1450381
Turkey,Ankara,4587558
Turkey,Istanbul,14025646
Turkey,Izmir,2847691
United States,Chicago IL,2670406
United States,Los Angeles CA,4085014
United States,New York City NY,8622357

2) Put this in HDFS as below

hdfs dfs -put cities.csv /data/stg/test

Read it into a DataFrame in PySpark as below

csv_file = "hdfs://rhes75:9000/data/stg/test/cities.csv"
# read the csv file from HDFS into a Spark DataFrame
listing_df = (spark.read.format("com.databricks.spark.csv")
              .option("inferSchema", "true")
              .option("header", "true")
              .load(csv_file))
listing_df.printSchema()
print(f"""\n Reading from csv file {csv_file}\n""")
listing_df.show(100, False)

3) Create a Spark temp table from the DataFrame. I call it temp

 print(f"""\n Reading from temp table temp created on listing_df\n""")
 listing_df.createOrReplaceTempView("temp")

4) Use standard SQL with windowing to get the result out

sqltext = """
SELECT country, city
FROM
(
    SELECT
          country AS country
        , city AS city
        , DENSE_RANK() OVER (PARTITION BY country ORDER BY population) AS RANK
        , max(population) OVER (PARTITION BY country ORDER BY country, city) AS population
    FROM temp
    GROUP BY country, city, population
)
WHERE RANK in (3)
ORDER BY population DESC
"""
spark.sql(sqltext).show()

5) Let us test it

root
 |-- country: string (nullable = true)
 |-- city: string (nullable = true)
 |-- population: double (nullable = true)


 Reading from csv file hdfs://rhes75:9000/data/stg/test/cities.csv

+-------------+----------------+-----------+
|country      |city            |population |
+-------------+----------------+-----------+
|Germany      |Berlin          |3520031.0  |
|Germany      |Hamburg         |1787408.0  |
|Germany      |Munich          |1450381.0  |
|Turkey       |Ankara          |4587558.0  |
|Turkey       |Istanbul        |1.4025646E7|
|Turkey       |Izmir           |2847691.0  |
|United States|Chicago IL      |2670406.0  |
|United States|Los Angeles CA  |4085014.0  |
|United States|New York City NY|8622357.0  |
+-------------+----------------+-----------+


 Reading from temp table temp created on listing_df

+-------------+----------------+
|      country|            city|
+-------------+----------------+
|       Turkey|        Istanbul|
|United States|New York City NY|
|      Germany|          Berlin|
+-------------+----------------+

The code is attached.

I am sure it can be improved.
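
For comparison, a window-function sketch that does the same job through the
DataFrame API (assuming the listing_df defined above); it ranks each
country's cities by population in descending order and keeps rank 1:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("country").orderBy(F.col("population").desc())

largest_df = (listing_df
              .withColumn("rnk", F.row_number().over(w))   # 1 = most populous city per country
              .where(F.col("rnk") == 1)
              .select("country", "city", "population"))
largest_df.show(truncate=False)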



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 20 Dec 2022 at 20:35, Oliver Ruebenacker <ol...@broadinstitute.org>
wrote:

>
>      Hello,
>
>   Let's say the data is like this:
>
> +---------------+-------------------+------------+
> | country       | city              | population |
> +---------------+-------------------+------------+
> | Germany       | Berlin            | 3520031    |
> | Germany       | Hamburg           | 1787408    |
> | Germany       | Munich            | 1450381    |
> | Turkey        | Ankara            | 4587558    |
> | Turkey        | Istanbul          | 14025646   |
> | Turkey        | Izmir             | 2847691    |
> | United States | Chicago, IL       | 2670406    |
> | United States | Los Angeles, CA   | 4085014    |
> | United States | New York City, NY | 8622357    |
> +---------------+-------------------+------------+
>
> I want to get the largest city in each country:
>
> +---------------+-------------------+
> | country       | city              |
> +---------------+-------------------+
> | Germany       | Berlin            |
> | Turkey        | Istanbul          |
> | United States | New York City, NY |
> +---------------+-------------------+
>
> Thanks!
>
>      Best, Oliver
>
> On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Windowing functions were invented to avoid doing lengthy group by etc.
>>
>> As usual there is a lot of heat but little light
>>
>> Please provide:
>>
>>
>>    1. Sample input. I gather this data is stored in some csv, tsv, table
>>    format
>>    2. The output that you would like to see.
>>
>>
>> Have a look at this article of mine  Technical Analysis of the latest UK
>> House Price Index, Deploying Modern tools
>> <https://www.linkedin.com/pulse/technical-analysis-latest-uk-house-price-index-modern-mich/>
>>
>>
>> The PySpark code and windowing functions are here
>> <https://github.com/michTalebzadeh/DataScience/blob/datascience/src/workoutYearlyAveragePricesAll.py>
>>
>>
>> HTH
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker <
>> oliverr@broadinstitute.org> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   Thank you for the response!
>>>
>>>   I can think of two ways to get the largest city by country, but both
>>> seem to be inefficient:
>>>
>>>   (1) I could group by country, sort each group by population, add the
>>> row number within each group, and then retain only cities with a row number
>>> equal to 1. But it seems wasteful to sort everything when I only want the
>>> largest of each country
>>>
>>>   (2) I could group by country, get the maximum city population for each
>>> country, join that with the original data frame, and then retain only
>>> cities with population equal to the maximum population in the country. But
>>> that seems also expensive because I need to join.
>>>
>>>   Am I missing something?
>>>
>>>   Thanks!
>>>
>>>      Best, Oliver
>>>
>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> In spark you can use windowing function
>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>> achieve this
>>>>
>>>> HTH
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>> oliverr@broadinstitute.org> wrote:
>>>>
>>>>>
>>>>>      Hello,
>>>>>
>>>>>   How can I retain from each group only the row for which one value is
>>>>> the maximum of the group? For example, imagine a DataFrame containing all
>>>>> major cities in the world, with three columns: (1) City name (2) Country
>>>>> (3) population. How would I get a DataFrame that only contains the largest
>>>>> city in each country? Thanks!
>>>>>
>>>>>      Best, Oliver
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>> <http://www.broadinstitute.org/>
>>>>>
>>>>
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>

Re: [PySpark] Getting the best row from each group

Posted by Artemis User <ar...@dtechspace.com>.
Try this one:  "select country, city, max(population) from your_table 
group by country"

Please note this returns a table of three columns, instead of two. This 
is a standard SQL query, and supported by Spark as well.
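
A group-by sketch that also ties the city name to the per-country maximum
(assuming Spark 3.0 or later, where the max_by aggregate is available, an
active SparkSession named spark, and a temp view your_table with columns
country, city and population):

spark.sql("""
    SELECT country,
           max_by(city, population) AS city,   -- city on the row with the max population
           max(population)          AS population
    FROM your_table
    GROUP BY country
""").show(truncate=False)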

On 12/20/22 3:35 PM, Oliver Ruebenacker wrote:
>
>      Hello,
>
>   Let's say the data is like this:
>
> +---------------+-------------------+------------+
> | country       | city              | population |
> +---------------+-------------------+------------+
> | Germany       | Berlin            | 3520031    |
> | Germany       | Hamburg           | 1787408    |
> | Germany       | Munich            | 1450381    |
> | Turkey        | Ankara            | 4587558    |
> | Turkey        | Istanbul          | 14025646   |
> | Turkey        | Izmir             | 2847691    |
> | United States | Chicago, IL       | 2670406    |
> | United States | Los Angeles, CA   | 4085014    |
> | United States | New York City, NY | 8622357    |
> +---------------+-------------------+------------+
>
> I want to get the largest city in each country:
>
> +---------------+-------------------+
> | country       | city              |
> +---------------+-------------------+
> | Germany       | Berlin            |
> | Turkey        | Istanbul          |
> | United States | New York City, NY |
> +---------------+-------------------+
>
> Thanks!
>
>      Best, Oliver
>
> On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh 
> <mi...@gmail.com> wrote:
>
>     Hi,
>
>     Windowing functions were invented to avoid doing lengthy group by etc.
>
>     As usual there is a lot of heat but little light
>
>     Please provide:
>
>      1. Sample input. I gather this data is stored in some csv, tsv,
>         table format
>      2. The output that you would like to see.
>
>
>     Have a look at this article of mine Technical Analysis of the
>     latest UK House Price Index, Deploying Modern tools
>     <https://www.linkedin.com/pulse/technical-analysis-latest-uk-house-price-index-modern-mich/>
>
>
>     The PySpark code and windowing functions are here
>     <https://github.com/michTalebzadeh/DataScience/blob/datascience/src/workoutYearlyAveragePricesAll.py>
>
>
>     HTH
>
>
>     view my Linkedin profile
>     <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>     https://en.everybodywiki.com/Mich_Talebzadeh
>
>     *Disclaimer:* Use it at your own risk. Any and all responsibility
>     for any loss, damage or destruction of data or any other property
>     which may arise from relying on this email's technical content is
>     explicitly disclaimed. The author will in no case be liable for
>     any monetary damages arising from such loss, damage or destruction.
>
>
>
>     On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker
>     <ol...@broadinstitute.org> wrote:
>
>
>              Hello,
>
>           Thank you for the response!
>
>           I can think of two ways to get the largest city by country,
>         but both seem to be inefficient:
>
>           (1) I could group by country, sort each group by population,
>         add the row number within each group, and then retain only
>         cities with a row number equal to 1. But it seems wasteful to
>         sort everything when I only want the largest of each country
>
>           (2) I could group by country, get the maximum city
>         population for each country, join that with the original data
>         frame, and then retain only cities with population equal to
>         the maximum population in the country. But that seems also
>         expensive because I need to join.
>
>           Am I missing something?
>
>           Thanks!
>
>              Best, Oliver
>
>         On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh
>         <mi...@gmail.com> wrote:
>
>             In spark you can use windowing function
>             <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>             achieve this
>
>             HTH
>
>
>             view my Linkedin profile
>             <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>             https://en.everybodywiki.com/Mich_Talebzadeh
>
>             *Disclaimer:* Use it at your own risk. Any and all
>             responsibility for any loss, damage or destruction of data
>             or any other property which may arise from relying on this
>             email's technical content is explicitly disclaimed. The
>             author will in no case be liable for any monetary damages
>             arising from such loss, damage or destruction.
>
>
>
>             On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker
>             <ol...@broadinstitute.org> wrote:
>
>
>                      Hello,
>
>                   How can I retain from each group only the row for
>                 which one value is the maximum of the group? For
>                 example, imagine a DataFrame containing all major
>                 cities in the world, with three columns: (1) City name
>                 (2) Country (3) population. How would I get a
>                 DataFrame that only contains the largest city in each
>                 country? Thanks!
>
>                      Best, Oliver
>
>                 -- 
>                 Oliver Ruebenacker, Ph.D. (he)
>                 Senior Software Engineer, Knowledge Portal Network
>                 <http://kp4cd.org/>, Flannick Lab
>                 <http://www.flannicklab.org/>, Broad Institute
>                 <http://www.broadinstitute.org/>
>
>
>
>         -- 
>         Oliver Ruebenacker, Ph.D. (he)
>         Senior Software Engineer, Knowledge Portal Network
>         <http://kp4cd.org/>, Flannick Lab
>         <http://www.flannicklab.org/>, Broad Institute
>         <http://www.broadinstitute.org/>
>
>
>
> -- 
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network 
> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad 
> Institute <http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Posted by Oliver Ruebenacker <ol...@broadinstitute.org>.
     Hello,

  Let's say the data is like this:

+---------------+-------------------+------------+
| country       | city              | population |
+---------------+-------------------+------------+
| Germany       | Berlin            | 3520031    |
| Germany       | Hamburg           | 1787408    |
| Germany       | Munich            | 1450381    |
| Turkey        | Ankara            | 4587558    |
| Turkey        | Istanbul          | 14025646   |
| Turkey        | Izmir             | 2847691    |
| United States | Chicago, IL       | 2670406    |
| United States | Los Angeles, CA   | 4085014    |
| United States | New York City, NY | 8622357    |
+---------------+-------------------+------------+

I want to get the largest city in each country:

+---------------+-------------------+
| country       | city              |
+---------------+-------------------+
| Germany       | Berlin            |
| Turkey        | Istanbul          |
| United States | New York City, NY |
+---------------+-------------------+

Thanks!

     Best, Oliver

On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> Windowing functions were invented to avoid doing lengthy group by etc.
>
> As usual there is a lot of heat but little light
>
> Please provide:
>
>
>    1. Sample input. I gather this data is stored in some csv, tsv, table
>    format
>    2. The output that you would like to see.
>
>
> Have a look at this article of mine  Technical Analysis of the latest UK
> House Price Index, Deploying Modern tools
> <https://www.linkedin.com/pulse/technical-analysis-latest-uk-house-price-index-modern-mich/>
>
>
> The PySpark code and windowing functions are here
> <https://github.com/michTalebzadeh/DataScience/blob/datascience/src/workoutYearlyAveragePricesAll.py>
>
>
> HTH
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker <
> oliverr@broadinstitute.org> wrote:
>
>>
>>      Hello,
>>
>>   Thank you for the response!
>>
>>   I can think of two ways to get the largest city by country, but both
>> seem to be inefficient:
>>
>>   (1) I could group by country, sort each group by population, add the
>> row number within each group, and then retain only cities with a row number
>> equal to 1. But it seems wasteful to sort everything when I only want the
>> largest of each country
>>
>>   (2) I could group by country, get the maximum city population for each
>> country, join that with the original data frame, and then retain only
>> cities with population equal to the maximum population in the country. But
>> that seems also expensive because I need to join.
>>
>>   Am I missing something?
>>
>>   Thanks!
>>
>>      Best, Oliver
>>
>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> In spark you can use windowing function
>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>> achieve this
>>>
>>> HTH
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>> oliverr@broadinstitute.org> wrote:
>>>
>>>>
>>>>      Hello,
>>>>
>>>>   How can I retain from each group only the row for which one value is
>>>> the maximum of the group? For example, imagine a DataFrame containing all
>>>> major cities in the world, with three columns: (1) City name (2) Country
>>>> (3) population. How would I get a DataFrame that only contains the largest
>>>> city in each country? Thanks!
>>>>
>>>>      Best, Oliver
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>> Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

Windowing functions were invented to avoid doing lengthy group by etc.

As usual there is a lot of heat but little light

Please provide:


   1. Sample input. I gather this data is stored in some csv, tsv, table
   format
   2. The output that you would like to see.


Have a look at this article of mine  Technical Analysis of the latest UK
House Price Index, Deploying Modern tools
<https://www.linkedin.com/pulse/technical-analysis-latest-uk-house-price-index-modern-mich/>


The PySpark code and windowing functions are here
<https://github.com/michTalebzadeh/DataScience/blob/datascience/src/workoutYearlyAveragePricesAll.py>


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker <ol...@broadinstitute.org>
wrote:

>
>      Hello,
>
>   Thank you for the response!
>
>   I can think of two ways to get the largest city by country, but both
> seem to be inefficient:
>
>   (1) I could group by country, sort each group by population, add the row
> number within each group, and then retain only cities with a row number
> equal to 1. But it seems wasteful to sort everything when I only want the
> largest of each country
>
>   (2) I could group by country, get the maximum city population for each
> country, join that with the original data frame, and then retain only
> cities with population equal to the maximum population in the country. But
> that seems also expensive because I need to join.
>
>   Am I missing something?
>
>   Thanks!
>
>      Best, Oliver
>
> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> In spark you can use windowing function
>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>> achieve this
>>
>> HTH
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>> oliverr@broadinstitute.org> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>>      Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>

Re: [PySpark] Getting the best row from each group

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Post an example DataFrame and the result you would like to get.

On Mon, Dec 19, 2022 at 20:36, Oliver Ruebenacker <
oliverr@broadinstitute.org> wrote:

> Thank you, that is an interesting idea. Instead of finding the maximum
> population, we are finding the maximum (population, city name) tuple.
>
> On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
>> We have pandas API on spark
>> <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html>
>> which is very good.
>>
>> from pyspark import pandas as ps
>>
>> You can use pdf = df.pandas_api()
>> Where df is your pyspark dataframe.
>>
>>
>> [image: image.png]
>>
>> Does this help you?
>>
>> df.groupby(['Country'])[['Population', 'City']].max()
>>
>> On Mon, Dec 19, 2022 at 18:22, Patrick Tucci <patrick.tucci@gmail.com
>> >:
>>
>>> Window functions don't work like traditional GROUP BYs. They allow you
>>> to partition data and pull any relevant column, whether it's used in the
>>> partition or not.
>>>
>>> I'm not sure what the syntax is for PySpark, but the standard SQL would
>>> be something like this:
>>>
>>> WITH InputData AS
>>> (
>>>   SELECT 'USA' Country, 'New York' City, 9000000 Population
>>>   UNION
>>>   SELECT 'USA' Country, 'Miami', 6200000 Population
>>>   UNION
>>>   SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population
>>>   UNION
>>>   SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population
>>> )
>>>
>>>  SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population
>>> DESC) PopulationRank
>>>  FROM InputData;
>>>
>>> Results would be something like this:
>>>
>>> Country    City       Population     PopulationRank
>>> Ukraine    Kyiv       3000000        1
>>> Ukraine    Kharkiv    1400000        2
>>> USA        New York   9000000        1
>>> USA        Miami      6200000        2
>>>
>>> Which you could further filter in another CTE or subquery where
>>> PopulationRank = 1.
>>>
>>> As I mentioned, I'm not sure how this translates into PySpark, but
>>> that's the general concept in SQL.
>>>
>>> On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker <
>>> oliverr@broadinstitute.org> wrote:
>>>
>>>> If we only wanted to know the biggest population, max function would
>>>> suffice. The problem is I also want the name of the city with the biggest
>>>> population.
>>>>
>>>> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> As Mich says, isn't this just max by population partitioned by country
>>>>> in a window function?
>>>>>
>>>>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <
>>>>> oliverr@broadinstitute.org> wrote:
>>>>>
>>>>>>
>>>>>>      Hello,
>>>>>>
>>>>>>   Thank you for the response!
>>>>>>
>>>>>>   I can think of two ways to get the largest city by country, but
>>>>>> both seem to be inefficient:
>>>>>>
>>>>>>   (1) I could group by country, sort each group by population, add
>>>>>> the row number within each group, and then retain only cities with a row
>>>>>> number equal to 1. But it seems wasteful to sort everything when I only
>>>>>> want the largest of each country
>>>>>>
>>>>>>   (2) I could group by country, get the maximum city population for
>>>>>> each country, join that with the original data frame, and then retain only
>>>>>> cities with population equal to the maximum population in the country. But
>>>>>> that seems also expensive because I need to join.
>>>>>>
>>>>>>   Am I missing something?
>>>>>>
>>>>>>   Thanks!
>>>>>>
>>>>>>      Best, Oliver
>>>>>>
>>>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>> In spark you can use windowing function
>>>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>>>>> achieve this
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>>>>> oliverr@broadinstitute.org> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>      Hello,
>>>>>>>>
>>>>>>>>   How can I retain from each group only the row for which one value
>>>>>>>> is the maximum of the group? For example, imagine a DataFrame containing
>>>>>>>> all major cities in the world, with three columns: (1) City name (2)
>>>>>>>> Country (3) population. How would I get a DataFrame that only contains the
>>>>>>>> largest city in each country? Thanks!
>>>>>>>>
>>>>>>>>      Best, Oliver
>>>>>>>>
>>>>>>>> --
>>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: [PySpark] Getting the best row from each group

Posted by Oliver Ruebenacker <ol...@broadinstitute.org>.
Thank you, that is an interesting idea. Instead of finding the maximum
population, we are finding the maximum (population, city name) tuple.
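
A minimal DataFrame-API sketch of that tuple idea, assuming a DataFrame df
with columns country, city and population; structs compare field by field,
so the maximum struct belongs to the most populous city:

from pyspark.sql import functions as F

largest = (df.groupBy("country")
             .agg(F.max(F.struct("population", "city")).alias("top"))  # argmax via struct ordering
             .select("country",
                     F.col("top.city").alias("city"),
                     F.col("top.population").alias("population")))
largest.show(truncate=False)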

On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen <bj...@gmail.com>
wrote:

> We have pandas API on spark
> <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html>
> which is very good.
>
> from pyspark import pandas as ps
>
> You can use pdf = df.pandas_api()
> Where df is your pyspark dataframe.
>
>
> [image: image.png]
>
> Does this help you?
>
> df.groupby(['Country'])[['Population', 'City']].max()
>
> On Mon, Dec 19, 2022 at 18:22, Patrick Tucci <patrick.tucci@gmail.com
> >:
>
>> Window functions don't work like traditional GROUP BYs. They allow you to
>> partition data and pull any relevant column, whether it's used in the
>> partition or not.
>>
>> I'm not sure what the syntax is for PySpark, but the standard SQL would
>> be something like this:
>>
>> WITH InputData AS
>> (
>>   SELECT 'USA' Country, 'New York' City, 9000000 Population
>>   UNION
>>   SELECT 'USA' Country, 'Miami', 6200000 Population
>>   UNION
>>   SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population
>>   UNION
>>   SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population
>> )
>>
>>  SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population
>> DESC) PopulationRank
>>  FROM InputData;
>>
>> Results would be something like this:
>>
>> Country    City       Population     PopulationRank
>> Ukraine    Kyiv       3000000        1
>> Ukraine    Kharkiv    1400000        2
>> USA        New York   9000000        1
>> USA        Miami      6200000        2
>>
>> Which you could further filter in another CTE or subquery where
>> PopulationRank = 1.
>>
>> As I mentioned, I'm not sure how this translates into PySpark, but that's
>> the general concept in SQL.
>>
>> On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker <
>> oliverr@broadinstitute.org> wrote:
>>
>>> If we only wanted to know the biggest population, max function would
>>> suffice. The problem is I also want the name of the city with the biggest
>>> population.
>>>
>>> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> As Mich says, isn't this just max by population partitioned by country
>>>> in a window function?
>>>>
>>>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <
>>>> oliverr@broadinstitute.org> wrote:
>>>>
>>>>>
>>>>>      Hello,
>>>>>
>>>>>   Thank you for the response!
>>>>>
>>>>>   I can think of two ways to get the largest city by country, but both
>>>>> seem to be inefficient:
>>>>>
>>>>>   (1) I could group by country, sort each group by population, add the
>>>>> row number within each group, and then retain only cities with a row number
>>>>> equal to 1. But it seems wasteful to sort everything when I only want the
>>>>> largest of each country
>>>>>
>>>>>   (2) I could group by country, get the maximum city population for
>>>>> each country, join that with the original data frame, and then retain only
>>>>> cities with population equal to the maximum population in the country. But
>>>>> that seems also expensive because I need to join.
>>>>>
>>>>>   Am I missing something?
>>>>>
>>>>>   Thanks!
>>>>>
>>>>>      Best, Oliver
>>>>>
>>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> In spark you can use windowing function
>>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>>>> achieve this
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>>>> oliverr@broadinstitute.org> wrote:
>>>>>>
>>>>>>>
>>>>>>>      Hello,
>>>>>>>
>>>>>>>   How can I retain from each group only the row for which one value
>>>>>>> is the maximum of the group? For example, imagine a DataFrame containing
>>>>>>> all major cities in the world, with three columns: (1) City name (2)
>>>>>>> Country (3) population. How would I get a DataFrame that only contains the
>>>>>>> largest city in each country? Thanks!
>>>>>>>
>>>>>>>      Best, Oliver
>>>>>>>
>>>>>>> --
>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>> <http://www.broadinstitute.org/>
>>>>>
>>>>
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Posted by Bjørn Jørgensen <bj...@gmail.com>.
We have the pandas API on Spark
<https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html>
which is very good.

from pyspark import pandas as ps

You can use pdf = df.pandas_api(),
where df is your PySpark DataFrame.


[image: image.png]

Does this help you?

df.groupby(['Country'])[['Population', 'City']].max()
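
To keep the city name that belongs to the maximum population, one
pandas-API sketch (assuming the pdf created above, with columns Country,
City and Population) is to join the per-country maximum back onto the
frame:

max_pop = pdf.groupby("Country")["Population"].max().reset_index()   # per-country maximum
largest = pdf.merge(max_pop, on=["Country", "Population"])           # rows matching that maximum
largest[["Country", "City", "Population"]].head()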

On Mon, Dec 19, 2022 at 18:22, Patrick Tucci <pa...@gmail.com> wrote:

> Window functions don't work like traditional GROUP BYs. They allow you to
> partition data and pull any relevant column, whether it's used in the
> partition or not.
>
> I'm not sure what the syntax is for PySpark, but the standard SQL would be
> something like this:
>
> WITH InputData AS
> (
>   SELECT 'USA' Country, 'New York' City, 9000000 Population
>   UNION
>   SELECT 'USA' Country, 'Miami', 6200000 Population
>   UNION
>   SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population
>   UNION
>   SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population
> )
>
>  SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population
> DESC) PopulationRank
>  FROM InputData;
>
> Results would be something like this:
>
> Country    City       Population     PopulationRank
> Ukraine    Kyiv       3000000        1
> Ukraine    Kharkiv    1400000        2
> USA        New York   9000000        1
> USA        Miami      6200000        2
>
> Which you could further filter in another CTE or subquery where
> PopulationRank = 1.
>
> As I mentioned, I'm not sure how this translates into PySpark, but that's
> the general concept in SQL.
>
> On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker <
> oliverr@broadinstitute.org> wrote:
>
>> If we only wanted to know the biggest population, max function would
>> suffice. The problem is I also want the name of the city with the biggest
>> population.
>>
>> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sr...@gmail.com> wrote:
>>
>>> As Mich says, isn't this just max by population partitioned by country
>>> in a window function?
>>>
>>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <
>>> oliverr@broadinstitute.org> wrote:
>>>
>>>>
>>>>      Hello,
>>>>
>>>>   Thank you for the response!
>>>>
>>>>   I can think of two ways to get the largest city by country, but both
>>>> seem to be inefficient:
>>>>
>>>>   (1) I could group by country, sort each group by population, add the
>>>> row number within each group, and then retain only cities with a row number
>>>> equal to 1. But it seems wasteful to sort everything when I only want the
>>>> largest of each country
>>>>
>>>>   (2) I could group by country, get the maximum city population for
>>>> each country, join that with the original data frame, and then retain only
>>>> cities with population equal to the maximum population in the country. But
>>>> that seems also expensive because I need to join.
>>>>
>>>>   Am I missing something?
>>>>
>>>>   Thanks!
>>>>
>>>>      Best, Oliver
>>>>
>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> In spark you can use windowing function
>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>>> achieve this
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>>> oliverr@broadinstitute.org> wrote:
>>>>>
>>>>>>
>>>>>>      Hello,
>>>>>>
>>>>>>   How can I retain from each group only the row for which one value
>>>>>> is the maximum of the group? For example, imagine a DataFrame containing
>>>>>> all major cities in the world, with three columns: (1) City name (2)
>>>>>> Country (3) population. How would I get a DataFrame that only contains the
>>>>>> largest city in each country? Thanks!
>>>>>>
>>>>>>      Best, Oliver
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>> Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: [PySpark] Getting the best row from each group

Posted by Patrick Tucci <pa...@gmail.com>.
Window functions don't work like traditional GROUP BYs. They allow you to
partition data and pull any relevant column, whether it's used in the
partition or not.

I'm not sure what the syntax is for PySpark, but the standard SQL would be
something like this:

WITH InputData AS
(
  SELECT 'USA' Country, 'New York' City, 9000000 Population
  UNION
  SELECT 'USA' Country, 'Miami', 6200000 Population
  UNION
  SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population
  UNION
  SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population
)

 SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population DESC)
PopulationRank
 FROM InputData;

Results would be something like this:

Country    City       Population     PopulationRank
Ukraine    Kyiv       3000000        1
Ukraine    Kharkiv    1400000        2
USA        New York   9000000        1
USA        Miami      6200000        2

Which you could further filter in another CTE or subquery where
PopulationRank = 1.

As I mentioned, I'm not sure how this translates into PySpark, but that's
the general concept in SQL.
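
For what it's worth, the same SQL runs unchanged through spark.sql in
PySpark; a rough sketch, assuming an active SparkSession named spark and
filtering to PopulationRank = 1 as described above:

ranked = spark.sql("""
WITH InputData AS
(
  SELECT 'USA' Country, 'New York' City, 9000000 Population
  UNION
  SELECT 'USA', 'Miami', 6200000
  UNION
  SELECT 'Ukraine', 'Kyiv', 3000000
  UNION
  SELECT 'Ukraine', 'Kharkiv', 1400000
)
SELECT Country, City, Population
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Population DESC) AS PopulationRank
  FROM InputData
) t
WHERE PopulationRank = 1
""")
ranked.show()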

On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker <
oliverr@broadinstitute.org> wrote:

> If we only wanted to know the biggest population, max function would
> suffice. The problem is I also want the name of the city with the biggest
> population.
>
> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sr...@gmail.com> wrote:
>
>> As Mich says, isn't this just max by population partitioned by country in
>> a window function?
>>
>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <
>> oliverr@broadinstitute.org> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   Thank you for the response!
>>>
>>>   I can think of two ways to get the largest city by country, but both
>>> seem to be inefficient:
>>>
>>>   (1) I could group by country, sort each group by population, add the
>>> row number within each group, and then retain only cities with a row number
>>> equal to 1. But it seems wasteful to sort everything when I only want the
>>> largest of each country
>>>
>>>   (2) I could group by country, get the maximum city population for each
>>> country, join that with the original data frame, and then retain only
>>> cities with population equal to the maximum population in the country. But
>>> that seems also expensive because I need to join.
>>>
>>>   Am I missing something?
>>>
>>>   Thanks!
>>>
>>>      Best, Oliver
>>>
>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> In spark you can use windowing function
>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>> achieve this
>>>>
>>>> HTH
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>> oliverr@broadinstitute.org> wrote:
>>>>
>>>>>
>>>>>      Hello,
>>>>>
>>>>>   How can I retain from each group only the row for which one value is
>>>>> the maximum of the group? For example, imagine a DataFrame containing all
>>>>> major cities in the world, with three columns: (1) City name (2) Country
>>>>> (3) population. How would I get a DataFrame that only contains the largest
>>>>> city in each country? Thanks!
>>>>>
>>>>>      Best, Oliver
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>> <http://www.broadinstitute.org/>
>>>>>
>>>>
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>

Re: [PySpark] Getting the best row from each group

Posted by Oliver Ruebenacker <ol...@broadinstitute.org>.
If we only wanted to know the biggest population, the max function would
suffice. The problem is that I also want the name of the city with the
biggest population.
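
One way to keep the city name without ranking or sorting every row is to
aggregate a struct whose first field is the population, so max() carries the
matching city along. A minimal sketch (untested, assuming a DataFrame df with
columns country, city and population):

from pyspark.sql import functions as F

# max() on a struct compares field by field, so putting population first
# makes the aggregation return the struct of the largest city per country.
largest = (df
           .groupBy("country")
           .agg(F.max(F.struct("population", "city")).alias("top"))
           .select("country", "top.city", "top.population"))

On Spark 3.3 or later, F.max_by("city", "population") inside the agg() should
do the same thing more directly.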

On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sr...@gmail.com> wrote:

> As Mich says, isn't this just max by population partitioned by country in
> a window function?
>
> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <
> oliverr@broadinstitute.org> wrote:
>
>>
>>      Hello,
>>
>>   Thank you for the response!
>>
>>   I can think of two ways to get the largest city by country, but both
>> seem to be inefficient:
>>
>>   (1) I could group by country, sort each group by population, add the
>> row number within each group, and then retain only cities with a row number
>> equal to 1. But it seems wasteful to sort everything when I only want the
>> largest of each country
>>
>>   (2) I could group by country, get the maximum city population for each
>> country, join that with the original data frame, and then retain only
>> cities with population equal to the maximum population in the country. But
>> that seems also expensive because I need to join.
>>
>>   Am I missing something?
>>
>>   Thanks!
>>
>>      Best, Oliver
>>
>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> In spark you can use windowing function
>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>> achieve this
>>>
>>> HTH
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>> oliverr@broadinstitute.org> wrote:
>>>
>>>>
>>>>      Hello,
>>>>
>>>>   How can I retain from each group only the row for which one value is
>>>> the maximum of the group? For example, imagine a DataFrame containing all
>>>> major cities in the world, with three columns: (1) City name (2) Country
>>>> (3) population. How would I get a DataFrame that only contains the largest
>>>> city in each country? Thanks!
>>>>
>>>>      Best, Oliver
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>> Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Posted by Sean Owen <sr...@gmail.com>.
As Mich says, isn't this just max by population partitioned by country in a
window function?
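
In DataFrame terms that could look roughly like this (an untested sketch; df,
the column names and the session are assumptions, not from the thread):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Max population per country as a window aggregate (no ordering needed),
# then keep the rows whose population equals that maximum.
w = Window.partitionBy("country")
largest = (df
           .withColumn("max_pop", F.max("population").over(w))
           .filter(F.col("population") == F.col("max_pop"))
           .drop("max_pop"))

Note that ties (two cities sharing the maximum population) would both be kept
with this variant.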

On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <ol...@broadinstitute.org>
wrote:

>
>      Hello,
>
>   Thank you for the response!
>
>   I can think of two ways to get the largest city by country, but both
> seem to be inefficient:
>
>   (1) I could group by country, sort each group by population, add the row
> number within each group, and then retain only cities with a row number
> equal to 1. But it seems wasteful to sort everything when I only want the
> largest of each country
>
>   (2) I could group by country, get the maximum city population for each
> country, join that with the original data frame, and then retain only
> cities with population equal to the maximum population in the country. But
> that seems also expensive because I need to join.
>
>   Am I missing something?
>
>   Thanks!
>
>      Best, Oliver
>
> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> In spark you can use windowing function
>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>> achieve this
>>
>> HTH
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>> oliverr@broadinstitute.org> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>>      Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>

Re: [PySpark] Getting the best row from each group

Posted by Oliver Ruebenacker <ol...@broadinstitute.org>.
     Hello,

  Thank you for the response!

  I can think of two ways to get the largest city by country, but both seem
to be inefficient:

  (1) I could group by country, sort each group by population, add the row
number within each group, and then retain only cities with a row number
equal to 1. But it seems wasteful to sort everything when I only want the
largest city of each country.

  (2) I could group by country, get the maximum city population for each
country, join that with the original data frame, and then retain only
cities with population equal to the maximum population in the country. But
that also seems expensive because it requires a join.
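
In code, approach (2) would be roughly the following (an untested sketch,
assuming a DataFrame df with columns country, city and population):

from pyspark.sql import functions as F

# Aggregate the max population per country, then join back and filter.
max_pop = df.groupBy("country").agg(F.max("population").alias("max_pop"))
largest = (df
           .join(max_pop, on="country")
           .filter(F.col("population") == F.col("max_pop"))
           .drop("max_pop"))

The aggregated side has only one row per country, so it is usually small
enough for Spark to broadcast it.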

  Am I missing something?

  Thanks!

     Best, Oliver

On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <mi...@gmail.com>
wrote:

> In spark you can use windowing function
> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
> achieve this
>
> HTH
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
> oliverr@broadinstitute.org> wrote:
>
>>
>>      Hello,
>>
>>   How can I retain from each group only the row for which one value is
>> the maximum of the group? For example, imagine a DataFrame containing all
>> major cities in the world, with three columns: (1) City name (2) Country
>> (3) population. How would I get a DataFrame that only contains the largest
>> city in each country? Thanks!
>>
>>      Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
>> Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Posted by Mich Talebzadeh <mi...@gmail.com>.
In Spark you can use windowing functions
<https://sparkbyexamples.com/spark/spark-sql-window-functions/> to
achieve this.
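
For example, one way to do that from PySpark is to register the DataFrame as
a temporary view and run a SQL window function over it (a rough sketch; the
view name, column names and the spark session are assumptions):

# assumes a DataFrame df with columns country, city, population
df.createOrReplaceTempView("cities")

largest = spark.sql("""
    SELECT country, city, population
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY country
                                  ORDER BY population DESC) AS rn
        FROM cities
    ) ranked
    WHERE rn = 1
""")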

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <ol...@broadinstitute.org>
wrote:

>
>      Hello,
>
>   How can I retain from each group only the row for which one value is the
> maximum of the group? For example, imagine a DataFrame containing all major
> cities in the world, with three columns: (1) City name (2) Country (3)
> population. How would I get a DataFrame that only contains the largest city
> in each country? Thanks!
>
>      Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>