You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Charles vinodh <mi...@gmail.com> on 2023/04/11 14:56:13 UTC

Non string type partitions

Hi Team,

We are running into the below error when we are trying to run a simple
query a partitioned table in Spark.

*MetaException(message:Filtering is supported only on partition keys
of type string)
*


Our the partition column has been to type *date *instead of string and
query is a very simple SQL as shown below.

*SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*

Any idea why spark mandates partition columns to be of type string?. Is
there a recommended work around for this issue?

Re: Non string type partitions

Posted by Bjørn Jørgensen <bj...@gmail.com>.
I guess that it has to do with indexing and partitioning data to nodes.
Have a look at data partitioning system design concept
<https://www.enjoyalgorithms.com/blog/data-partitioning-system-design-concept>
 and key range partitions
<https://martinfowler.com/articles/patterns-of-distributed-systems/key-range-partitions.html>


You can work around this by creating a temp view where date is casted to
string.

Note I did have to test this sometimes so I'm using .mode("overwrite") on
the file.


from pyspark.sql import SparkSession, Row
from datetime import date

spark = SparkSession.builder.getOrCreate()

mock_data = [
    Row(id=1, name="John", partition_col=date(2023, 4, 11)),
    Row(id=2, name="Jane", partition_col=date(2023, 4, 11)),
    Row(id=3, name="Alice", partition_col=date(2023, 4, 12)),
    Row(id=4, name="Bob", partition_col=date(2023, 4, 12)),
]

mock_df = spark.createDataFrame(mock_data)

parquet_data_path = "test_date"
mock_df.write.partitionBy("partition_col").mode("overwrite").parquet(parquet_data_path)

create_table_sql = f"""
CREATE TABLE IF NOT EXISTS my_table (
    id INT,
    name STRING)
USING parquet
PARTITIONED BY (partition_col DATE)
OPTIONS ('path' = '{parquet_data_path}')
"""

spark.sql(create_table_sql)

# temp view with the string partition column
create_view_sql = f"""
CREATE OR REPLACE TEMPORARY VIEW my_table_with_string_partition AS
SELECT *, CAST(partition_col AS STRING) AS partition_col_str FROM my_table;
"""
spark.sql(create_view_sql)

query = f"SELECT * FROM my_table_with_string_partition WHERE
partition_col_str = '2023-04-11';"
result = spark.sql(query)

result.show()


+---+----+-------------+-----------------+
| id|name|partition_col|partition_col_str|
+---+----+-------------+-----------------+
|  1|John|   2023-04-11|       2023-04-11|
|  2|Jane|   2023-04-11|       2023-04-11|
+---+----+-------------+-----------------+



lør. 15. apr. 2023 kl. 21:41 skrev Charles vinodh <mi...@gmail.com>:

>
> bumping this up again for suggestions?.. Is the official recommendation to
> not have *int* or *date* typed partition columns?
>
> On Wed, 12 Apr 2023 at 10:44, Charles vinodh <mi...@gmail.com>
> wrote:
>
>> There are  other distributed execution engines (like hive, trino) that do
>> support non-string data types for partition columns such as date and
>> integer.
>> Any idea why this restriction exists in Spark? ..
>>
>>
>> On Tue, 11 Apr 2023 at 20:34, Chitral Verma <ch...@gmail.com>
>> wrote:
>>
>>> Because the name of the directory cannot be an object, it has to be a
>>> string to create partitioned dirs like "date=2023-04-10"
>>>
>>> On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, <mi...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Hi Team,
>>>>
>>>> We are running into the below error when we are trying to run a simple
>>>> query a partitioned table in Spark.
>>>>
>>>> *MetaException(message:Filtering is supported only on partition keys of type string)
>>>> *
>>>>
>>>>
>>>> Our the partition column has been to type *date *instead of string and
>>>> query is a very simple SQL as shown below.
>>>>
>>>> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>>>>
>>>> Any idea why spark mandates partition columns to be of type string?. Is
>>>> there a recommended work around for this issue?
>>>>
>>>>
>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Non string type partitions

Posted by Charles vinodh <mi...@gmail.com>.
bumping this up again for suggestions?.. Is the official recommendation to
not have *int* or *date* typed partition columns?

On Wed, 12 Apr 2023 at 10:44, Charles vinodh <mi...@gmail.com> wrote:

> There are  other distributed execution engines (like hive, trino) that do
> support non-string data types for partition columns such as date and
> integer.
> Any idea why this restriction exists in Spark? ..
>
>
> On Tue, 11 Apr 2023 at 20:34, Chitral Verma <ch...@gmail.com>
> wrote:
>
>> Because the name of the directory cannot be an object, it has to be a
>> string to create partitioned dirs like "date=2023-04-10"
>>
>> On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, <mi...@gmail.com>
>> wrote:
>>
>>>
>>> Hi Team,
>>>
>>> We are running into the below error when we are trying to run a simple
>>> query a partitioned table in Spark.
>>>
>>> *MetaException(message:Filtering is supported only on partition keys of type string)
>>> *
>>>
>>>
>>> Our the partition column has been to type *date *instead of string and
>>> query is a very simple SQL as shown below.
>>>
>>> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>>>
>>> Any idea why spark mandates partition columns to be of type string?. Is
>>> there a recommended work around for this issue?
>>>
>>>
>>>

Re: Non string type partitions

Posted by Charles vinodh <mi...@gmail.com>.
There are  other distributed execution engines (like hive, trino) that do
support non-string data types for partition columns such as date and
integer.
Any idea why this restriction exists in Spark? ..


On Tue, 11 Apr 2023 at 20:34, Chitral Verma <ch...@gmail.com> wrote:

> Because the name of the directory cannot be an object, it has to be a
> string to create partitioned dirs like "date=2023-04-10"
>
> On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, <mi...@gmail.com>
> wrote:
>
>>
>> Hi Team,
>>
>> We are running into the below error when we are trying to run a simple
>> query a partitioned table in Spark.
>>
>> *MetaException(message:Filtering is supported only on partition keys of type string)
>> *
>>
>>
>> Our the partition column has been to type *date *instead of string and
>> query is a very simple SQL as shown below.
>>
>> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>>
>> Any idea why spark mandates partition columns to be of type string?. Is
>> there a recommended work around for this issue?
>>
>>
>>

Re: Non string type partitions

Posted by Chitral Verma <ch...@gmail.com>.
Because the name of the directory cannot be an object, it has to be a
string to create partitioned dirs like "date=2023-04-10"

On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, <mi...@gmail.com> wrote:

>
> Hi Team,
>
> We are running into the below error when we are trying to run a simple
> query a partitioned table in Spark.
>
> *MetaException(message:Filtering is supported only on partition keys of type string)
> *
>
>
> Our the partition column has been to type *date *instead of string and
> query is a very simple SQL as shown below.
>
> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>
> Any idea why spark mandates partition columns to be of type string?. Is
> there a recommended work around for this issue?
>
>
>