Posted to user@spark.apache.org by lchorbadjiev <lu...@gmail.com> on 2018/10/29 13:25:08 UTC

dremel paper example schema

Hi,

I'm trying to reproduce the example from the Dremel paper
(https://research.google.com/pubs/archive/36632.pdf) in Apache Spark using
pyspark, and I wonder whether it is possible at all.

Following the paper's example as closely as possible, I created this
document type:

from pyspark.sql.types import *

# Links: the paper's repeated Backward/Forward fields, modeled as arrays.
# Note: IntegerType maps to Parquet int32; the paper's int64 would be LongType.
links_type = StructType([
    StructField("Backward", ArrayType(IntegerType(), containsNull=False),
                nullable=False),
    StructField("Forward", ArrayType(IntegerType(), containsNull=False),
                nullable=False),
])

# Language: required Code, optional Country.
language_type = StructType([
    StructField("Code", StringType(), nullable=False),
    StructField("Country", StringType()),
])

# Name: repeated Language entries plus an optional Url.
names_type = StructType([
    StructField("Language", ArrayType(language_type, containsNull=False)),
    StructField("Url", StringType()),
])

# Document: the top-level record.
document_type = StructType([
    StructField("DocId", LongType(), nullable=False),
    StructField("Links", links_type, nullable=True),
    StructField("Name", ArrayType(names_type, containsNull=False)),
])
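
For reference, the data is written along these lines (a minimal sketch: it
assumes an existing SparkSession named "spark", the record mirrors the
paper's r1 sample document, and the output path is illustrative):

# One record shaped like the paper's r1 sample document.
r1 = (
    10,                                   # DocId
    ([], [20, 40, 60]),                   # Links: Backward, Forward
    [                                     # Name
        ([("en-us", "us"), ("en", None)], "http://A"),
        (None, "http://B"),
        ([("en-gb", "gbr")], None),
    ],
)

df = spark.createDataFrame([r1], schema=document_type)
df.write.mode("overwrite").parquet("/tmp/dremel_document.parquet")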

But when I store data in parquet using this type, the resulting parquet
schema is different from the one described in the paper:

message spark_schema {
  required int64 DocId;
  optional group Links {
    required group Backward (LIST) {
      repeated group list {
        required int32 element;
      }
    }
    required group Forward (LIST) {
      repeated group list {
        required int32 element;
      }
    }
  }
  optional group Name (LIST) {
    repeated group list {
      required group element {
        optional group Language (LIST) {
          repeated group list {
            required group element {
              required binary Code (UTF8);
              optional binary Country (UTF8);
            }
          }
        }
        optional binary Url (UTF8);
      }
    }
  }
}

Moreover, if I create a parquet file with the schema described in the Dremel
paper using the Apache Parquet Java API and try to read it into Apache
Spark, I get an exception:

org.apache.spark.sql.execution.QueryExecutionException: Encounter error
while reading parquet files. One possible cause: Parquet column cannot be
converted in the corresponding files
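
The read itself is just the following (a sketch; the path is illustrative).
The exception shows up once the query actually executes:

# Read the file written with the Parquet Java API (path illustrative).
df = spark.read.parquet("/tmp/dremel_paper_schema.parquet")
df.show()  # the QueryExecutionException above is raised here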

Is it possible to create the example schema described in the Dremel paper
using Apache Spark, and if so, what is the correct approach to build this
example?

Regards,
Lubomir Chorbadjiev



Re: dremel paper example schema

Posted by Gourav Sengupta <go...@gmail.com>.
Super,

Now it makes sense. I am copying Holden on this email.

Regards,
Gourav


Re: dremel paper example schema

Posted by lchorbadjiev <lu...@gmail.com>.
Hi Jorn,

Thanks for the help. I switched to using Apache Parquet 1.8.3 and now Spark
successfully loads the parquet file.

Do you have any hints about the other part of my question? What is the
correct way to reproduce this schema:

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required binary Code;
      optional binary Country;
    }
    optional binary Url;
  }
}

using Apache Spark SQL types?
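
For reference, the closest I have come for the Links group is the following
sketch (using LongType so the element type matches the paper's int64; Spark
still wraps the arrays in the LIST structure rather than writing bare
repeated fields):

from pyspark.sql.types import StructType, StructField, ArrayType, LongType

# Sketch: the paper's "repeated int64" modeled as arrays of LongType.
links_type = StructType([
    StructField("Backward", ArrayType(LongType(), containsNull=False),
                nullable=False),
    StructField("Forward", ArrayType(LongType(), containsNull=False),
                nullable=False),
])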

Thanks,
Lubomir Chorbadjiev



Re: dremel paper example schema

Posted by Jörn Franke <jo...@gmail.com>.
I would try with the same version as Spark uses first. I don't have the changelog of Parquet in my head (but you can find it on the Internet), but a version mismatch could be the cause of your issues.


Re: dremel paper example schema

Posted by lchorbadjiev <lu...@gmail.com>.
Hi Jorn,

I am using Apache Spark 2.3.1.

To create the parquet file I used Apache Parquet (parquet-mr) 1.10. This
does not match the version of Parquet used in Apache Spark 2.3.1, so if you
think that this could be the problem, I could try Apache Parquet version
1.8.3.

I created a parquet file using Apache Spark SQL types, but cannot make the
resulting schema match the schema described in the paper.

What I do is use the Spark SQL array type for repeated values. For example,
where the paper says

    repeated int64 Backward;

I use array type:

    StructField("Backward", ArrayType(IntegerType(), containsNull=False),
                nullable=False)

The resulting schema, as reported by parquet-tools, is:

    optional group backward (LIST) {
      repeated group list {
        required int32 element;
      }
    }
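
From what I can tell, this list/element wrapper is the standard three-level
Parquet LIST encoding that Spark uses for ArrayType. As a sketch, the
legacy writer setting below produces the older two-level encoding instead
(a repeated "array" field inside the group), though still not the paper's
bare repeated field:

# Sketch; assumes an existing SparkSession "spark" and a DataFrame "df"
# with the document schema. Writes lists in the legacy (Spark 1.4-era)
# Parquet encoding instead of the list/element wrapper.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.mode("overwrite").parquet("/tmp/document_legacy.parquet")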

Thanks,
Lubomir Chorbadjiev



Re: dremel paper example schema

Posted by Jörn Franke <jo...@gmail.com>.
Are you using the same parquet version as Spark uses? Are you using a recent version of Spark? Why don’t you create the file in Spark?



Re: dremel paper example schema

Posted by lchorbadjiev <lu...@gmail.com>.
Hi Gourav,

the question, in fact, is whether there are any limitations in Apache
Spark's support for the Parquet file format.

The example schema from the Dremel paper is supported in Apache Parquet
(using the Apache Parquet Java API).

Now I am trying to implement the same schema using Apache Spark SQL types,
but without much success. That is probably not unexpected.

What was unexpected is that Apache Spark cannot read a parquet file with
the Dremel example schema.

There are probably some limitations on what Apache Spark can support from
the Apache Parquet file format, but it is not obvious to me what these
limitations are.

Thanks,
Lubomir Chorbadjiev



Re: dremel paper example schema

Posted by Debasish Das <de...@gmail.com>.
The open source implementation of Dremel is Parquet!


Re: dremel paper example schema

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

why not just use Dremel?

Regards,
Gourav Sengupta
