Posted to reviews@spark.apache.org by joesu <gi...@git.apache.org> on 2014/08/02 10:39:25 UTC

[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

GitHub user joesu opened a pull request:

    https://github.com/apache/spark/pull/1737

    [SPARK-2489] [SQL] Parquet support for fixed_len_byte_array

    Make it possible to read Parquet files with fixed_len_byte_array typed columns.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/joesu/spark feature-casefixedlen

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1737.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1737
    
----
commit 79bf8bb02be89e275d4fb3b2c0e1cfaf26ae8b92
Author: ChiaYung Su <ch...@appier.com>
Date:   2014-08-01T14:19:57Z

    initial commit

commit 40d422e7704c25fd60c0205a50189c50dd74d073
Author: ChiaYung Su <ch...@appier.com>
Date:   2014-08-01T14:44:15Z

    properly handle convert

commit 9f9ac6d69933061b3dc715754856ed09340cf8dd
Author: ChiaYung Su <ch...@appier.com>
Date:   2014-08-01T15:15:31Z

    fixed length array cast to string

commit 885c1c28fef03302ec167362d83c0391dda2c478
Author: ChiaYung Su <ch...@appier.com>
Date:   2014-08-02T07:16:37Z

    scala style

commit f15372fe6ac79a19a7fe9939b9d979ecca37735f
Author: ChiaYung Su <ch...@appier.com>
Date:   2014-08-02T07:43:21Z

    cast between binary and fixedlenbinary

commit f59458b941e37fb03904627f9c3adc406dee4a13
Author: ChiaYung Su <ch...@appier.com>
Date:   2014-08-02T08:12:50Z

    change type name

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-50957711
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-53632099
  
    ok to test




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-60827613
  
    The problem is that data types are a public API, so once we add one we are stuck with it forever.  Also, each new datatype adds significant overhead, so I'd like to be pretty cautious about adding them when they are just special cases of existing types.
    
    We are already exploring the pattern of a single datatype with multiple settings elsewhere.  There is a patch in the works that adds support for fixed and arbitrary precision decimal arithmetic using a single type.  So if it is possible to do here as well I think that would be good.
    
    If the concern is primarily reading data from existing systems, what about a smaller initial patch that allows Spark SQL to read fixed length binary data, but just uses the existing BinaryType?  We wouldn't be able to write out fixed length data, but this does seem like a good first step.
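    The "single datatype with multiple settings" pattern mentioned above can be sketched schematically in Scala (an illustration of the idea only, not Spark's actual DecimalType definition):

    ```scala
    // Schematic sketch of the "one type, multiple settings" pattern:
    // a single class covers many concrete fixed/arbitrary-precision variants.
    case class DecimalType(precision: Int, scale: Int)

    // One type, many configurations:
    val money = DecimalType(precision = 10, scale = 2)
    val exact = DecimalType(precision = 38, scale = 18)
    ```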




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by josephsu <gi...@git.apache.org>.
Github user josephsu commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-56555651
  
    I did some experiments on reusing the existing BinaryType, but it does not quite work as expected.
    
    BinaryType is currently a case object, so a length field added to it would be shared across all uses of the type.  We would have to change BinaryType to a case class to hold the length information; however, that breaks all code that assumes BinaryType is a case object.
    
    Do you have any suggestions?
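    The singleton-versus-per-column problem described above can be sketched like this (names are illustrative, not Spark's actual definitions):

    ```scala
    sealed trait DataType

    // A case object is a singleton: there is exactly one shared instance,
    // so it has nowhere to store a per-column fixed length.
    case object BinaryType extends DataType

    // A case class can carry the length per column, but every existing
    // `case BinaryType =>` pattern match then stops compiling.
    case class FixedBinaryType(length: Int) extends DataType
    ```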
    
    
    
    
    





[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-53641399
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19326/consoleFull) for   PR 1737 at commit [`f66e658`](https://github.com/apache/spark/commit/f66e658729b3ca84cd3fe62daf7e503a4f40ec7b).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class MutableLiteral(var value: Any, dataType: DataType, nullable: Boolean = true) `
      * `case class FixedLenByteArrayType( length:Int ) extends DataType with PrimitiveType `





[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-54916874
  
    @joesu, thanks for clarifying the issues with reading data from the parquet library.  I like the idea of adding a new field to `BinaryType`, `fixedLength: Option[Int]`, that could be used to distinguish these two storage representations.  We can have this field default to `None` so we don't break any existing code.  In particular, since both types are going to be represented as `Array[Byte]` elsewhere in the Spark SQL execution engine, this means we don't have to add any extra handling code.  This is purely an optimization when writing out data.
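    A minimal sketch of the suggested shape, assuming the field name `fixedLength` (illustrative only, not the actual Spark code):

    ```scala
    // BinaryType with an optional fixed length; defaulting to None keeps
    // existing construction sites working unchanged.
    case class BinaryType(fixedLength: Option[Int] = None)

    val variableLen = BinaryType()          // ordinary variable-length binary column
    val uuidColumn  = BinaryType(Some(16))  // a 16-byte fixed-length column
    ```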




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by joesu <gi...@git.apache.org>.
Github user joesu commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-53976828
  
    Another way is to include the max length information in the BinaryType type itself, just like the FixedLenByteArray type in this pull request.  That way we can maintain only one binary data type for both the fixed length and the variable length cases.  What do you think about this approach?





[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-53632075
  
    Hi @joesu, thanks for reporting and working on this issue.  Instead of creating a new datatype, what do you think about just reading in fixed length byte arrays as our already existing BinaryType?  This would give us compatibility without the added overhead of creating a new datatype.
    
    While I think it might be a reasonable optimization to add a fixed length byte type at some point in the future, doing so is a fairly major undertaking.  Basically every place in the code where we match on datatypes will need to be updated.  Therefore, before doing this I'd want to see a use case where the optimization paid off and a design doc on how we would implement it.




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-54694522
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by joesu <gi...@git.apache.org>.
Github user joesu commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-53976708
  
    It's not that straightforward to reuse BinaryType for handling parquet's binary and fixed_len_byte_array types, because these two types are incompatible in the parquet library and we have to specify the data type when reading data from parquet files through the library.  The parquet library refuses to read data if you ask it to read binary typed data from a fixed_len_byte_array typed field.
    
    If we really want to reuse BinaryType, we have to change all the Catalyst-to-Parquet type conversion functions (e.g. the convertFromAttributes() function in ParquetTypes.scala) to consider the underlying file schema when mapping BinaryType to the corresponding parquet types.  Do you have any suggested way to do this?
    
    In the long run we might want to optimize storage for common fixed length values like UUIDs, IPv6 addresses, MD5 hashes, etc.  Parquet prepends the data length to every value for regular binary typed fields, but stores the length only once in the metadata for fixed length byte array typed fields, so fixed length byte array fields are a good fit for storing fixed length data.
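    The schema-aware mapping described above could branch on an optional fixed length; the sketch below is purely illustrative (the names `ParquetBinary`, `ParquetFixedLenByteArray`, and `toParquetType` are invented here, and the real conversion in ParquetTypes.scala uses the parquet library's own type classes):

    ```scala
    // Stand-ins for the two incompatible parquet primitive types.
    sealed trait ParquetPrimitive
    case object ParquetBinary extends ParquetPrimitive
    case class ParquetFixedLenByteArray(length: Int) extends ParquetPrimitive

    // If BinaryType carried an optional fixed length, the
    // Catalyst-to-Parquet conversion could branch on it:
    def toParquetType(fixedLength: Option[Int]): ParquetPrimitive =
      fixedLength match {
        case Some(n) => ParquetFixedLenByteArray(n) // length stored once in file metadata
        case None    => ParquetBinary               // each value carries its own length
      }
    ```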
    





[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-57566129
  
    You are right that we would have to change BinaryType to a case class to hold this information, and then change the rest of the code to deal with that.  It is possible that we could play some tricks with the `unapply` method in the BinaryType companion object to minimize the changes to pattern matching code; I'd have to play around with it more to see whether that is actually feasible, though.
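    One way such an `unapply` trick might look, as a hypothetical sketch (not working Spark code; all names here are assumed):

    ```scala
    sealed trait DataType

    // A plain class so the companion object below can extend it.
    class BinaryType(val fixedLength: Option[Int]) extends DataType

    // The companion object doubles as the default variable-length instance,
    // and its custom unapply matches ANY BinaryType instance, so existing
    // pattern-matching code needs only small changes.
    object BinaryType extends BinaryType(None) {
      def apply(fixedLength: Option[Int] = None): BinaryType =
        new BinaryType(fixedLength)
      def unapply(dt: DataType): Option[Option[Int]] = dt match {
        case b: BinaryType => Some(b.fixedLength)
        case _             => None
      }
    }
    ```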




[GitHub] spark issue #1737: [SPARK-2489] [SQL] Parquet support for fixed_len_byte_arr...

Posted by mukunku <gi...@git.apache.org>.
Github user mukunku commented on the issue:

    https://github.com/apache/spark/pull/1737
  
    I'm using spark 2.2.0 and still have this issue:
    Illegal Parquet type: FIXED_LEN_BYTE_ARRAY




[GitHub] spark issue #1737: [SPARK-2489] [SQL] Parquet support for fixed_len_byte_arr...

Posted by spetsnaz84 <gi...@git.apache.org>.
Github user spetsnaz84 commented on the issue:

    https://github.com/apache/spark/pull/1737
  
    No updates on this ?
    We are still hitting 
    org.apache.spark.sql.AnalysisException: Illegal Parquet type: FIXED_LEN_BYTE_ARRAY;




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-54916880
  
    ok to test




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-65168627
  
    Thanks for working on this, but we are trying to clean up the PR queue (in order to make it easier for us to review).  Thus, I think we should close this issue for now and reopen it when it's ready for review.  I'm happy to discuss the implementation further whenever you have time :)




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1737




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by josephsu <gi...@git.apache.org>.
Github user josephsu commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-60708357
  
    This patch enables working with fixed_len_byte_array fields in existing parquet files, which should be helpful for users migrating from other parquet-based systems.  Must we reuse BinaryType to represent both the existing binary data type and the fixed length type?




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-53632466
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19326/consoleFull) for   PR 1737 at commit [`f66e658`](https://github.com/apache/spark/commit/f66e658729b3ca84cd3fe62daf7e503a4f40ec7b).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2489] [SQL] Parquet support for fixed_l...

Posted by josephsu <gi...@git.apache.org>.
Github user josephsu commented on the pull request:

    https://github.com/apache/spark/pull/1737#issuecomment-65169706
  
    No problem. Thanks for the heads up!

