Posted to reviews@spark.apache.org by RotemShaul <gi...@git.apache.org> on 2016/06/18 12:55:41 UTC
[GitHub] spark pull request #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support S...
GitHub user RotemShaul opened a pull request:
https://github.com/apache/spark/pull/13761
[SPARK-12197] [SparkCore] Kryo & Avro - Support Schema Repo
## What changes were proposed in this pull request?
Adds a SchemaRepository for Avro schemas when using Spark Core with GenericRecords.
## How was this patch tested?
Unit tests and manual tests are committed.
We have also applied this change in our private repo and used it in our application.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/RotemShaul/spark AvroSchemaRepo2.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13761.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13761
----
commit c84958e9785c67d83922b17f62659444f3003156
Author: rotems <ro...@liveperson.com>
Date: 2016-06-18T07:52:08Z
[SPARK-12197] [SparkCore] Kryo & Avro - Support Schema Repo
commit 70dc015ffabc78a1ac436469eabc472f23c38ec7
Author: rotems <ro...@liveperson.com>
Date: 2016-06-18T07:55:24Z
Merge remote-tracking branch 'remotes/sparkOriginMaster/master' into AvroSchemaRepo2.0
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support Schema R...
Posted by RotemShaul <gi...@git.apache.org>.
Github user RotemShaul commented on the issue:
https://github.com/apache/spark/pull/13761
Indeed it is, but then you lose the GenericAvroSerializer capabilities that
come out of the box with Spark (caching and registration of static schemas).
Since Spark already chose to (partially) support Avro from within Spark Core,
it makes sense to me that it should also support schema repositories, as they
are very common among Avro users for dealing with schema evolution.
It was this partial support that actually sparked the idea of 'if they
support registering Avro schemas, why not go all the way?', and that's
why I created the PR in the first place.
Users of Avro GenericRecords with Spark Core will always face the
schema serialization problem; some may be able to solve it with
static schemas, while others will need the dynamic solution. It makes sense
for Spark Core to provide a solution for both use cases or for neither
(leaving it to be resolved by a custom serializer).
Just my opinion. At my current workplace I took your
GenericAvroSerializer, added a few lines of code to it, and used it as a
custom serializer. But it could be generalized, hence the PR.
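The custom-serializer approach described above can be sketched roughly as follows. Note that `SchemaRepository` and its methods are hypothetical names for illustration (the actual repository interface in the PR may differ); the Kryo and Avro calls are standard API.

```scala
import java.io.ByteArrayOutputStream

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical repository interface: maps a schema to a stable id and back.
trait SchemaRepository {
  def idFor(schema: Schema): Long
  def schemaFor(id: Long): Schema
}

// Kryo serializer that writes only the schema id plus the Avro-encoded body,
// instead of embedding the full schema JSON with every record.
class RepoBackedAvroSerializer(repo: SchemaRepository)
    extends Serializer[GenericRecord] {

  override def write(kryo: Kryo, output: Output, record: GenericRecord): Unit = {
    val schema = record.getSchema
    output.writeLong(repo.idFor(schema))           // schema id only
    val bytes = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(bytes, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    output.writeInt(bytes.size())
    output.writeBytes(bytes.toByteArray)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[GenericRecord]): GenericRecord = {
    val schema = repo.schemaFor(input.readLong())  // look the schema up by id
    val payload = input.readBytes(input.readInt())
    val decoder = DecoderFactory.get().binaryDecoder(payload, null)
    new GenericDatumReader[GenericRecord](schema).read(null, decoder)
  }
}
```

The design point is that the wire format carries a fixed-size id rather than the full schema, so per-record overhead stays constant even as schemas evolve.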
On Sat, Jul 23, 2016 at 5:14 AM, Reynold Xin <no...@github.com> wrote:
> @RotemShaul <https://github.com/RotemShaul> is this something doable by
> implementing a custom serializer outside Spark?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/13761#issuecomment-234693402>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AHlUNO0V3mMbYBJNDLLn3HEVMlxD4vXYks5qYXkhgaJpZM4I48Q9>
> .
>
[GitHub] spark issue #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support Schema R...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/13761
@RotemShaul is this something doable by implementing a custom serializer outside Spark?
[GitHub] spark pull request #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support S...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/13761
[GitHub] spark issue #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support Schema R...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/13761
I generally wouldn't open a PR three times for one issue when it's not getting traction, which is why the JIRA was closed. This is decent discussion, but if nobody's on board with the change this time, please let's leave it closed.
[GitHub] spark issue #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support Schema R...
Posted by RotemShaul <gi...@git.apache.org>.
Github user RotemShaul commented on the issue:
https://github.com/apache/spark/pull/13761
Sure - I thought it was closed because the PR got old and had conflicts.
This PR basically generalizes an already implemented solution to the problem of schema serialization overhead. The solution introduced in Spark 1.5 solves the problem for known static schemas (via the spark.avro.registeredSchemas property); this adds the ability to solve it for dynamic schemas as well (using a repo).
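For context, the static path that has existed since Spark 1.5 registers known schemas up front on the SparkConf; a rough sketch (the schema itself is illustrative, not from the PR):

```scala
import org.apache.avro.Schema
import org.apache.spark.SparkConf

// An example schema; in practice this would be your application's schema.
val eventSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "Event", "fields": [
       {"name": "key",  "type": "string"},
       {"name": "body", "type": "bytes"}
     ]}""")

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Pre-register known schemas so Kryo can ship a compact fingerprint
  // instead of the full schema JSON with every record.
  .registerAvroSchemas(eventSchema)
```

This works only when every schema is known at job-submission time, which is exactly the limitation the dynamic repository lookup in this PR is meant to lift.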
[GitHub] spark issue #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support Schema R...
Posted by RotemShaul <gi...@git.apache.org>.
Github user RotemShaul commented on the issue:
https://github.com/apache/spark/pull/13761
Hi - this is the JIRA ticket that was closed as 'Won't Fix' because the PR had conflicts; I created a new, clean one.
[GitHub] spark issue #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support Schema R...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13761
Can one of the admins verify this patch?
[GitHub] spark issue #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support Schema R...
Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/13761
Don't `Dataset`s and `Encoder`s make this less relevant? What would be the use case here?
[GitHub] spark issue #13761: [SPARK-12197] [SparkCore] Kryo & Avro - Support Schema R...
Posted by RotemShaul <gi...@git.apache.org>.
Github user RotemShaul commented on the issue:
https://github.com/apache/spark/pull/13761
Not as far as I understand. I'll explain the use case:
1. We do not want to serialize the Avro schema of our processed events,
only its id.
2. Our job can process multiple events with different schemas in a single
RDD. (For instance, we don't use Avro specific records, so an event type
can't be the element type of the Dataset; it would still have to remain
Dataset[GenericRecord], not Dataset[Event]. Also, Avro's GenericRecord and
SpecificRecord are not case classes.)
3. Most important: we do not know the schemas ahead of time, and we should
be able to process an RDD with multiple different schemas (which are not
known to us ahead of time, as they are constantly evolving).
The whole point of using GenericRecords in Avro rather than specific records
is that we don't know the schemas ahead of time; we are processing events,
each with a different schema version (according to Avro's schema resolution
and evolution rules).
Our specific use case is a kind of sessionization of events by key. We don't
do analytics or aggregations; we take an input RDD[GenericRecord], return
RDD[Frame[K, Iterator[GenericRecord]]], and store that output somewhere.
Since we only do a groupBy, we do not care about the events' body content.
(We're infrastructure; the key is a single field that sits in a header that
doesn't change.) We don't access any body fields, so we don't care about the
evolving body schema and touch no fields other than the key.
If we were to put each GenericRecord into an Avro-generated specific record
(or some case class), we'd have to know all the fields ahead of time, and
encoding into that class would fail, or be only partial, whenever events
gain new fields or lose old ones.
If a Dataset basically has a tabular format behind the scenes, we can't have
such a table format for our data set, as our 'table' is dynamic: each event
has a different schema. Taking the superset schema would result in new empty
columns, effectively changing the events.
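The key-only grouping described above can be sketched as follows. The field name "sessionKey" is a hypothetical stand-in for the stable header field; only that field is read, so body-schema evolution never affects this job.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.spark.rdd.RDD

// Sessionize by a stable header field without touching the evolving body.
// Because we never look at body fields, records with many different schema
// versions can coexist in the same RDD.
def sessionize(events: RDD[GenericRecord]): RDD[(String, Iterable[GenericRecord])] =
  events.groupBy(record => record.get("sessionKey").toString)
```

Shuffling this groupBy is exactly where schema serialization overhead bites: without schema-id support, each shuffled record would carry its full schema.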
On Mon, Jun 20, 2016 at 8:35 AM, Herman van Hovell <notifications@github.com> wrote:
> Don't Datasets and Encoders make this less relevant? What would be the
> use case here?
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/13761#issuecomment-227053724>, or mute
> the thread
> <https://github.com/notifications/unsubscribe/AHlUNCQ6FThi-DA7V9gkqvUrgG3c4cprks5qNiasgaJpZM4I48Q9>
> .
>