Posted to dev@hudi.apache.org by Shiyan Xu <xu...@gmail.com> on 2020/06/02 17:37:02 UTC

Re: [DISCUSS] Write failed records

Thank you for the feedback, Vinoth. I agree with your points. I also created a
small RFC to make it easier to align on the changes:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records

On Sun, May 24, 2020 at 12:06 AM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Raymond,
>
> Thanks for starting this discussion.
>
> Agree on 1. (We may also need some CLI support for inspecting bad records,
> and code samples for consuming them, etc.?)
>
> On 2, these places seem appropriate. We can figure it out in more detail
> when we get to implementation?
>
> On 3, +1 on logs. We should also define a standard schema for error
> records. I see some tricky issues to handle here for schema mismatch
> errors. E.g., if the core problem was a schema mismatch, then
> serializing/deserializing the error record without a working schema
> specific to that record may not be possible? Maybe we need the record data
> itself in some schemaless format, like JSON?
> I also wonder if we should write the error table as another internal
> HoodieTable (we are abstracting out HoodieTable, FileGroupIO, etc. anyway)?
>
> On 4, +1 again.
>
> On Fri, May 22, 2020 at 7:47 PM Shiyan Xu <xu...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I'd like to bring up this discussion around handling errors in Hudi write
> > paths.
> > https://issues.apache.org/jira/browse/HUDI-648
> >
> > I'm trying to gather some feedback on the implementation details.
> > 1. Error location
> > I'm thinking of writing the failed records to `.hoodie/errors/` in order to
> > a) encapsulate the data within the Hudi table for ease of management
> > b) make use of the existing dedicated directory
> >
> > 2. Write path
> > org.apache.hudi.client.HoodieWriteClient#postWrite
> > org.apache.hudi.client.HoodieWriteClient#completeCompaction
> > These 2 methods should be the places to persist the failed records held in
> > `org.apache.hudi.table.action.HoodieWriteMetadata#writeStatuses`
> > to the designated location
> >
> > 3. Format
> > Records should be written as logs (avro)
> >
> > 4. Metric
> > After writing the failed records, it should emit a metric with a basic count
> > of the errors written. This makes it easier for a monitoring system to pick
> > it up and send alerts.
> >
> > Foreseeably, some details may need to be adjusted throughout
> > development. To begin with, we can agree on a feasible plan at a high
> > level.
> >
> > Please kindly share your thoughts and feedback. Thank you.
> >
> >
> >
> > Regards,
> > Raymond
> >
>
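
One way to read the schema-mismatch concern in the quoted thread: wrap each
failed record in a small fixed envelope and carry the original payload as a
JSON string, so the envelope stays readable even when the record's own schema
is what failed. Below is a minimal sketch in Python, not Hudi code. The class
`ErrorRecord`, its field names, and the helpers `to_error_record` and
`to_json_line` are all hypothetical and do not come from Hudi or the RFC. The
envelope is also written as JSON here only to keep the sketch dependency-free;
the thread itself proposes Avro logs.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical fixed envelope for one failed record (not a Hudi class)."""
    record_key: str
    instant_time: str
    error_class: str
    error_message: str
    payload_json: str  # original record data, serialized without a schema


def to_error_record(record_key, instant_time, payload, error):
    """Wrap a failed record; default=str stringifies values JSON cannot encode."""
    return ErrorRecord(
        record_key=record_key,
        instant_time=instant_time,
        error_class=type(error).__name__,
        error_message=str(error),
        payload_json=json.dumps(payload, sort_keys=True, default=str),
    )


def to_json_line(error_record):
    """Serialize the envelope itself as a single JSON line."""
    return json.dumps(asdict(error_record), sort_keys=True)
```

Because the payload is an opaque string inside the envelope, a reader only
needs the envelope's fixed fields to list and filter errors, and can decode the
payload separately when it wants the record contents.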

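Points 1 and 4 of the proposal quoted above (error location and metric) can be
sketched together: write already-serialized error records under
`.hoodie/errors/` and return the number written, which a caller could then
report as the error-count metric. This is a minimal Python sketch under stated
assumptions, not Hudi code. The function names and the one-file-per-instant
naming (`<instant_time>.errors`) are hypothetical, and a local filesystem
stands in for whatever storage the table actually uses.

```python
from pathlib import Path


def error_file_path(base_path, instant_time):
    """Hypothetical layout: one error file per instant under .hoodie/errors/."""
    return Path(base_path) / ".hoodie" / "errors" / f"{instant_time}.errors"


def persist_errors(base_path, instant_time, serialized_errors):
    """Append serialized error records, one per line; return how many were written."""
    serialized_errors = list(serialized_errors)
    if not serialized_errors:
        return 0  # nothing failed, so no file is created
    path = error_file_path(base_path, instant_time)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as out:
        for line in serialized_errors:
            out.write(line + "\n")
    # The returned count is what a caller would emit as the metric.
    return len(serialized_errors)
```

Returning the count from the same call that writes the records keeps the metric
consistent with what was actually persisted.
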
Re: [DISCUSS] Write failed records

Posted by Vinoth Chandar <vi...@apache.org>.
Thanks! Will review and get back to you
