Posted to user@gobblin.apache.org by Abhishek Tiwari <fi...@gmail.com> on 2018/03/23 18:49:19 UTC

Re: TimeBasedPartitioner for Hive ORC Files

I think you need an implementation of TimeBasedWriterPartitioner; e.g.,
refer to TimeBasedAvroWriterPartitioner.
The partitioning will then happen on the Gobblin side rather than in the
ORC SerDe, so you wouldn't need to modify or extend it.

Abhishek
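
To illustrate what partitioning "at Gobblin side" amounts to, here is a minimal, dependency-free sketch: the timestamp returned for a record (epoch milliseconds) is formatted into a directory-style partition path. The class name PartitionPathSketch, the method partitionPath, the yyyy/MM/dd pattern and the UTC zone are all illustrative assumptions, not Gobblin's actual code or defaults.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-alone sketch; not part of Gobblin.
class PartitionPathSketch {
  // Formats an epoch-millisecond timestamp into a partition path
  // using the given date pattern and time zone.
  static String partitionPath(long epochMillis, String pattern, ZoneId zone) {
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern).withZone(zone);
    return formatter.format(Instant.ofEpochMilli(epochMillis));
  }

  public static void main(String[] args) {
    long ts = 1521830959000L; // 2018-03-23T18:49:19Z
    System.out.println(partitionPath(ts, "yyyy/MM/dd", ZoneOffset.UTC));
  }
}
```

In this sketch the only record-specific step is obtaining the timestamp, which is the part a TimeBasedWriterPartitioner subclass supplies through getRecordTimestamp.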

On Mon, Dec 11, 2017 at 7:35 PM, Prateek Gupta <pr...@myntra.com>
wrote:

> Any advice or information aimed at resolving this difficulty would be
> appreciated.
>
> On 8 Dec 2017 11:28 a.m., "Prateek Gupta" <pr...@myntra.com>
> wrote:
>
>> Hi,
>>
>> How can one go about creating the TimeBasedPartitioner for Hive ORC
>> files? Class *OrcSerdeRow* is *final* and available at *package* level
>> only!
>>
>> Thanks & Regards,
>> Prateek Gupta
>>
>

Re: TimeBasedPartitioner for Hive ORC Files

Posted by Prateek Gupta <pr...@myntra.com>.

Thanks for the response, Abhishek!
Please find below the code for our implementation of
TimeBasedOrcWriterPartitioner. The *hitch* was that we could not directly
access the field *realRow* of class *OrcSerdeRow*, as it is a *final* class
and available at *package* level only.

import java.lang.reflect.Field;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Pre-Apache gobblin.* package names, matching the State class used below.
import gobblin.configuration.State;
import gobblin.writer.partitioner.TimeBasedWriterPartitioner;

public class TimeBasedOrcWriterPartitioner extends TimeBasedWriterPartitioner<Object> {
  private static final Log LOG = LogFactory.getLog(TimeBasedOrcWriterPartitioner.class);
  // Name of the package-level field in OrcSerdeRow that holds the row values.
  private static final String REAL_ROW_FIELD = "realRow";
  // Position of the timestamp column in the row; specific to our schema.
  private static final int TIMESTAMP_INDEX = 1;

  public TimeBasedOrcWriterPartitioner(State state, int numBranches, int branchId) {
    super(state, numBranches, branchId);
  }

  @Override
  public long getRecordTimestamp(Object orcRecord) {
    List<?> realRow = extractRealRow(orcRecord);
    long timestamp = (Long) realRow.get(TIMESTAMP_INDEX);
    LOG.debug("Timestamp of the OrcSerdeRow: " + timestamp);
    return timestamp;
  }

  // OrcSerdeRow is final and package-level, so the field is read via reflection.
  private static List<?> extractRealRow(Object orcRecord) {
    try {
      Field f = orcRecord.getClass().getDeclaredField(REAL_ROW_FIELD);
      f.setAccessible(true);
      return (List<?>) f.get(orcRecord);
    } catch (NoSuchFieldException | IllegalAccessException e) {
      // Fail with the cause attached instead of continuing with a null row.
      throw new IllegalStateException(
          "Cannot read field " + REAL_ROW_FIELD + " from " + orcRecord.getClass().getName(), e);
    }
  }
}
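
The reflective field access used above can be tried without Gobblin or Hive on the classpath. The following is a minimal sketch under stated assumptions: FakeRow is a hypothetical stand-in for a class like OrcSerdeRow, and ReflectiveRowReader and readTimestamp are illustrative names, not part of either library. It fails with a descriptive exception rather than a NullPointerException when the field is missing or holds an unexpected type.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; FakeRow stands in for a class like OrcSerdeRow.
class ReflectiveRowReader {
  // Stand-in for a final class whose row field is not accessible to callers.
  static final class FakeRow {
    private final Object realRow;

    FakeRow(List<Object> values) {
      this.realRow = new ArrayList<>(values);
    }
  }

  // Reads the named field via reflection and returns the Long stored at the given index.
  static long readTimestamp(Object record, String fieldName, int index) {
    try {
      Field f = record.getClass().getDeclaredField(fieldName);
      f.setAccessible(true);
      Object value = f.get(record);
      if (!(value instanceof List)) {
        throw new IllegalStateException("Field " + fieldName + " is not a List: " + value);
      }
      Object ts = ((List<?>) value).get(index);
      if (!(ts instanceof Long)) {
        throw new IllegalStateException("Value at index " + index + " is not a Long: " + ts);
      }
      return (Long) ts;
    } catch (NoSuchFieldException | IllegalAccessException e) {
      throw new IllegalStateException("Cannot read field " + fieldName, e);
    }
  }

  public static void main(String[] args) {
    FakeRow row = new FakeRow(Arrays.<Object>asList("order-1", 1521830959000L));
    System.out.println(readTimestamp(row, "realRow", 1));
  }
}
```

Reading a non-public field this way ties the code to the internal layout of the target class, so a field rename or type change in a later release would only surface at runtime.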
