Posted to dev@iceberg.apache.org by Peter Vary <pv...@cloudera.com.INVALID> on 2021/07/15 13:57:58 UTC

Reading metadata tables

Hi Team,

I am working to enable running queries on metadata tables through Hive.
I was able to load the correct metadata table through the Catalogs, and I created the TableScan, but I am stuck there ATM.

What is the recommended way to get the Records for the Schema defined by the MetadataTable using the Java API?
For data files we create our own readers, but I guess we already have a better way to do that for metadata.

Any pointers would be welcome.

Thanks,
Peter

Re: Reading metadata tables

Posted by Ryan Blue <bl...@tabular.io>.
Oh, I think it must be that Spark wraps StructLike to act like an
InternalRow, so we can reuse it for Record or the metadata rows. Would it
be possible to adapt the Record code to use StructLike instead?
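
As a minimal illustration of that wrapping idea (StructLikeFields below is a hypothetical helper for this sketch, not an Iceberg or Spark class):

import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.types.Types;

// Hypothetical adapter: expose name-based access over a positional StructLike
// row, the way Record.getField() does, so Record-oriented code could consume
// metadata rows without copying them into a GenericRecord first.
class StructLikeFields {
  private final List<Types.NestedField> fields;
  private final StructLike struct;

  StructLikeFields(Schema schema, StructLike struct) {
    this.fields = schema.asStruct().fields();
    this.struct = struct;
  }

  Object getField(String name) {
    // a linear scan is fine for a sketch; a real adapter would index by name
    for (int pos = 0; pos < fields.size(); pos++) {
      Types.NestedField field = fields.get(pos);
      if (field.name().equals(name)) {
        // javaClass() gives the internal Java class for the Iceberg type
        return struct.get(pos, field.type().typeId().javaClass());
      }
    }
    return null; // no field with this name in the schema
  }
}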

On Mon, Jul 19, 2021 at 10:23 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Thanks Ryan for checking this out!
>
> IcebergWritable wraps a Record into a Container and a Writable, so that is
> why I try to create a Record here.
>
> The problem is that the metadata table scan returns a StructLike and I
> have to match that with the metadata schema and then with the read schema.
>
> I have seen in the docs that metadata table queries already work for
> Spark, and I was hoping to avoid duplicating the work, but I could not find
> the relevant part of the code yet.
>
> Thanks Peter
>
>
> On Tue, 20 Jul 2021, 02:05 Ryan Blue, <bl...@tabular.io> wrote:
>
>> Peter,
>>
>> The "data" tasks produce records using Iceberg's Record class and the
>> internal representations. I believe that's what the existing Iceberg object
>> inspectors use. Couldn't you just wrap this with an IcebergWritable and use
>> the regular object inspectors?
>>
>> On Thu, Jul 15, 2021 at 8:53 AM Peter Vary <pv...@cloudera.com.invalid>
>> wrote:
>>
>>> I have put together a somewhat working solution:
>>>
>>> case METADATA:
>>>   return (CloseableIterable) CloseableIterable.transform(((DataTask) currentTask).rows(), row -> {
>>>     Record record = GenericRecord.create(readSchema);
>>>     List<Types.NestedField> tableFields = tableSchema.asStruct().fields();
>>>     for (int i = 0; i < row.size(); i++) {
>>>       Types.NestedField tableField = tableFields.get(i);
>>>       if (readSchema.findField(tableField.name()) != null) {
>>>         record.setField(tableField.name(), row.get(i, tableField.type().typeId().javaClass()));
>>>       }
>>>     }
>>>     return record;
>>>   });
>>>
>>> This works only for int/long/string etc. types, and it has problems
>>> with the Long->OffsetDateTime conversion and friends.
>>> I am almost sure there is an existing, better solution for this
>>> somewhere already :)
>>>
>>> On Jul 15, 2021, at 15:57, Peter Vary <pv...@cloudera.com> wrote:
>>>
>>> Hi Team,
>>>
>>> I am working to enable running queries on metadata tables through
>>> Hive.
>>> I was able to load the correct metadata table through the Catalogs, and I
>>> created the TableScan, but I am stuck there ATM.
>>>
>>> What is the recommended way to get the Records for the Schema defined
>>> by the MetadataTable using the Java API?
>>> For data files we create our own readers, but I guess we already have
>>> a better way to do that for metadata.
>>>
>>> Any pointers would be welcome.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular

Re: Reading metadata tables

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Thanks Ryan for checking this out!

IcebergWritable wraps a Record into a Container and a Writable, so that is
why I try to create a Record here.

The problem is that the metadata table scan returns a StructLike and I have
to match that with the metadata schema and then with the read schema.

I have seen in the docs that metadata table queries already work for
Spark, and I was hoping to avoid duplicating the work, but I could not find
the relevant part of the code yet.

Thanks Peter


On Tue, 20 Jul 2021, 02:05 Ryan Blue, <bl...@tabular.io> wrote:

> Peter,
>
> The "data" tasks produce records using Iceberg's Record class and the
> internal representations. I believe that's what the existing Iceberg object
> inspectors use. Couldn't you just wrap this with an IcebergWritable and use
> the regular object inspectors?
>
> On Thu, Jul 15, 2021 at 8:53 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> I have put together a somewhat working solution:
>>
>> case METADATA:
>>   return (CloseableIterable) CloseableIterable.transform(((DataTask) currentTask).rows(), row -> {
>>     Record record = GenericRecord.create(readSchema);
>>     List<Types.NestedField> tableFields = tableSchema.asStruct().fields();
>>     for (int i = 0; i < row.size(); i++) {
>>       Types.NestedField tableField = tableFields.get(i);
>>       if (readSchema.findField(tableField.name()) != null) {
>>         record.setField(tableField.name(), row.get(i, tableField.type().typeId().javaClass()));
>>       }
>>     }
>>     return record;
>>   });
>>
>> This works only for int/long/string etc. types, and it has problems
>> with the Long->OffsetDateTime conversion and friends.
>> I am almost sure there is an existing, better solution for this
>> somewhere already :)
>>
>> On Jul 15, 2021, at 15:57, Peter Vary <pv...@cloudera.com> wrote:
>>
>> Hi Team,
>>
>> I am working to enable running queries on metadata tables through Hive.
>> I was able to load the correct metadata table through the Catalogs, and I
>> created the TableScan, but I am stuck there ATM.
>>
>> What is the recommended way to get the Records for the Schema defined by
>> the MetadataTable using the Java API?
>> For data files we create our own readers, but I guess we already have a
>> better way to do that for metadata.
>>
>> Any pointers would be welcome.
>>
>> Thanks,
>> Peter
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>

Re: Reading metadata tables

Posted by Ryan Blue <bl...@tabular.io>.
Peter,

The "data" tasks produce records using Iceberg's Record class and the
internal representations. I believe that's what the existing Iceberg object
inspectors use. Couldn't you just wrap this with an IcebergWritable and use
the regular object inspectors?
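
A rough sketch of that suggestion, with assumed wiring (currentTask, Container, and IcebergWritable are names from this thread; the wrapper calls in the comments are assumptions, not confirmed API):

import org.apache.iceberg.DataTask;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.CloseableIterable;

// Metadata tables serve their rows directly through DataTask, so there is
// no data file to open and no file-format reader to build here.
CloseableIterable<StructLike> rows = ((DataTask) currentTask).rows();

// Assumed wiring: once each row is available as a Record, it could be put
// into the existing Writable container and handed to the regular Iceberg
// object inspectors, roughly:
//
//   container.set(record);  // Container: the Writable wrapper from iceberg-mr
//   // ... then pass the container down the usual Hive serde/inspector path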

On Thu, Jul 15, 2021 at 8:53 AM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> I have put together a somewhat working solution:
>
> case METADATA:
>   return (CloseableIterable) CloseableIterable.transform(((DataTask) currentTask).rows(), row -> {
>     Record record = GenericRecord.create(readSchema);
>     List<Types.NestedField> tableFields = tableSchema.asStruct().fields();
>     for (int i = 0; i < row.size(); i++) {
>       Types.NestedField tableField = tableFields.get(i);
>       if (readSchema.findField(tableField.name()) != null) {
>         record.setField(tableField.name(), row.get(i, tableField.type().typeId().javaClass()));
>       }
>     }
>     return record;
>   });
>
> This works only for int/long/string etc. types, and it has problems
> with the Long->OffsetDateTime conversion and friends.
> I am almost sure there is an existing, better solution for this
> somewhere already :)
>
> On Jul 15, 2021, at 15:57, Peter Vary <pv...@cloudera.com> wrote:
>
> Hi Team,
>
> I am working to enable running queries on metadata tables through Hive.
> I was able to load the correct metadata table through the Catalogs, and I
> created the TableScan, but I am stuck there ATM.
>
> What is the recommended way to get the Records for the Schema defined by
> the MetadataTable using the Java API?
> For data files we create our own readers, but I guess we already have a
> better way to do that for metadata.
>
> Any pointers would be welcome.
>
> Thanks,
> Peter
>
>
>

-- 
Ryan Blue
Tabular

Re: Reading metadata tables

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
I have put together a somewhat working solution:
case METADATA:
  // Metadata tables serve rows through DataTask; copy each StructLike row
  // into a GenericRecord so the existing Record-based path can consume it.
  return (CloseableIterable) CloseableIterable.transform(((DataTask) currentTask).rows(), row -> {
    Record record = GenericRecord.create(readSchema);
    List<Types.NestedField> tableFields = tableSchema.asStruct().fields();
    for (int i = 0; i < row.size(); i++) {
      Types.NestedField tableField = tableFields.get(i);
      // copy only the fields that are projected by the read schema
      if (readSchema.findField(tableField.name()) != null) {
        // javaClass() is the internal representation, e.g. Long for timestamps
        record.setField(tableField.name(), row.get(i, tableField.type().typeId().javaClass()));
      }
    }
    return record;
  });
This works only for int/long/string etc. types, and it has problems with the Long->OffsetDateTime conversion and friends.
I am almost sure there is an existing, better solution for this somewhere already :)
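
For what it's worth, a minimal sketch of the per-type conversion that gap needs, assuming the converters in org.apache.iceberg.util.DateTimeUtil (the switch below is illustrative, not the full mapping Iceberg uses internally):

import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.DateTimeUtil;

static Object convert(Type type, Object value) {
  if (value == null) {
    return null;
  }
  switch (type.typeId()) {
    case DATE:
      return DateTimeUtil.dateFromDays((Integer) value);         // int days -> LocalDate
    case TIME:
      return DateTimeUtil.timeFromMicros((Long) value);          // long micros -> LocalTime
    case TIMESTAMP:
      // TIMESTAMP covers both variants; shouldAdjustToUTC() picks between
      // the OffsetDateTime and LocalDateTime representations
      if (((Types.TimestampType) type).shouldAdjustToUTC()) {
        return DateTimeUtil.timestamptzFromMicros((Long) value); // long micros -> OffsetDateTime
      }
      return DateTimeUtil.timestampFromMicros((Long) value);     // long micros -> LocalDateTime
    default:
      return value; // int/long/String etc. already match the generic model
  }
}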

> On Jul 15, 2021, at 15:57, Peter Vary <pv...@cloudera.com> wrote:
> 
> Hi Team,
> 
> I am working to enable running queries on metadata tables through Hive.
> I was able to load the correct metadata table through the Catalogs, and I created the TableScan, but I am stuck there ATM.
> 
> What is the recommended way to get the Records for the Schema defined by the MetadataTable using the Java API?
> For data files we create our own readers, but I guess we already have a better way to do that for metadata.
> 
> Any pointers would be welcome.
> 
> Thanks,
> Peter