Posted to issues@impala.apache.org by "Gabriel Gatto (JIRA)" <ji...@apache.org> on 2019/02/06 15:39:00 UTC

[jira] [Created] (IMPALA-8167) REFRESH on NON-partitioned tables ALWAYS reads the block locations of all files, taking too long on BIG TABLES.

Gabriel Gatto created IMPALA-8167:
-------------------------------------

             Summary: REFRESH on NON-partitioned tables ALWAYS reads the block locations of all files, taking too long on BIG TABLES.
                 Key: IMPALA-8167
                 URL: https://issues.apache.org/jira/browse/IMPALA-8167
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog
    Affects Versions: Impala 2.12.0
            Reporter: Gabriel Gatto


A REFRESH on a NON-PARTITIONED table always fetches block locations by calling the "getFileBlockLocations" method on every file, regardless of whether any files are new.

We think the problem is located in the method "updateUnpartitionedTableFileMd".

This method always resets the partitions and adds a "new one" with NO FILEDESCRIPTORS. As a result, refreshPartitionFileMetadata(part) always has to read every file of the new partition to rebuild the information, so getFileBlockLocations is called for all the files, whether they are new or old.

This is confirmed by looking at the code:

 

  private void updateUnpartitionedTableFileMd() throws Exception {
    if (LOG.isTraceEnabled()) {
      LOG.trace("update unpartitioned table: " + getFullName());
    }
    resetPartitions();  // ---> DROP PARTITION WITH PREVIOUS FILEDESCRIPTOR INFO.
    org.apache.hadoop.hive.metastore.api.Table msTbl = getMetaStoreTable();
    Preconditions.checkNotNull(msTbl);
    addDefaultPartition(msTbl.getSd());
    HdfsPartition part = createPartition(msTbl.getSd(), null);  // ---> CREATES NEW PARTITION.
    addPartition(part);
    if (isMarkedCached_) part.markCached();

    LOG.info("Refreshing-updateUnpartitionedTableFileMd(): " + getFullName() +
             " Location: " + part.getLocation() +
             " FileDescriptors: " + part.getFileDescriptors().size());

    refreshPartitionFileMetadata(part);

    LOG.info("Refreshed-updateUnpartitionedTableFileMd(): " + getFullName() +
             " Location: " + part.getLocation() +
             " FileDescriptors: " + part.getFileDescriptors().size());
  }
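
One possible direction, sketched here under assumptions and not taken from the Impala code base: keep the file descriptors of the previous partition and run the expensive block-location lookup only for files whose size or modification time changed. FileMeta, FileDesc, BlockLocator and refresh are hypothetical names used for illustration, and the locator merely stands in for a call such as getFileBlockLocations.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IncrementalRefreshSketch {
  // Hypothetical stand-in for one entry of a directory listing.
  static final class FileMeta {
    final String path;
    final long length;
    final long modTime;
    FileMeta(String path, long length, long modTime) {
      this.path = path;
      this.length = length;
      this.modTime = modTime;
    }
  }

  // Hypothetical stand-in for a cached file descriptor with its block hosts.
  static final class FileDesc {
    final FileMeta meta;
    final List<String> blockHosts;
    FileDesc(FileMeta meta, List<String> blockHosts) {
      this.meta = meta;
      this.blockHosts = blockHosts;
    }
  }

  // Hypothetical stand-in for the expensive per-file block-location call.
  interface BlockLocator {
    List<String> locate(FileMeta file);
  }

  // Reuses a cached descriptor when path, length and mtime all match;
  // otherwise asks the locator. Files missing from the listing are dropped.
  static Map<String, FileDesc> refresh(Map<String, FileDesc> cached,
      List<FileMeta> listing, BlockLocator locator) {
    Map<String, FileDesc> result = new HashMap<>();
    for (FileMeta f : listing) {
      FileDesc old = cached.get(f.path);
      boolean unchanged = old != null
          && old.meta.length == f.length
          && old.meta.modTime == f.modTime;
      result.put(f.path, unchanged ? old : new FileDesc(f, locator.locate(f)));
    }
    return result;
  }

  public static void main(String[] args) {
    int[] lookups = {0};  // counts how often the expensive call runs
    BlockLocator locator = f -> {
      lookups[0]++;
      return Collections.singletonList("host-a");
    };
    List<FileMeta> listing = Arrays.asList(
        new FileMeta("/t/a", 10, 1), new FileMeta("/t/b", 20, 1));

    Map<String, FileDesc> first = refresh(new HashMap<>(), listing, locator);
    System.out.println("lookups after first refresh: " + lookups[0]);   // 2
    refresh(first, listing, locator);
    System.out.println("lookups after second refresh: " + lookups[0]);  // still 2
  }
}
```

In this sketch the second refresh of an unchanged listing performs no lookups at all. Whether length and modification time are a sufficient change check for a real table is an assumption that would need to be validated.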

 

Running examples:

1) First run, with no files added or changed.

[vera05.claro.amx:21000] > refresh prod_ar.aux_tas_call_details_rt02;

LOG:

I0206 11:18:16.581826 34494 HdfsTable.java:1333] Refreshing-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 Location: hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux *FileDescriptors: 0* 

I0206 11:25:35.748185 34494 HdfsTable.java:1340] Refreshed-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 Location: hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux *FileDescriptors: 148398*

 

2) Second run, 2 minutes after the first, with no files added or changed in between. Here we see that no file descriptors exist because of the resetPartitions() call, so all the files have to be read again.

[vera05.claro.amx:21000] > refresh prod_ar.aux_tas_call_details_rt02;

LOG:

 

I0206 11:27:54.086167 33902 HdfsTable.java:1333] Refreshing-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 Location: hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux *FileDescriptors: 0*

I0206 11:36:35.344233 33902 HdfsTable.java:1340] Refreshed-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 Location: hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux *FileDescriptors: 148398*
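
To put a number on the cost, the timestamps of the "Refreshing" and "Refreshed" log lines above can be subtracted. The small sketch below does only that arithmetic; the class and method names are hypothetical, and the per-file figure is a plain average that assumes the whole interval was spent rebuilding the 148398 descriptors.

```java
import java.time.Duration;
import java.time.LocalTime;

public class RefreshTimingSketch {
  // Elapsed time between two same-day log timestamps (HH:mm:ss.SSSSSS).
  static Duration elapsed(String start, String end) {
    return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
  }

  public static void main(String[] args) {
    int files = 148398;  // FileDescriptors count from the log lines
    Duration first = elapsed("11:18:16.581826", "11:25:35.748185");
    Duration second = elapsed("11:27:54.086167", "11:36:35.344233");

    System.out.println("first refresh:  " + first.getSeconds() + " s");   // 439 s
    System.out.println("second refresh: " + second.getSeconds() + " s");  // 521 s
    System.out.printf("avg per file: %.2f ms and %.2f ms%n",
        (double) first.toMillis() / files, (double) second.toMillis() / files);
  }
}
```

Both refreshes take more than seven minutes, roughly 3 ms per file, even though nothing changed between them.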



