Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2019/08/14 09:49:44 UTC

[GitHub] [hadoop] sodonnel commented on issue #1028: HDFS-14617 - Improve fsimage load time by writing sub-sections to the fsimage index

URL: https://github.com/apache/hadoop/pull/1028#issuecomment-521180160
 
 
   > And we should make sure oiv tool works with this change. We can file another jira to address the oiv issue.
   
    I checked OIV, and it can load images that have the parallel sub-sections in the image index with no problems, and it does not produce any warnings. The reason is that this change simply adds additional sections to the image index, so we still have:
   
   ```
   INODE START_OFFSET LENGTH
     INODE_SUB START_OFFSET LENGTH
     INODE_SUB START_OFFSET LENGTH
     INODE_SUB START_OFFSET LENGTH
     ...
   INODE_DIR START_OFFSET LENGTH
     INODE_DIR_SUB START_OFFSET LENGTH
     INODE_DIR_SUB START_OFFSET LENGTH
     INODE_DIR_SUB START_OFFSET LENGTH
     ...
   ```
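    
    Each entry in that index is just a (name, offset, length) record. As a quick illustration, here is a minimal sketch that dumps the index using the same FSImageUtil.loadSummary call OIV itself uses (DumpImageIndex is a hypothetical helper, not part of this PR, and assumes the hadoop-hdfs classes are on the classpath):
    
    ```
    import java.io.RandomAccessFile;
    
    import org.apache.hadoop.hdfs.server.namenode.FSImageUtil;
    import org.apache.hadoop.hdfs.server.namenode.FsImageProto.FileSummary;
    
    public class DumpImageIndex {
      public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
          FileSummary summary = FSImageUtil.loadSummary(file);
          // With this change applied, the INODE_SUB / INODE_DIR_SUB entries
          // simply appear as extra rows alongside the existing sections.
          for (FileSummary.Section s : summary.getSectionsList()) {
            System.out.printf("%-16s offset=%d length=%d%n",
                s.getName(), s.getOffset(), s.getLength());
          }
        }
      }
    }
    ```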
   
    This means that a loader looking for certain sections does not care which other sections are present, provided it ignores them. For example, the OIV "delimited" processor uses this pattern:
   
   ```
       for (FileSummary.Section section : sections) {
         if (SectionName.fromString(section.getName()) == SectionName.INODE) {
           fin.getChannel().position(section.getOffset());
           is = FSImageUtil.wrapInputStreamForCompression(conf,
               summary.getCodec(), new BufferedInputStream(new LimitInputStream(
                   fin, section.getLength())));
           outputINodes(is);
         }
       }
   ```
   
    It loops over all the sections in the "FileSummary index" looking for the one it wants (INODE in the above example) and ignores all the others.
   
    In the case of the XML processor, which is probably the most important one, it works in a very similar way to how the namenode loads the image. It loops over all the sections and uses a switch statement to process the sections it is interested in, skipping the others:
   
   ```
        for (FileSummary.Section s : sections) {
           fin.getChannel().position(s.getOffset());
           InputStream is = FSImageUtil.wrapInputStreamForCompression(conf,
               summary.getCodec(), new BufferedInputStream(new LimitInputStream(
                   fin, s.getLength())));
   
           SectionName sectionName = SectionName.fromString(s.getName());
           if (sectionName == null) {
             throw new IOException("Unrecognized section " + s.getName());
           }
           switch (sectionName) {
           case NS_INFO:
             dumpNameSection(is);
             break;
           case STRING_TABLE:
             loadStringTable(is);
             break;
           case ERASURE_CODING:
             dumpErasureCodingSection(is);
             break;
           case INODE:
             dumpINodeSection(is);
             break;
           case INODE_REFERENCE:
             dumpINodeReferenceSection(is);
             break;
    
        <snipped>
    
           default:
             break;
           }
         }
         out.print("</fsimage>\n");
       }
   ```
   
    Note the default clause, which simply does nothing when the processor encounters a recognized section it is not interested in.
   
    I tested running the other processors (FileDistribution, DetectCorruption and Web), and they all worked with no issues.
   
    Two future improvements we could make in new Jiras are:
   
    1. Make the ReverseXML processor write out the sub-section headers so that it creates a parallel-enabled image (if the relevant settings are enabled).
   
    2. Investigate allowing OIV to process the image in parallel when the sub-sections are present in the index and parallel loading is enabled; a rough sketch of this idea is below.
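    
    For point 2, this is only a sketch of the idea, not an implementation: it assumes an uncompressed image, the SectionName.INODE_SUB value this change introduces, and a hypothetical SubSectionProcessor callback standing in for the real OIV processing:
    
    ```
    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    
    import org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf.SectionName;
    import org.apache.hadoop.hdfs.server.namenode.FsImageProto.FileSummary;
    import org.apache.hadoop.util.LimitInputStream;
    
    public class ParallelOivSketch {
    
      // Hypothetical, thread-safe callback; a real version would feed each
      // stream to the relevant OIV processor.
      interface SubSectionProcessor {
        void process(InputStream is) throws IOException;
      }
    
      static void processInodeSubSections(File imageFile,
          List<FileSummary.Section> sections, int threads,
          SubSectionProcessor processor) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        for (FileSummary.Section s : sections) {
          // The same skip-what-you-do-not-want pattern as the serial code.
          if (SectionName.fromString(s.getName()) != SectionName.INODE_SUB) {
            continue;
          }
          futures.add(pool.submit(() -> {
            // Each task opens its own file handle so concurrent seeks
            // cannot interfere with each other.
            try (FileInputStream fin = new FileInputStream(imageFile)) {
              fin.getChannel().position(s.getOffset());
              processor.process(new BufferedInputStream(
                  new LimitInputStream(fin, s.getLength())));
            }
            return null;
          }));
        }
        for (Future<?> f : futures) {
          f.get();  // surface any failure from the workers
        }
        pool.shutdown();
      }
    }
    ```
    
    Each worker needs its own file handle so the seeks do not interfere, the callback has to be thread-safe, and a compressed stream cannot be split at arbitrary offsets, so this would be limited to uncompressed images.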
