Posted to common-issues@hadoop.apache.org by "Colin Patrick McCabe (JIRA)" <ji...@apache.org> on 2012/11/29 09:40:58 UTC

[jira] [Commented] (HADOOP-9103) UTF8 class does not properly decode Unicode characters outside the basic multilingual plane

    [ https://issues.apache.org/jira/browse/HADOOP-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506325#comment-13506325 ] 

Colin Patrick McCabe commented on HADOOP-9103:
----------------------------------------------

I said:

bq. since we always encode/decode using hadoop.io.UTF8, and never anything else, there should be no problem...

I take this back; looks like we don't always encode/decode using {{hadoop.io.UTF8}}.  D'oh!
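
To make the mismatch concrete, here's a minimal standalone sketch of the two encoding paths (the U+1F600 test character and the scaffolding are mine, not from the patch):

{code}
import java.util.Arrays;
import org.apache.hadoop.io.UTF8;

public class EncodingMismatch {
  public static void main(String[] args) throws Exception {
    // A code point outside the basic multilingual plane; in UTF-16 it is
    // the surrogate pair D83D DE00.
    String name = new String(Character.toChars(0x1F600));

    // Path 1: String.getBytes, as in string2Bytes -- standard UTF-8,
    // one 4-byte sequence: F0 9F 98 80.
    byte[] viaString = name.getBytes("UTF-8");

    // Path 2: hadoop.io.UTF8, as in writeString -- each surrogate is
    // encoded separately (CESU-8 style), six bytes: ED A0 BD ED B8 80.
    UTF8 u = new UTF8(name);
    byte[] viaUtf8Class = Arrays.copyOf(u.getBytes(), u.getLength());

    System.out.println(Arrays.equals(viaString, viaUtf8Class)); // false
  }
}
{code}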

bq. Attached patch should fix this issue.

Nice.  Should we test for rejecting 5-byte and 6-byte sequences, since I notice you added some code to do that?
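
Something along these lines, say -- a JUnit-style sketch for TestUTF8, with me guessing that malformed input throws an IOException subclass (adjust to whatever the patch actually does):

{code}
import static org.junit.Assert.fail;

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.UTF8;

public void testRejectsFiveByteSequence() throws Exception {
  // 0xF8 introduces a 5-byte sequence, which RFC 3629 forbids.
  byte[] malformed = { (byte) 0xF8, (byte) 0x88, (byte) 0x80,
                       (byte) 0x80, (byte) 0x80 };
  DataOutputBuffer out = new DataOutputBuffer();
  out.writeShort(malformed.length);   // UTF8 values are length-prefixed
  out.write(malformed);

  DataInputBuffer in = new DataInputBuffer();
  in.reset(out.getData(), out.getLength());
  try {
    UTF8.readString(in);
    fail("malformed 5-byte sequence was not rejected");
  } catch (IOException expected) {
    // e.g. UTFDataFormatException
  }
}
{code}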

I'm also a little scared by the idea that we have differently-encoded byte[] arrays running around for the same file name strings.  We have to be very careful about this.  Unfortunately, we can't change the decoder to emit real UTF-8 (rather than CESU-8) without making a backwards-incompatible change, since, as INode.java reminds us,

{code}
   *  The name in HdfsFileStatus should keep the same encoding as this.
   *  if this encoding is changed, implicitly getFileInfo and listStatus in
   *  clientProtocol are changed; The decoding at the client
   *  side should change accordingly.
{code}
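
Concretely: if the server side started writing real UTF-8, an old client still decoding the CESU-8 way would mangle the name, and vice versa. A quick illustration of the decode side (my sketch, showing the strict JDK decoder's behavior):

{code}
public class MixedDecode {
  public static void main(String[] args) throws Exception {
    // CESU-8-style bytes for U+1F600, as hadoop.io.UTF8 writes them today.
    byte[] cesu8 = { (byte) 0xED, (byte) 0xA0, (byte) 0xBD,
                     (byte) 0xED, (byte) 0xB8, (byte) 0x80 };

    // A strict UTF-8 decoder (the JDK's, as of Java 7) rejects
    // surrogate-range byte sequences, so a client decoding with plain
    // new String(bytes, "UTF-8") sees U+FFFD replacement characters
    // instead of the original file name.
    String decoded = new String(cesu8, "UTF-8");
    System.out.println(decoded.contains("\uFFFD")); // true on Java 7+
  }
}
{code}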

I also wonder if this means that we need to hunt down all the places not using CESU-8.  Otherwise older clients are just not going to work with astral plane code points, even after this fix... However, we could do that in a separate JIRA, not here.
                
> UTF8 class does not properly decode Unicode characters outside the basic multilingual plane
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9103
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.20.1
>         Environment: SUSE LINUX
>            Reporter: yixiaohua
>            Assignee: Todd Lipcon
>         Attachments: FSImage.java, hadoop-9103.txt, ProblemString.txt, TestUTF8AndStringGetBytes.java, TestUTF8AndStringGetBytes.java
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> This is the log of the exception from the SecondaryNameNode:
> 2012-03-28 00:48:42,553 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.IOException: Found lease for
>  non-existent file /user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/????@???????????????
> ??????????tor.qzone.qq.com/keypart-00174
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFilesUnderConstruction(FSImage.java:1211)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:959)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:589)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
>         at java.lang.Thread.run(Thread.java:619)
> This is the log about the same file from the NameNode:
> 2012-03-28 00:32:26,528 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=boss,boss	ip=/10.131.16.34	cmd=create	src=/user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/  @?            tor.qzone.qq.com/keypart-00174	dst=null	perm=boss:boss:rw-r--r--
> 2012-03-28 00:37:42,387 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/  @?            tor.qzone.qq.com/keypart-00174. blk_2751836614265659170_184668759
> 2012-03-28 00:37:42,696 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/  @?            tor.qzone.qq.com/keypart-00174 is closed by DFSClient_attempt_201203271849_0016_r_000174_0
> 2012-03-28 00:37:50,315 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=boss,boss	ip=/10.131.16.34	cmd=rename	src=/user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/  @?            tor.qzone.qq.com/keypart-00174	dst=/user/boss/pgv/fission/task16/split/  @?            tor.qzone.qq.com/keypart-00174	perm=boss:boss:rw-r--r--
> After checking the code that saves the FSImage, I found a problem that may be a bug in the HDFS code; I paste it below:
> -------------this is the saveFSImage method in FSImage.java; I have marked the problem code------------
> /**
>    * Save the contents of the FS image to the file.
>    */
>   void saveFSImage(File newFile) throws IOException {
>     FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();
>     FSDirectory fsDir = fsNamesys.dir;
>     long startTime = FSNamesystem.now();
>     //
>     // Write out data
>     //
>     DataOutputStream out = new DataOutputStream(
>                                                 new BufferedOutputStream(
>                                                                          new FileOutputStream(newFile)));
>     try {
>       .........
>     
>       // save the rest of the nodes
>       saveImage(strbuf, 0, fsDir.rootDir, out);------------------problem
>       fsNamesys.saveFilesUnderConstruction(out);------------------problem  detail is below
>       strbuf = null;
>     } finally {
>       out.close();
>     }
>     LOG.info("Image file of size " + newFile.length() + " saved in " 
>         + (FSNamesystem.now() - startTime)/1000 + " seconds.");
>   }
>  /**
>    * Save file tree image starting from the given root.
>    * This is a recursive procedure, which first saves all children of
>    * a current directory and then moves inside the sub-directories.
>    */
>   private static void saveImage(ByteBuffer parentPrefix,
>                                 int prefixLength,
>                                 INodeDirectory current,
>                                 DataOutputStream out) throws IOException {
>     int newPrefixLength = prefixLength;
>     if (current.getChildrenRaw() == null)
>       return;
>     for(INode child : current.getChildren()) {
>       // print all children first
>       parentPrefix.position(prefixLength);
>       parentPrefix.put(PATH_SEPARATOR).put(child.getLocalNameBytes());------------------problem
>       saveINode2Image(parentPrefix, child, out);
>     }
>    ..........
>   }
>  // Helper function that writes an INodeUnderConstruction
>   // into the output stream
>   //
>   static void writeINodeUnderConstruction(DataOutputStream out,
>                                            INodeFileUnderConstruction cons,
>                                            String path) 
>                                            throws IOException {
>     writeString(path, out);------------------problem
>     ..........
>   }
>   
>   static private final UTF8 U_STR = new UTF8();
>   static void writeString(String str, DataOutputStream out) throws IOException {
>     U_STR.set(str);
>     U_STR.write(out);------------------problem 
>   }
>   /**
>    * Converts a string to a byte array using UTF8 encoding.
>    */
>   static byte[] string2Bytes(String str) {
>     try {
>       return str.getBytes("UTF8");------------------problem 
>     } catch(UnsupportedEncodingException e) {
>       assert false : "UTF8 encoding is not supported ";
>     }
>     return null;
>   }
> ------------------------------------------below is the explanation------------------------
> In the saveImage method, child.getLocalNameBytes() produces its bytes via str.getBytes("UTF8"),
> but in writeINodeUnderConstruction the bytes are produced via the UTF8 class.
> I ran a test with our garbled file name and found that the two byte arrays are not equal. When I used the UTF8 class in both places, the problem disappeared.
> I think this is a bug in HDFS or in UTF8.
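>
> For illustration, making both paths go through the UTF8 class could look like this (a sketch only; the attached FSImage.java is what I actually tested):
>
>   static byte[] string2Bytes(String str) {
>     // Encode with hadoop.io.UTF8 so this path and writeString produce
>     // identical (CESU-8-style) bytes for the same name.
>     UTF8 u = new UTF8(str);
>     // needs java.util.Arrays; copy only the valid portion of the buffer
>     return Arrays.copyOf(u.getBytes(), u.getLength());
>   }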

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira