Posted to hdfs-user@hadoop.apache.org by hisname <hi...@126.com> on 2012/10/26 03:03:21 UTC

python write to hdfs over thrift, gzip file changes

Hi all,
 
I want to write files to HDFS over Thrift.
If the file is a gzip or tar archive, I find that its size changes after uploading and it can no longer be unpacked with tar xvzf/xvf.
Plain text files work fine.
 
[hadoop@HOST s_cripts]$ echo $LANG
en_US.UTF-8
[hadoop@HOST s_cripts]$
[hadoop@HOST s_cripts]$ jps
25868 TaskTracker
9116 Jps
25928 HadoopThriftServer   #the thrift server
25749 JobTracker
25655 SecondaryNameNode
25375 NameNode
25495 DataNode
[hadoop@HOST s_cripts]$
[hadoop@HOST s_cripts]$ pwd
/home/hadoop/hadoop/src/contrib/thriftfs/s_cripts
[hadoop@HOST s_cripts]$ hadoop fs -ls log/ff.tar.gz
ls: Cannot access log/ff.tar.gz: No such file or directory.
[hadoop@HOST s_cripts]$ python hdfs.py
hdfs>> put ./my.tar.gz log/ff.tar.gz
<thrift.protocol.TBinaryProtocol.TBinaryProtocol instance at 0x2348e60>
in writeString :688
upload over:688
hdfs>> quit
[hadoop@HOST s_cripts]$ hadoop fs -ls log/ff.tar.gz
Found 1 items
-rw-r--r--   1 hadoop supergroup       1253 2012-10-25 08:57 /user/hadoop/log/ff.tar.gz    #notice the size here is 1253
[hadoop@HOST s_cripts]$ ls -l my.tar.gz
-rw-rw-r-- 1 hadoop hadoop 688 Oct 24 14:43 my.tar.gz     #notice the size here is 688
[hadoop@HOST s_cripts]$ file my.tar.gz
my.tar.gz: gzip compressed data, from Unix, last modified: Wed Oct 24 14:43:29 2012    #the file format
[hadoop@HOST s_cripts]$ hadoop fs -get log/ff.tar.gz .
[hadoop@HOST s_cripts]$ file ff.tar.gz
ff.tar.gz: data   #the file format
[hadoop@HOST s_cripts]$ tar xvzf ff.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
[hadoop@HOST s_cripts]$
[hadoop@HOST s_cripts]$ head -2 my.tar.gz |xxd
0000000: 1f8b 0800 118e 8750 0003 ed99 4d53 db30  .......P....MS.0
0000010: 1086 732d bf42 070e 7040 966c c78e 7da3  ..s-.B..p@.l..}.
0000020: 4006 2ec0 8c69 7be8 7418 c551 1c37 b2e4  @....i{.t..Q.7..
[hadoop@HOST s_cripts]$ head -2 ff.tar.gz |xxd
0000000: 1fef bfbd 0800 11ef bfbd efbf bd50 0003  .............P..
0000010: efbf bd4d 53ef bfbd 3010 efbf bd73 2def  ...MS...0....s-.
0000020: bfbd 4207 0e70 40ef bfbd 6cc7 8e7d efbf  ..B..p@...l..}..
0000030: bd40 062e efbf bdef bfbd 697b efbf bd74  .@........i{...t
0000040: 18ef bfbd 511c 37ef bfbd efbf bd65 0aef  ....Q.7......e..
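The pattern in the two dumps looks like a UTF-8 round trip: every byte that is not valid UTF-8 on its own (0x8b, 0x8e, 0x87, ...) has become the three bytes ef bf bd, which is U+FFFD encoded as UTF-8. A small sketch (my guess at what the server is effectively doing, not the actual server code) reproduces the corrupted dump exactly:

```python
# First 8 bytes of my.tar.gz, taken from the xxd dump above.
original = bytes.fromhex("1f8b0800118e8750")

# Decode as UTF-8 with replacement, then re-encode -- what happens if a
# server treats a binary payload as text. Each stray byte >= 0x80 becomes
# U+FFFD, which re-encodes as the three bytes ef bf bd.
corrupted = original.decode("utf-8", errors="replace").encode("utf-8")

print(corrupted.hex())  # 1fefbfbd080011efbfbdefbfbd50 -- same as ff.tar.gz
```

The 8-byte header grows to 14 bytes, which would also explain why the uploaded file is bigger than the original.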
 
The Thrift server and the hdfs.py client run on the same box (HOST).
If I use the hadoop shell commands to put/get the files, everything works.
It seems that the Thrift client writes the data in binary mode, but the Thrift server re-encodes it in some other charset before writing to HDFS: in the hex dumps above, every byte >= 0x80 of the original has been replaced by the sequence ef bf bd, which is why the uploaded copy is larger and no longer a valid gzip file.
Why do the uploaded files change? Thanks a lot!
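One workaround I am considering (untested; the helper names below are mine, not part of hdfs.py): base64-encode the payload on the client before passing it to writeString, so only ASCII bytes ever cross the Thrift string field and a UTF-8 round trip cannot mangle them. The file stored in HDFS would then be base64 text that has to be decoded again after hadoop fs -get:

```python
import base64

def to_wire(raw: bytes) -> str:
    # base64 output is pure ASCII, so it survives any UTF-8 decode/encode.
    return base64.b64encode(raw).decode("ascii")

def from_wire(text: str) -> bytes:
    # Reverse step, to run after downloading the file again.
    return base64.b64decode(text)

payload = bytes.fromhex("1f8b0800118e8750")  # gzip magic + header bytes
assert from_wire(to_wire(payload)) == payload  # round trip is lossless
```

Of course this doubles neither the problem nor solves it properly; the real fix would be for the server to write the bytes unmodified.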