You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by "Thon, Ingo" <in...@siemens.com> on 2015/02/06 15:53:45 UTC
Performance with har files
Hello,
I have a pretty huge har (Hadoop Archive) file containing roughly 600k files.
I want to convert the files in this archive, but strangely when I do something like:
RAW_LOGS = LOAD 'har://hdfs-192.....:8020/user/thon/logs.har/somfile.txt.gz';
Basically nothing happens and the grunt shell is blocking.
I was wondering whether this is a (known) performance issue or I do something wrong?
Thanks in advance
Ingo
AW: Performance with har files
Posted by "Thon, Ingo" <in...@siemens.com>.
Meanwhile the process crashed. The Log file shows an out of memory exception
================================================================================
Pig Stack Trace
---------------
ERROR 1200: Failed to create DataStorage
Failed to parse: Failed to create DataStorage
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1650)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1623)
at org.apache.pig.PigServer.registerQuery(PigServer.java:577)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:541)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.RuntimeException: Failed to create DataStorage
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:53)
at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:109)
at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:191)
at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:517)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:886)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 15 more
Caused by: java.net.ConnectException: Call From ........ failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:699)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1762)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:147)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:70)
... 28 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 52 more
================================================================================
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.net.URLDecoder.decode(URLDecoder.java:169)
at org.apache.hadoop.fs.HarFileSystem.decodeString(HarFileSystem.java:257)
at org.apache.hadoop.fs.HarFileSystem.decodeFileName(HarFileSystem.java:264)
at org.apache.hadoop.fs.HarFileSystem.access$200(HarFileSystem.java:50)
at org.apache.hadoop.fs.HarFileSystem$HarMetaData.parseMetaData(HarFileSystem.java:1200)
at org.apache.hadoop.fs.HarFileSystem$HarMetaData.access$000(HarFileSystem.java:1098)
at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:166)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:70)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:53)
at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:109)
at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:191)
at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:517)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:886)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1650)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1623)
at org.apache.pig.PigServer.registerQuery(PigServer.java:577)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
================================================================================
-----Ursprüngliche Nachricht-----
Von: Thon, Ingo [mailto:ingo.thon@siemens.com]
Gesendet: Freitag, 6. Februar 2015 15:54
An: user@pig.apache.org
Betreff: Performance with har files
Hello,
I have a pretty huge har (Hadoop Archive) file containing roughly 600k files.
I want to convert the files in this archive, but strangely when I do something like:
RAW_LOGS = LOAD 'har://hdfs-192.....:8020/user/thon/logs.har/somfile.txt.gz';
Basically nothing happens and the grunt shell is blocking.
I was wondering whether this is a (known) performance issue or I do something wrong?
Thanks in advance
Ingo
Re: Performance with har files
Posted by Pete Kelly <pk...@gmail.com>.
unsubscribe
On 6 February 2015 at 14:53, Thon, Ingo <in...@siemens.com> wrote:
> Hello,
>
> I have a pretty huge har (Hadoop Archive) file containing roughly 600k
> files.
> I want to convert the files in this archive, but strangely when I do
> something like:
> RAW_LOGS = LOAD
> 'har://hdfs-192.....:8020/user/thon/logs.har/somfile.txt.gz';
>
> Basically nothing happens and the grunt shell is blocking.
>
> I was wondering whether this is a (known) performance issue or I do
> something wrong?
>
>
> Thanks in advance
> Ingo
>
>