Posted to user@pig.apache.org by "Thon, Ingo" <in...@siemens.com> on 2015/02/06 15:53:45 UTC

Performance with har files

Hello,

I have a fairly large har (Hadoop Archive) file containing roughly 600k files.
I want to convert the files in this archive, but strangely, when I do something like:
RAW_LOGS = LOAD 'har://hdfs-192.....:8020/user/thon/logs.har/somfile.txt.gz';

basically nothing happens and the grunt shell blocks.

I was wondering whether this is a (known) performance issue or whether I am doing something wrong?
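For what it's worth, listing the archive with the plain HDFS shell should exercise the same client-side HarFileSystem metadata parse as Pig's LOAD, which may help isolate whether the archive itself or Pig is the bottleneck (the namenode address below is a placeholder):

```shell
# Listing a har:// path forces the client to read and parse the archive's
# _index/_masterindex metadata -- the same up-front step that Pig's LOAD
# triggers. Replace host/port with your actual namenode address.
hadoop fs -ls 'har://hdfs-namenode.example.com:8020/user/thon/logs.har' | head -n 5
```

If this listing also hangs or runs out of memory, the problem is in HarFileSystem rather than in Pig.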


Thanks in advance
Ingo


Re: Performance with har files

Posted by "Thon, Ingo" <in...@siemens.com>.
Meanwhile the process has crashed. The log file shows an out-of-memory exception:
================================================================================
Pig Stack Trace
---------------
ERROR 1200: Failed to create DataStorage

Failed to parse: Failed to create DataStorage
        at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
        at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1650)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1623)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:577)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:541)
        at org.apache.pig.Main.main(Main.java:156)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.RuntimeException: Failed to create DataStorage
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:53)
        at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:109)
        at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:191)
        at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:517)
        at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
        at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
        at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:886)
        at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
        at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
        at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
        at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
        at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
        at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
        ... 15 more
Caused by: java.net.ConnectException: Call From ........ failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1414)
        at org.apache.hadoop.ipc.Client.call(Client.java:1363)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
        at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:699)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1762)
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
        at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:147)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:70)
        ... 28 more
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
        at org.apache.hadoop.ipc.Client.call(Client.java:1381)
        ... 52 more
================================================================================
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. Java heap space

java.lang.OutOfMemoryError: Java heap space
        at java.net.URLDecoder.decode(URLDecoder.java:169)
        at org.apache.hadoop.fs.HarFileSystem.decodeString(HarFileSystem.java:257)
        at org.apache.hadoop.fs.HarFileSystem.decodeFileName(HarFileSystem.java:264)
        at org.apache.hadoop.fs.HarFileSystem.access$200(HarFileSystem.java:50)
        at org.apache.hadoop.fs.HarFileSystem$HarMetaData.parseMetaData(HarFileSystem.java:1200)
        at org.apache.hadoop.fs.HarFileSystem$HarMetaData.access$000(HarFileSystem.java:1098)
        at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:166)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:70)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:53)
        at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:109)
        at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:191)
        at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:517)
        at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
        at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
        at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:886)
        at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
        at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
        at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
        at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
        at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
        at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
        at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1650)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1623)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:577)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
================================================================================
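The OutOfMemoryError above is thrown while HarFileSystem parses the archive's index on the client, so one thing worth trying is simply giving the Pig client JVM more heap. A minimal sketch, assuming the stock `pig` launcher script, which reads the `PIG_HEAPSIZE` environment variable (value in MB):

```shell
# Give the Pig client JVM more heap before starting grunt.
# PIG_HEAPSIZE (in MB) is read by the stock pig launcher script and
# becomes the -Xmx setting for the client JVM; 4096 is an arbitrary
# starting point for an archive with ~600k index entries.
export PIG_HEAPSIZE=4096
```

Then launch `pig` as usual from the same shell.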



-----Original Message-----
From: Thon, Ingo [mailto:ingo.thon@siemens.com]
Sent: Friday, 6 February 2015 15:54
To: user@pig.apache.org
Subject: Performance with har files



Re: Performance with har files

Posted by Pete Kelly <pk...@gmail.com>.
unsubscribe
