Posted to user@pig.apache.org by Anastasis Andronidis <an...@hotmail.com> on 2013/09/27 16:36:05 UTC

Small files

Hello,

I am working on a very small project for my university and I have a small cluster with 2 worker nodes and 1 master node. I'm using Pig to do some calculations and I have a question regarding small files.

I have a UDF that reads a small input file (around 200 KB) and correlates it with data from HDFS. My first approach was to upload the small file to HDFS and then access it from my UDF via getCacheFiles().

Later, though, I needed to change things in this small file, which meant deleting it from HDFS, re-uploading it, and re-running Pig. Since I need to change this small file frequently, I wanted to bypass HDFS (all those writes and re-reads make multiple iterations of my script very slow), so what I did was:

=== pig script ===
%declare MYFILE     `cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'`

.... MyUDF( line, '$MYFILE') .....
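For reference, the backquoted command in the %declare can be tried on its own; on implementations that honor a multi-character RS (GNU awk, mawk) it joins the CRLF-terminated lines of the file into one "|"-separated string. The sample contents below are made up:

```shell
# Create a small CRLF-terminated sample file (hypothetical contents).
printf 'a\r\nb\r\nc\r\n' > myfile.txt

# Same pipeline as in the %declare: split records on \r\n, join with |.
cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'
# With GNU awk or mawk this prints: a|b|c|
```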

In the beginning this worked great, but later (once my file grew beyond 100 KB) Pig would get stuck and I had to kill it:

2013-09-27 16:14:47,722 [main] INFO  org.apache.pig.tools.parameters.PreprocessorContext - Executing command : cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'
^C2013-09-27 16:15:28,102 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Error executing shell command: cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'. Command exit with exit code of 130

(By the way, is this a bug? Should it hang like that?)

How can I manage small files in cases like this, so that I don't need to re-upload everything to HDFS every time, and make my iterations faster?

Thanks,
Anastasis

Re: Small files

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Hi,

The log says that your command returned a non-zero exit code. Does it also return one when you invoke the command manually, outside of Pig?
Otherwise I'm afraid I don't have any other ideas.
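One way to run that manual check (with made-up sample contents standing in for the real file) is to invoke the same pipeline in a shell and inspect its exit status; note that 130 is the usual shell convention for 128 + SIGINT, i.e. a process interrupted by ^C:

```shell
# Run the same pipeline by hand and check its exit status.
printf 'line1\r\nline2\r\n' > myfile.txt
cat myfile.txt | awk 'BEGIN {ORS="|"; RS="\r\n"} {print $0}'
echo $?   # 0 means the pipeline itself exited cleanly; 130 = 128 + SIGINT (^C)
```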

Thanks


On Mon, Sep 30, 2013 at 10:37 AM, Anastasis Andronidis <
andronat_asf@hotmail.com> wrote:

> Hello again,
>
> any comments on this?
>
> Thanks,
> Anastasis

Re: Small files

Posted by TianYi Zhu <ti...@facilitatedigital.com>.
Hi Anastasis,

Have you tried mounting HDFS as a local directory via hdfs-fuse? That might
get around your upload problem.

Thanks,
TianYi ZHU


On 30 September 2013 16:37, Anastasis Andronidis <an...@hotmail.com> wrote:

> Hello again,
>
> any comments on this?
>
> Thanks,
> Anastasis

Re: Small files

Posted by Anastasis Andronidis <an...@hotmail.com>.
Hello again,

any comments on this?

Thanks,
Anastasis
