Posted to hdfs-user@hadoop.apache.org by Nathan Grice <ng...@gmail.com> on 2013/08/24 02:56:33 UTC

io.file.buffer.size different when not running in proper bash shell?

Thanks in advance for any help. I have been banging my head against the
wall on this one all day.
When I run the command

  hadoop fs -put /path/to/input /path/in/hdfs

from the command line, the hadoop shell dutifully copies my entire file correctly, no matter its size.


I wrote a web service client for an external service in Python, and I am
simply trying to replicate the same command after retrieving some
CSV-delimited results from the web service:

import logging
import subprocess

LOG = logging.getLogger(__name__)

# Same command that works from an interactive shell
cmd = ['hadoop', 'fs', '-put', '/path/to/input/', '/path/in/hdfs/']
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     bufsize=256 * 1024 * 1024)
output, errors = p.communicate()
if p.returncode:
    raise OSError(errors)
else:
    LOG.info(output)

Without fail, the hadoop shell only writes the first 4096 bytes of the input
file (which, according to the documentation, is the default value for
io.file.buffer.size).

I have tried almost everything, including adding
-Dio.file.buffer.size=XXXXXX (where XXXXXX is a really big number), and
NOTHING seems to work.
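
For example, one variation I tried looked roughly like this (the buffer size
value below is only a placeholder for the large numbers I experimented with):

# Generic '-D' options are passed before the FsShell command (-put);
# the value here is purely illustrative.
cmd = ['hadoop', 'fs', '-Dio.file.buffer.size=67108864',
       '-put', '/path/to/input/', '/path/in/hdfs/']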

Please help!

Re: io.file.buffer.size different when not running in proper bash shell?

Posted by Nathan Grice <ng...@gmail.com>.
Well, I finally solved this one on my own. It turns out the 4096 bytes was a
red herring; it also happens to be the default I/O write buffer size in Python
when writing to a file, and I was (stupidly) not flushing the buffer before
trying to copy the file to Hadoop. This was hard to chase down because when
the Python script exited, it flushed its buffer automatically on close of the
file handle, and thus the file size on the local filesystem was never 4096
bytes (it was always larger).
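
In other words, the fix was simply to make sure the local file was fully
flushed and closed before calling hadoop fs -put. A minimal sketch of the
corrected flow (the function and variable names here are just illustrative):

import subprocess

def write_and_put(csv_rows, local_path, hdfs_path):
    # Write the webservice results; the 'with' block closes (and therefore
    # flushes) the file before hadoop ever reads it.
    with open(local_path, 'w') as f:
        for row in csv_rows:
            f.write(row + '\n')
        f.flush()  # explicit flush; close() via 'with' would also do it

    # Only now does the external process see the complete file.
    subprocess.check_call(['hadoop', 'fs', '-put', local_path, hdfs_path])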


