Posted to user@sqoop.apache.org by Christian Prokopp <ch...@rangespan.com> on 2013/03/28 16:35:15 UTC

/tmp dir for import configurable?

Hi,

I am using sqoop to copy data from MySQL to S3:

(Sqoop 1.4.2-cdh4.2.0)
$ sqoop import --connect jdbc:mysql://server:port/db --username user
--password pass --table tablename --target-dir s3n://xyz@somewhere/a/b/c
--fields-terminated-by='\001' -m 1 --direct

My problem is that sqoop temporarily stores the data on /tmp, which is not
big enough for the data. I am unable to find a configuration option to
point sqoop to a bigger partition/disk. Any suggestions?

Cheers,
Christian

Re: /tmp dir for import configurable?

Posted by Christian Prokopp <ch...@rangespan.com>.
Hi Jarcec,

Perfect solution. Thank you very much!

Cheers,
Christian


On Sat, Apr 6, 2013 at 6:05 AM, Jarek Jarcec Cecho <ja...@apache.org> wrote:

> Hi Christian,
> thank you very much for sharing the log and please accept my apologies for
> late response.
>
> Closely looking into your exception, I can confirm that it's the S3 file
> system that is creating the files in /tmp and not Sqoop itself.
>
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> > OutputStream for key 'some_table/_SUCCESS' writing to tempfile '*
> > /tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp*'
>
> Taking a brief look into the source code [1], it seems that it's the
> method newBackupFile() defined on line 195 that is responsible for creating
> the temporary file. It also seems that its behaviour can be altered using
> the fs.s3.buffer.dir property. Would you mind trying it in your Sqoop
> execution?
>
>   sqoop import -Dfs.s3.buffer.dir=/custom/path ...
>
> I've also noticed that you're using the LocalJobRunner, which suggests that
> Sqoop is executing all jobs locally on your machine and not on your Hadoop
> cluster. I would recommend checking your Hadoop configuration if your
> intention is to run the data transfer in parallel.
>
> Jarcec
>
> Links:
> 1:
> http://hadoop.apache.org/docs/r2.0.3-alpha/api/src-html/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
>
> On Tue, Apr 02, 2013 at 11:38:35AM +0100, Christian Prokopp wrote:
> > Hi Jarcec,
> >
> > I am running the command on the CLI of a cluster node. It appears to run
> a
> > local MR job writing the results to /tmp before sending it to S3:
> >
> > [..]
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> > Beginning mysqldump fast path import
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> > Performing import of table image from database some_db
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> > Converting data to use specified delimiters.
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: (For
> > the fastest possible import, use
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> > --mysql-delimiters to specify the same field
> > [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> > delimiters as are used by mysqldump.)
> > [hostaddress] out: 13/04/02 01:52:54 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:52:55 INFO mapred.JobClient:  map 100%
> > reduce 0%
> > [hostaddress] out: 13/04/02 01:52:57 INFO mapred.LocalJobRunner:
> > [..]
> > [hostaddress] out: 13/04/02 01:53:03 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
> > Transfer loop complete.
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
> > Transferred 668.9657 MB in 113.0105 seconds (5.9195 MB/sec)
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:42 INFO s3native.NativeS3FileSystem:
> > OutputStream for key
> >
> 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
> > closed. Now beginning upload
> > [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:54:45 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:55:31 INFO s3native.NativeS3FileSystem:
> > OutputStream for key
> >
> 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
> > upload complete
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task:
> > Task:attempt_local555455791_0001_m_000000_0 is done. And is in the
> process
> > of commiting
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task
> > attempt_local555455791_0001_m_000000_0 is allowed to commit now
> > [hostaddress] out: 13/04/02 01:55:36 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:56:03 WARN output.FileOutputCommitter:
> > Failed to delete the temporary output directory of task:
> > attempt_local555455791_0001_m_000000_0 - s3n://secret@bucketsomewhere
> > /some_table/_temporary/_attempt_local555455791_0001_m_000000_0
> > [hostaddress] out: 13/04/02 01:56:03 INFO output.FileOutputCommitter:
> Saved
> > output of task 'attempt_local555455791_0001_m_000000_0' to
> > s3n://secret@bucketsomewhere/some_table
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.Task: Task
> > 'attempt_local555455791_0001_m_000000_0' done.
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
> Finishing
> > task: attempt_local555455791_0001_m_000000_0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Map task
> > executor complete.
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> > OutputStream for key 'some_table/_SUCCESS' writing to tempfile '*
> > /tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp*'
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> > OutputStream for key 'some_table/_SUCCESS' closed. Now beginning upload
> > [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> > OutputStream for key 'some_table/_SUCCESS' upload complete
> > [...deleting cached jars...]
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Job complete:
> > job_local555455791_0001
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Counters: 23
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   File System
> > Counters
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> > Number of bytes read=6471451
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> > Number of bytes written=6623109
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> > Number of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> > Number of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> > Number of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> > Number of bytes read=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> > Number of bytes written=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> > Number of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> > Number of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> > Number of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N:
> Number
> > of bytes read=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N:
> Number
> > of bytes written=773081963
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N:
> Number
> > of read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N:
> Number
> > of large read operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N:
> Number
> > of write operations=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   Map-Reduce
> > Framework
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map input
> > records=1
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map
> output
> > records=14324124
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Input
> split
> > bytes=87
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Spilled
> > Records=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     CPU time
> > spent (ms)=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Physical
> > memory (bytes) snapshot=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Virtual
> > memory (bytes) snapshot=0
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Total
> > committed heap usage (bytes)=142147584
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
> > Transferred 0 bytes in 201.4515 seconds (0 bytes/sec)
> > [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
> > Retrieved 14324124 records.
> >
> > On Thu, Mar 28, 2013 at 9:49 PM, Jarek Jarcec Cecho <jarcec@apache.org
> >wrote:
> >
> > > Hi Christian,
> > > would you mind describing a bit more the behaviour you're observing?
> > >
> > > Sqoop should be touching /tmp only on the machine where you've executed
> > > it, for generating and compiling code (<1MB!). The data transfer itself
> > > is done on your Hadoop cluster from within a mapreduce job and the
> > > output is stored directly in your destination folder. I'm not familiar
> > > with the S3 file system implementation, but could it be the S3 library
> > > that is storing the data in /tmp?
> > >
> > > Jarcec
> > >
> > > On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> > > > Thanks for the idea Alex. I considered this but that would mean I
> have to
> > > > change my cluster setup for sqoop (last resort solution). I'd very
> much
> > > > rather point sqoop to existing large disks.
> > > >
> > > > Cheers,
> > > > Christian
> > > >
> > > >
> > > > On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz <
> > > wget.null@gmail.com
> > > > > wrote:
> > > >
> > > > > You could mount a bigger disk into /tmp - or symlink /tmp to another
> > > > > directory which has enough space.
> > > > >
> > > > > Best
> > > > > - Alex
> > > > >
> > > > > On Mar 28, 2013, at 4:35 PM, Christian Prokopp <
> > > christian@rangespan.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am using sqoop to copy data from MySQL to S3:
> > > > > >
> > > > > > (Sqoop 1.4.2-cdh4.2.0)
> > > > > > $ sqoop import --connect jdbc:mysql://server:port/db --username
> user
> > > > > --password pass --table tablename --target-dir s3n://xyz@somewhere
> > > /a/b/c
> > > > > --fields-terminated-by='\001' -m 1 --direct
> > > > > >
> > > > > > My problem is that sqoop temporarily stores the data on /tmp,
> which
> > > is
> > > > > not big enough for the data. I am unable to find a configuration
> > > option to
> > > > > point sqoop to a bigger partition/disk. Any suggestions?
> > > > > >
> > > > > > Cheers,
> > > > > > Christian
> > > > > >
> > > > >
> > > > > --
> > > > > Alexander Alten-Lorenz
> > > > > http://mapredit.blogspot.com
> > > > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > *Christian Prokopp*
> > > > Data Scientist, PhD
> > > > Rangespan Ltd. <http://www.rangespan.com/>
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> > *Christian Prokopp*
> > Data Scientist, PhD
> > Rangespan Ltd. <http://www.rangespan.com/>
>



-- 
Best regards,

*Christian Prokopp*
Data Scientist, PhD
Rangespan Ltd. <http://www.rangespan.com/>

Re: /tmp dir for import configurable?

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Christian,
thank you very much for sharing the log and please accept my apologies for late response. 

Closely looking into your exception, I can confirm that it's the S3 file system that is creating the files in /tmp and not Sqoop itself.

> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' writing to tempfile '*
> /tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp*'

Taking a brief look into the source code [1], it seems that it's the method newBackupFile() defined on line 195 that is responsible for creating the temporary file. It also seems that its behaviour can be altered using the fs.s3.buffer.dir property. Would you mind trying it in your Sqoop execution?

  sqoop import -Dfs.s3.buffer.dir=/custom/path ...
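
Applied to the command from your original mail it would look something like
this (the /data/tmp/s3 path is only an example; note that the generic -D
options have to come before the Sqoop-specific arguments):

  # /data/tmp/s3 is a placeholder; point it at a partition with enough free space
  $ sqoop import -Dfs.s3.buffer.dir=/data/tmp/s3 \
      --connect jdbc:mysql://server:port/db --username user --password pass \
      --table tablename --target-dir s3n://xyz@somewhere/a/b/c \
      --fields-terminated-by='\001' -m 1 --direct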

I've also noticed that you're using the LocalJobRunner, which suggests that Sqoop is executing all jobs locally on your machine and not on your Hadoop cluster. I would recommend checking your Hadoop configuration if your intention is to run the data transfer in parallel.
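
A quick way to check which runner will be used (assuming the usual
configuration layout under /etc/hadoop/conf; adjust the path to your
installation) is to look at the job runner properties; a value of "local"
means the LocalJobRunner:

  # MRv1: mapred.job.tracker should name the JobTracker, not "local"
  $ grep -A1 'mapred.job.tracker' /etc/hadoop/conf/mapred-site.xml
  # MRv2/YARN: mapreduce.framework.name should be "yarn", not "local"
  $ grep -A1 'mapreduce.framework.name' /etc/hadoop/conf/mapred-site.xml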

Jarcec

Links:
1: http://hadoop.apache.org/docs/r2.0.3-alpha/api/src-html/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html

On Tue, Apr 02, 2013 at 11:38:35AM +0100, Christian Prokopp wrote:
> Hi Jarcec,
> 
> I am running the command on the CLI of a cluster node. It appears to run a
> local MR job writing the results to /tmp before sending it to S3:
> 
> [..]
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Beginning mysqldump fast path import
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Performing import of table image from database some_db
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Converting data to use specified delimiters.
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: (For
> the fastest possible import, use
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> --mysql-delimiters to specify the same field
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> delimiters as are used by mysqldump.)
> [hostaddress] out: 13/04/02 01:52:54 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:52:55 INFO mapred.JobClient:  map 100%
> reduce 0%
> [hostaddress] out: 13/04/02 01:52:57 INFO mapred.LocalJobRunner:
> [..]
> [hostaddress] out: 13/04/02 01:53:03 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
> Transfer loop complete.
> [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
> Transferred 668.9657 MB in 113.0105 seconds (5.9195 MB/sec)
> [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:42 INFO s3native.NativeS3FileSystem:
> OutputStream for key
> 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
> closed. Now beginning upload
> [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:45 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:55:31 INFO s3native.NativeS3FileSystem:
> OutputStream for key
> 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
> upload complete
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task:
> Task:attempt_local555455791_0001_m_000000_0 is done. And is in the process
> of commiting
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task
> attempt_local555455791_0001_m_000000_0 is allowed to commit now
> [hostaddress] out: 13/04/02 01:55:36 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:56:03 WARN output.FileOutputCommitter:
> Failed to delete the temporary output directory of task:
> attempt_local555455791_0001_m_000000_0 - s3n://secret@bucketsomewhere
> /some_table/_temporary/_attempt_local555455791_0001_m_000000_0
> [hostaddress] out: 13/04/02 01:56:03 INFO output.FileOutputCommitter: Saved
> output of task 'attempt_local555455791_0001_m_000000_0' to
> s3n://secret@bucketsomewhere/some_table
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.Task: Task
> 'attempt_local555455791_0001_m_000000_0' done.
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Finishing
> task: attempt_local555455791_0001_m_000000_0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Map task
> executor complete.
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' writing to tempfile '*
> /tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp*'
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' closed. Now beginning upload
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' upload complete
> [...deleting cached jars...]
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Job complete:
> job_local555455791_0001
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Counters: 23
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   File System
> Counters
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of bytes read=6471451
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of bytes written=6623109
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of bytes read=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of bytes written=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of bytes read=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of bytes written=773081963
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   Map-Reduce
> Framework
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map input
> records=1
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map output
> records=14324124
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Input split
> bytes=87
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Spilled
> Records=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     CPU time
> spent (ms)=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Physical
> memory (bytes) snapshot=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Virtual
> memory (bytes) snapshot=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Total
> committed heap usage (bytes)=142147584
> [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
> Transferred 0 bytes in 201.4515 seconds (0 bytes/sec)
> [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
> Retrieved 14324124 records.
> 
> On Thu, Mar 28, 2013 at 9:49 PM, Jarek Jarcec Cecho <ja...@apache.org> wrote:
> 
> > Hi Christian,
> > would you mind describing a bit more the behaviour you're observing?
> >
> > Sqoop should be touching /tmp only on the machine where you've executed it,
> > for generating and compiling code (<1MB!). The data transfer itself is done
> > on your Hadoop cluster from within a mapreduce job and the output is stored
> > directly in your destination folder. I'm not familiar with the S3 file
> > system implementation, but could it be the S3 library that is storing the
> > data in /tmp?
> >
> > Jarcec
> >
> > On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> > > Thanks for the idea Alex. I considered this but that would mean I have to
> > > change my cluster setup for sqoop (last resort solution). I'd very much
> > > rather point sqoop to existing large disks.
> > >
> > > Cheers,
> > > Christian
> > >
> > >
> > > On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz <
> > wget.null@gmail.com
> > > > wrote:
> > >
> > > > You could mount a bigger disk into /tmp - or symlink /tmp to another
> > > > directory which has enough space.
> > > >
> > > > Best
> > > > - Alex
> > > >
> > > > On Mar 28, 2013, at 4:35 PM, Christian Prokopp <
> > christian@rangespan.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am using sqoop to copy data from MySQL to S3:
> > > > >
> > > > > (Sqoop 1.4.2-cdh4.2.0)
> > > > > $ sqoop import --connect jdbc:mysql://server:port/db --username user
> > > > --password pass --table tablename --target-dir s3n://xyz@somewhere
> > /a/b/c
> > > > --fields-terminated-by='\001' -m 1 --direct
> > > > >
> > > > > My problem is that sqoop temporarily stores the data on /tmp, which
> > is
> > > > not big enough for the data. I am unable to find a configuration
> > option to
> > > > point sqoop to a bigger partition/disk. Any suggestions?
> > > > >
> > > > > Cheers,
> > > > > Christian
> > > > >
> > > >
> > > > --
> > > > Alexander Alten-Lorenz
> > > > http://mapredit.blogspot.com
> > > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > *Christian Prokopp*
> > > Data Scientist, PhD
> > > Rangespan Ltd. <http://www.rangespan.com/>
> >
> 
> 
> 
> -- 
> Best regards,
> 
> *Christian Prokopp*
> Data Scientist, PhD
> Rangespan Ltd. <http://www.rangespan.com/>

Re: /tmp dir for import configurable?

Posted by Christian Prokopp <ch...@rangespan.com>.
Hi Jarcec,

I am running the command on the CLI of a cluster node. It appears to run a
local MR job writing the results to /tmp before sending it to S3:

[..]
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
Beginning mysqldump fast path import
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
Performing import of table image from database some_db
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
Converting data to use specified delimiters.
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: (For
the fastest possible import, use
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
--mysql-delimiters to specify the same field
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
delimiters as are used by mysqldump.)
[hostaddress] out: 13/04/02 01:52:54 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:52:55 INFO mapred.JobClient:  map 100%
reduce 0%
[hostaddress] out: 13/04/02 01:52:57 INFO mapred.LocalJobRunner:
[..]
[hostaddress] out: 13/04/02 01:53:03 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
Transfer loop complete.
[hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
Transferred 668.9657 MB in 113.0105 seconds (5.9195 MB/sec)
[hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:54:42 INFO s3native.NativeS3FileSystem:
OutputStream for key
'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
closed. Now beginning upload
[hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:54:45 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:55:31 INFO s3native.NativeS3FileSystem:
OutputStream for key
'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
upload complete
[hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task:
Task:attempt_local555455791_0001_m_000000_0 is done. And is in the process
of commiting
[hostaddress] out: 13/04/02 01:55:31 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task
attempt_local555455791_0001_m_000000_0 is allowed to commit now
[hostaddress] out: 13/04/02 01:55:36 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:56:03 WARN output.FileOutputCommitter:
Failed to delete the temporary output directory of task:
attempt_local555455791_0001_m_000000_0 - s3n://secret@bucketsomewhere
/some_table/_temporary/_attempt_local555455791_0001_m_000000_0
[hostaddress] out: 13/04/02 01:56:03 INFO output.FileOutputCommitter: Saved
output of task 'attempt_local555455791_0001_m_000000_0' to
s3n://secret@bucketsomewhere/some_table
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.Task: Task
'attempt_local555455791_0001_m_000000_0' done.
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Finishing
task: attempt_local555455791_0001_m_000000_0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Map task
executor complete.
[hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
OutputStream for key 'some_table/_SUCCESS' writing to tempfile '*
/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp*'
[hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
OutputStream for key 'some_table/_SUCCESS' closed. Now beginning upload
[hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
OutputStream for key 'some_table/_SUCCESS' upload complete
[...deleting cached jars...]
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Job complete:
job_local555455791_0001
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Counters: 23
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   File System
Counters
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of bytes read=6471451
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of bytes written=6623109
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of large read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of write operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of bytes read=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of bytes written=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of large read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of write operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of bytes read=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of bytes written=773081963
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of large read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of write operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   Map-Reduce
Framework
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map input
records=1
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map output
records=14324124
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Input split
bytes=87
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Spilled
Records=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     CPU time
spent (ms)=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Physical
memory (bytes) snapshot=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Virtual
memory (bytes) snapshot=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Total
committed heap usage (bytes)=142147584
[hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
Transferred 0 bytes in 201.4515 seconds (0 bytes/sec)
[hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
Retrieved 14324124 records.

On Thu, Mar 28, 2013 at 9:49 PM, Jarek Jarcec Cecho <ja...@apache.org> wrote:

> Hi Christian,
> would you mind describing a bit more the behaviour you're observing?
>
> Sqoop should be touching /tmp only on the machine where you've executed it,
> for generating and compiling code (<1MB!). The data transfer itself is done
> on your Hadoop cluster from within a mapreduce job and the output is stored
> directly in your destination folder. I'm not familiar with the S3 file system
> implementation, but could it be the S3 library that is storing the data in
> /tmp?
>
> Jarcec
>
> On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> > Thanks for the idea Alex. I considered this but that would mean I have to
> > change my cluster setup for sqoop (last resort solution). I'd very much
> > rather point sqoop to existing large disks.
> >
> > Cheers,
> > Christian
> >
> >
> > On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz <
> wget.null@gmail.com
> > > wrote:
> >
> > > You could mount a bigger disk into /tmp - or symlink /tmp to another
> > > directory which has enough space.
> > >
> > > Best
> > > - Alex
> > >
> > > On Mar 28, 2013, at 4:35 PM, Christian Prokopp <
> christian@rangespan.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am using sqoop to copy data from MySQL to S3:
> > > >
> > > > (Sqoop 1.4.2-cdh4.2.0)
> > > > $ sqoop import --connect jdbc:mysql://server:port/db --username user
> > > --password pass --table tablename --target-dir s3n://xyz@somewhere
> /a/b/c
> > > --fields-terminated-by='\001' -m 1 --direct
> > > >
> > > > My problem is that sqoop temporarily stores the data on /tmp, which
> is
> > > not big enough for the data. I am unable to find a configuration
> option to
> > > point sqoop to a bigger partition/disk. Any suggestions?
> > > >
> > > > Cheers,
> > > > Christian
> > > >
> > >
> > > --
> > > Alexander Alten-Lorenz
> > > http://mapredit.blogspot.com
> > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > >
> > >
> >
> >
> > --
> > Best regards,
> >
> > *Christian Prokopp*
> > Data Scientist, PhD
> > Rangespan Ltd. <http://www.rangespan.com/>
>



-- 
Best regards,

*Christian Prokopp*
Data Scientist, PhD
Rangespan Ltd. <http://www.rangespan.com/>

Re: /tmp dir for import configurable?

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Christian,
would you mind describing a bit more the behaviour you're observing? 

Sqoop should be touching /tmp only on the machine where you've executed it, for generating and compiling code (<1MB!). The data transfer itself is done on your Hadoop cluster from within a mapreduce job and the output is stored directly in your destination folder. I'm not familiar with the S3 file system implementation, but could it be the S3 library that is storing the data in /tmp?
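
For what it's worth, even that small code-generation output can be pointed
somewhere other than /tmp using Sqoop's --outdir and --bindir options, for
example (the paths are only placeholders):

  # --outdir: generated Java sources, --bindir: compiled classes and jar
  $ sqoop import --outdir /data/sqoop/gen --bindir /data/sqoop/bin ...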

Jarcec

On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> Thanks for the idea Alex. I considered this but that would mean I have to
> change my cluster setup for sqoop (last resort solution). I'd very much
> rather point sqoop to existing large disks.
> 
> Cheers,
> Christian
> 
> 
> On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz <wget.null@gmail.com
> > wrote:
> 
> > You could mount a bigger disk into /tmp - or symlink /tmp to another
> > directory which has enough space.
> >
> > Best
> > - Alex
> >
> > On Mar 28, 2013, at 4:35 PM, Christian Prokopp <ch...@rangespan.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I am using sqoop to copy data from MySQL to S3:
> > >
> > > (Sqoop 1.4.2-cdh4.2.0)
> > > $ sqoop import --connect jdbc:mysql://server:port/db --username user
> > --password pass --table tablename --target-dir s3n://xyz@somewhere/a/b/c
> > --fields-terminated-by='\001' -m 1 --direct
> > >
> > > My problem is that sqoop temporarily stores the data on /tmp, which is
> > not big enough for the data. I am unable to find a configuration option to
> > point sqoop to a bigger partition/disk. Any suggestions?
> > >
> > > Cheers,
> > > Christian
> > >
> >
> > --
> > Alexander Alten-Lorenz
> > http://mapredit.blogspot.com
> > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> >
> >
> 
> 
> -- 
> Best regards,
> 
> *Christian Prokopp*
> Data Scientist, PhD
> Rangespan Ltd. <http://www.rangespan.com/>

Re: /tmp dir for import configurable?

Posted by Christian Prokopp <ch...@rangespan.com>.
Thanks for the idea Alex. I considered this but that would mean I have to
change my cluster setup for sqoop (last resort solution). I'd very much
rather point sqoop to existing large disks.

Cheers,
Christian


On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz <wget.null@gmail.com
> wrote:

> You could mount a bigger disk into /tmp - or symlink /tmp to another
> directory which has enough space.
>
> Best
> - Alex
>
> On Mar 28, 2013, at 4:35 PM, Christian Prokopp <ch...@rangespan.com>
> wrote:
>
> > Hi,
> >
> > I am using sqoop to copy data from MySQL to S3:
> >
> > (Sqoop 1.4.2-cdh4.2.0)
> > $ sqoop import --connect jdbc:mysql://server:port/db --username user
> > --password pass --table tablename --target-dir s3n://xyz@somewhere/a/b/c
> --fields-terminated-by='\001' -m 1 --direct
> >
> > My problem is that sqoop temporarily stores the data on /tmp, which is
> not big enough for the data. I am unable to find a configuration option to
> point sqoop to a bigger partition/disk. Any suggestions?
> >
> > Cheers,
> > Christian
> >
>
> --
> Alexander Alten-Lorenz
> http://mapredit.blogspot.com
> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>
>


-- 
Best regards,

*Christian Prokopp*
Data Scientist, PhD
Rangespan Ltd. <http://www.rangespan.com/>

Re: /tmp dir for import configurable?

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
You could mount a bigger disk into /tmp - or symlink /tmp to another directory which has enough space.
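
A rough sketch of the mount approach, using a bind mount (the /data/tmp path
is only an example; do this while nothing is writing to /tmp):

  # create a directory on the large volume with /tmp-style permissions
  $ sudo mkdir -p /data/tmp && sudo chmod 1777 /data/tmp
  # make /tmp resolve to it (add an fstab entry to persist across reboots)
  $ sudo mount --bind /data/tmp /tmp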

Best
- Alex

On Mar 28, 2013, at 4:35 PM, Christian Prokopp <ch...@rangespan.com> wrote:

> Hi,
> 
> I am using sqoop to copy data from MySQL to S3:
> 
> (Sqoop 1.4.2-cdh4.2.0)
> $ sqoop import --connect jdbc:mysql://server:port/db --username user --password pass --table tablename --target-dir s3n://xyz@somewhere/a/b/c --fields-terminated-by='\001' -m 1 --direct
> 
> My problem is that sqoop temporarily stores the data on /tmp, which is not big enough for the data. I am unable to find a configuration option to point sqoop to a bigger partition/disk. Any suggestions?
> 
> Cheers,
> Christian
> 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF