Posted to common-user@hadoop.apache.org by tim robertson <ti...@gmail.com> on 2009/04/14 11:35:01 UTC

Generating many small PNGs to Amazon S3 with MapReduce

Hi all,

I am currently processing a lot of raw CSV data and producing a
summary text file which I load into MySQL.  On top of this I have a
PHP application that generates tiles for Google Maps (sample tile:
http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
Here is a (dev server) example of the final map client:
http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
dynamic grids as you zoom are all pre-calculated.

I am considering (for better throughput, as maps generate huge request
volumes) pre-generating all my tiles (PNG) and storing them in S3 with
CloudFront.  There will be billions of PNGs produced, each at 1-3KB.

Could someone please recommend the best place to generate the PNGs and
when to push them to S3 in a MR system?
If I did the PNG generation and upload to S3 in the reduce, the same
task running on multiple machines would compete with each other,
right?  Should I generate the PNGs to a local directory and then, on
task success, push the lot up?  I am assuming billions of 1-3KB files
on HDFS is not a good idea.

I will use EC2 for the MR for the time being, but this will be moved
to a local cluster still pushing to S3...

Cheers,

Tim

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by tim robertson <ti...@gmail.com>.
> However, do the math on the costs for S3. We were doing something similar,
> and found that we were spending a fortune on our PUT requests at $0.01 per
> 1,000, and next to nothing on storage. I've since moved to a more
> complicated model where I pack many small items in each object and store
> an index in SimpleDB. You'll need to partition your SimpleDBs if you do
> this.

Thanks a lot to Kevin for this - I stupidly overlooked the S3 PUT
cost, thinking EC2->S3 transfer was free without realising that there
is still a per-request PUT cost...

I will reconsider: I'll look at copying your approach and compare it
with a few rendering EC2 instances running off MySQL or similar.

Thanks again.

Tim

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by tim robertson <ti...@gmail.com>.
If anyone is interested, I did finally get round to processing it all.
Due to the sparsity of the data we have, for all 23 zoom levels and
all species we have information on, the result was 807 million PNGs,
which is roughly $8,000 to PUT to S3 (807M PUTs at $0.01 per 1,000 is
about $8,070) - too much for me to pay.

So, like most things, I will probably go for a compromise: pre-process
10 zoom levels into S3, which will only come in at $457 (only the PUT
into S3), and then render the rest on the fly.  Only people browsing
beyond zoom 10 are then hitting the real-time rendering servers, so I
think this will work out OK performance-wise.

Cheers,

Tim


Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by Stuart Sierra <th...@gmail.com>.
On Thu, Apr 23, 2009 at 5:02 PM, Andrew Hitchcock <ad...@gmail.com> wrote:
> 1 billion * ($0.01 / 1000) = 10,000

Oh yeah, I was thinking $0.01 for a single PUT.  Silly me.

-S

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by Andrew Hitchcock <ad...@gmail.com>.
How do you figure? PUTs are one penny per thousand, so I think it'd
only cost $10,000. Here's the math I'm using:

1 billion * ($0.01 / 1000) = 10,000
Math courtesy of Google:
http://www.google.com/search?q=1+billion+*+(0.01+%2F+1000)

Still expensive, but not unreasonably so.

Andrew

On Thu, Apr 23, 2009 at 7:08 AM, Stuart Sierra
<th...@gmail.com> wrote:
> On Wed, Apr 15, 2009 at 8:21 PM, Kevin Peterson <kp...@biz360.com> wrote:
>> However, do the math on the costs for S3. We were doing something similar,
>> and found that we were spending a fortune on our PUT requests at $0.01 per
>> 1,000, and next to nothing on storage.
>
> I made a similar discovery.  The cost of PUT adds up fast.  One
> billion PUTs will cost you $10 million!
>
> -Stuart Sierra
>

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by Stuart Sierra <th...@gmail.com>.
On Wed, Apr 15, 2009 at 8:21 PM, Kevin Peterson <kp...@biz360.com> wrote:
> However, do the math on the costs for S3. We were doing something similar,
> and found that we were spending a fortune on our PUT requests at $0.01 per
> 1,000, and next to nothing on storage.

I made a similar discovery.  The cost of PUT adds up fast.  One
billion PUTs will cost you $10 million!

-Stuart Sierra

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by tim robertson <ti...@gmail.com>.
Thanks Kevin,

"... well, you're doing it wrong." This is what I'm afraid of :o)

I know that multiple TaskTrackers can, for example, run map tasks over
the same part of the input file, but I am not so sure about the
reduce.  In the reduce, will the same keys be run on multiple machines
in competition?




On Thu, Apr 16, 2009 at 2:21 AM, Kevin Peterson <kp...@biz360.com> wrote:
> On Tue, Apr 14, 2009 at 2:35 AM, tim robertson <ti...@gmail.com> wrote:
>
>>
>> I am considering (for better throughput, as maps generate huge request
>> volumes) pre-generating all my tiles (PNG) and storing them in S3 with
>> CloudFront.  There will be billions of PNGs produced, each at 1-3KB.
>>
>
> Storing billions of PNGs, each at 1-3KB, into S3 will be perfectly fine;
> there is no need to generate them first and then push them all at once, if
> you are storing each in its own S3 object (which they must be, if you
> intend to fetch them using CloudFront). Each S3 object key is unique, and
> objects can be written fully in parallel. If you are writing to the same
> S3 object twice, ... well, you're doing it wrong.
>
> However, do the math on the costs for S3. We were doing something similar,
> and found that we were spending a fortune on our PUT requests at $0.01 per
> 1,000, and next to nothing on storage. I've since moved to a more
> complicated model where I pack many small items in each object and store
> an index in SimpleDB. You'll need to partition your SimpleDBs if you do
> this.
>

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by Kevin Peterson <kp...@biz360.com>.
On Tue, Apr 14, 2009 at 2:35 AM, tim robertson <ti...@gmail.com> wrote:

>
> I am considering (for better throughput, as maps generate huge request
> volumes) pre-generating all my tiles (PNG) and storing them in S3 with
> CloudFront.  There will be billions of PNGs produced, each at 1-3KB.
>

Storing billions of PNGs, each at 1-3KB, into S3 will be perfectly fine;
there is no need to generate them first and then push them all at once, if
you are storing each in its own S3 object (which they must be, if you
intend to fetch them using CloudFront). Each S3 object key is unique, and
objects can be written fully in parallel. If you are writing to the same S3
object twice, ... well, you're doing it wrong.

However, do the math on the costs for S3. We were doing something similar,
and found that we were spending a fortune on our PUT requests at $0.01 per
1,000, and next to nothing on storage. I've since moved to a more
complicated model where I pack many small items in each object and store an
index in SimpleDB. You'll need to partition your SimpleDBs if you do this.
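
To make the packing idea concrete, here is a minimal sketch (the
IndexEntry layout and pack() helper below are illustrative assumptions,
not from any particular library; the actual S3 PUT and SimpleDB writes
are left to whatever client you use, e.g. JetS3t and typica):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class TilePacker {

    /** Where one tile lives inside a packed S3 object. */
    static class IndexEntry {
        final String s3Key;
        final long offset;
        final int length;
        IndexEntry(String s3Key, long offset, int length) {
            this.s3Key = s3Key;
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Concatenates many small PNGs into one blob (one S3 PUT instead
     * of thousands) and records each tile's byte range for the index.
     */
    static Map<String, IndexEntry> pack(Map<String, byte[]> tiles,
                                        String s3Key,
                                        ByteArrayOutputStream blob)
            throws IOException {
        Map<String, IndexEntry> index =
            new LinkedHashMap<String, IndexEntry>();
        long offset = 0;
        for (Map.Entry<String, byte[]> e : tiles.entrySet()) {
            byte[] png = e.getValue();
            blob.write(png);
            index.put(e.getKey(),
                      new IndexEntry(s3Key, offset, png.length));
            offset += png.length;
        }
        return index;
    }
}

The caller PUTs the blob to S3 once and writes each index entry
(tileId -> s3Key/offset/length) to SimpleDB; the serving side looks up
the entry and issues an HTTP ranged GET (Range: bytes from offset to
offset+length-1) against the packed object.  The win is one PUT per
packed object instead of one per tile.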

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Apr 16, 2009 at 1:27 AM, tim robertson <ti...@gmail.com> wrote:

>
> What is not 100% clear to me is when to push to S3:
> In the map I will output TileId-ZoomLevel-SpeciesId as the key,
> along with the count, and in the reduce I group the counts into larger
> tiles and create the PNG.  I could write to a SequenceFile here... but
> I suspect I could just push to the S3 bucket here also - as long as
> the TaskTracker does not send the same keys to multiple reduce tasks
> - my Hadoop naivety showing here (I wrote an in-memory threaded
> MapReduceLite which does not run competing reducers, but I have not
> got into the Hadoop code quite so much yet).
>
>
Hi Tim,

If I understand what you mean by "compete reducers", then you're referring
to the feature called "speculative execution", in which Hadoop schedules
the same task on multiple TaskTrackers. When one of the multiply-scheduled
attempts finishes, the others are killed. As you seem to already
understand, this might cause issues if your tasks have non-idempotent side
effects on the outside world.

The configuration variable you need to look at is
mapred.reduce.tasks.speculative.execution. If this is set to false, only
one attempt of each reduce task will be scheduled at a time. If it is
true, some reduce tasks may be scheduled twice, to reduce variance in job
completion times caused by slow machines.

There's an equivalent configuration variable
mapred.map.tasks.speculative.execution that controls this behavior for your
map tasks.
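
If it helps, here is roughly what that looks like in code (a sketch
using the old-style JobConf; TileJob is a placeholder for your job
class):

import org.apache.hadoop.mapred.JobConf;

public class TileJob {
    public static JobConf createConf() {
        JobConf conf = new JobConf(TileJob.class);
        // With speculative execution off, only one attempt of each
        // reduce task is scheduled at a time, so side effects such as
        // S3 uploads are not duplicated by speculative copies.
        conf.setBoolean("mapred.reduce.tasks.speculative.execution",
                        false);
        conf.setBoolean("mapred.map.tasks.speculative.execution",
                        false);
        return conf;
    }
}

(Note that a task which fails partway through is still retried, so
uploads should ideally be idempotent anyway - and PUTting the same
bytes to the same S3 key is.)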

Hope that helps,
-Todd

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by tim robertson <ti...@gmail.com>.
Thanks Todd and Chuck - sorry, my terminology was wrong... exactly
what I was looking for.

I am letting MySQL churn through the zoom levels now to get some
final numbers on the tiles and the S3 PUT cost.  Looks like zoom
level 8 is feasible for our current data volume, but it is not a
long-term option if the input data explodes in volume.

Cheers,

Tim



On Thu, Apr 16, 2009 at 9:05 PM, Chuck Lam <ch...@gmail.com> wrote:
> ar.. i totally missed the point you had said about "compete reducers". it
> didn't occur to me that you were talking about hadoop's speculative
> execution. todd's solution to turn off speculative execution is correct.
>
> i'll respond to the rest of your email later today.
>
>
>
> On Thu, Apr 16, 2009 at 5:23 AM, tim robertson <ti...@gmail.com>
> wrote:
>>
>> Thanks Chuck,
>>
>> > I'm shooting for finishing the case studies by the end of May, but
>> > it'll be nice to have a draft done by mid-May so we can edit it to
>> > have a consistent style with the other case studies.
>>
>> I will do what I can!
>>
>> > I read your blog and found a couple of posts on spatial joining. It
>> > wasn't clear to me from reading the posts whether the work was just
>> > experimental or if it led to some application. If it led to an
>> > application, then we may incorporate that into the case study too.
>>
>> It led to http://widgets.gbif.org/test/PACountry.html#/area/2571 which
>> shows a statistical summary for our data (latitude/longitude)
>> cross-referenced with the polygons of the protected areas of the
>> world.  In truth though, we processed it in both PostGIS and Hadoop
>> and found that the PostGIS approach, while way slower, was fine for
>> now, and we developed the scripts for it more quickly.  So you can
>> say it was experimental... I do have ambitions to do a basic
>> geospatial join (points in polygons) for Pig, CloudBase or Hive 2.0,
>> but alas have not found time.  Also - the blog is always a
>> late-Sunday-night effort, so it really is not written well.
>>
>> > BTW, where in the US are you traveling to? I'm in Silicon Valley, so
>> > maybe we can meet up if you happen to be in the area and can squeeze
>> > a little time out.
>>
>> Would have loved to... but I am in Boston and DC this time.  In a few
>> weeks I will be in Chicago, but for some reason I have never made it
>> over to your neck of the woods.
>>
>> > I don't know what data you need to produce a single PNG file, so I don't
>> > know whether having map output TileId-ZoomLevel-SpeciesId as key is the
>> > right factoring. To me it looks like each PNG represents one tile at one
>> > zoom level but includes multiple species.
>>
>> We do individual species and higher levels of taxa (up to all data).
>> This is all data, grouped into 1x1 degree cells (think 100x100 km)
>> with counts.  It is currently preprocessed with MySQL, but is another
>> Hadoop candidate as we grow.
>>
>> http://maps.gbif.org/mapserver/draw.pl?dtype=box&imgonly=1&path=http%3A%2F%2Fdata.gbif.org%2Fmaplayer%2Ftaxon%2F13140803&extent=-180.0+-90.0+180.0+90.0&mode=browse&refresh=Refresh&layer=countryborders
>>
>> > In any case, under Hadoop/MapReduce, all key/value pairs outputted
>> > by the mappers are grouped by key before being sent to the reducer,
>> > so it's guaranteed that the same key will not go to multiple
>> > reducers.
>>
>> That is good to know.  I knew map tasks could get run on multiple
>> machines when the framework detects idle capacity, but I wasn't sure
>> whether Hadoop would also put reducers on machines to compete against
>> each other and kill the ones that did not finish first.
>>
>> > You may also want to think more about the actual volume and cost of
>> > all this. You initially said that you will have "billions of PNGs
>> > produced each at 1-3KB" but then later said the data size is only a
>> > few 100GB due to sparsity. Either you're not really creating
>> > billions of PNGs, or a lot of them are actually less than 1KB. Kevin
>> > brought up a good point that S3 charges $0.01 for every 1,000 files
>> > ("objects") created, so generating 1 billion files will already set
>> > you back $10K plus storage cost (and transfer cost if you're not
>> > using EC2).
>>
>> Right - my bad... Having not processed it all, I am not 100% sure yet
>> what the size will be or to what zoom level I will preprocess.
>> The challenge is that our data is growing continuously, so billions of
>> PNGs is where we looked to be heading in the coming months.  Sorry for
>> the contradiction.
>>
>> You have clearly spotted that I am doing this as a project on the side
>> (evenings really) and not devoting enough time to it!!!  By day I am
>> still on MySQL and PostGIS, but I am hitting limits and looking to our
>> scalability.
>> I kind of overlooked the PUT cost on S3, stupidly thinking that
>> EC2->S3 was free.
>>
>> I actually have the data processed for species only, using MySQL
>> (http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800), but
>> not the higher groupings of species (families of species, etc.).  It
>> could be that I end up only processing the summary data in Hadoop and
>> then loading it back into a light DB to render the maps in real time,
>> like the link I just provided.  Tiles render in around 150ms, so with
>> some hardware we could probably scale....
>>
>> Thanks for your input - I appreciate it a lot, since I'm working
>> mostly alone on the processing.
>>
>> Cheers,
>>
>> Tim
>>
>> [...]

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by tim robertson <ti...@gmail.com>.
Hi Chuck,

Thank you very much for this opportunity.  I also think it is a nice
case study; it goes beyond the typical wordcount example by generating
something that people can actually see and play with immediately
afterwards (e.g. maps).  It also nicely showcases the community
effort to collectively bring together information on the world's
biodiversity - the GBIF network really is a nice example of a free and
open-access community that is collectively addressing interoperability
globally.  Can you please tell me what kind of time frame you would
need the case study in?

I have just got my Java PNG generation code down to 130ms on the
Mac, so I am pretty much ready to start running on EC2 and do the
volume tile generation; I will blog the whole thing on
http://biodivertido.blogspot.com at some point soon.  I have to travel
to the US on Saturday for a week, so this will delay it somewhat.

What is not 100% clear to me is when to push to S3:
In the map I will output TileId-ZoomLevel-SpeciesId as the key,
along with the count, and in the reduce I group the counts into larger
tiles and create the PNG.  I could write to a SequenceFile here... but
I suspect I could just push to the S3 bucket here also - as long as
the TaskTracker does not send the same keys to multiple reduce tasks
- my Hadoop naivety showing here (I wrote an in-memory threaded
MapReduceLite which does not run competing reducers, but I have not
got into the Hadoop code quite so much yet).
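
For concreteness, here is a minimal sketch of the reduce side I have in
mind (old-style Hadoop API; I am assuming the mapper emits values of
the form "cellId<TAB>count", and renderTile() is a hypothetical
stand-in for my PNG generation code):

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/**
 * Called once per TileId-ZoomLevel-SpeciesId key; gathers the per-cell
 * counts for that tile and emits the rendered PNG bytes.
 */
public class TileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, BytesWritable> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, BytesWritable> output,
                     Reporter reporter) throws IOException {
    // Assumed value format: "cellId<TAB>count" emitted by the mapper.
    Map<String, Integer> cellCounts = new HashMap<String, Integer>();
    while (values.hasNext()) {
      String[] parts = values.next().toString().split("\t");
      String cell = parts[0];
      int count = Integer.parseInt(parts[1]);
      Integer prev = cellCounts.get(cell);
      cellCounts.put(cell, prev == null ? count : prev + count);
    }
    // renderTile() paints the grid cells for this tile and returns the
    // encoded image bytes (placeholder for the real renderer).
    byte[] png = renderTile(key.toString(), cellCounts);
    // Write (key, png) into the job's SequenceFile output; pushing to
    // S3 could happen here instead, once speculative execution is off.
    output.collect(key, new BytesWritable(png));
  }

  private byte[] renderTile(String tileKey, Map<String, Integer> cells) {
    return new byte[0];  // placeholder
  }
}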


Cheers,

Tim



On Thu, Apr 16, 2009 at 1:49 AM, Chuck Lam <ch...@gmail.com> wrote:
> Hi Tim,
>
> I'm really interested in your application at gbif.org. I'm in the middle of
> writing Hadoop in Action ( http://www.manning.com/lam/ ) and think this may
> make for an interesting Hadoop case study, since you're taking advantage of
> a lot of different pieces (EC2, S3, CloudFront, SequenceFiles,
> PHP/streaming). Would you be interested in discussing making a 4-5 page case
> study out of this?
>
> As to your question, I don't know if it's been properly answered, but I
> don't know why you think that "multiple tasks are running on the same
> section of the sequence file." Maybe you can elaborate further and I'll see
> if I can offer any thoughts.
>
> [...]

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by tim robertson <ti...@gmail.com>.
Sorry Brian, can I just ask please...

I have the PNGs in a SequenceFile for my sample set.  If I use a
second MR job and push to S3 in the map, surely I run into the
scenario where multiple tasks are running on the same section of the
SequenceFile and thus pushing the same data to S3.  Am I missing
something obvious (e.g. can I disable this behavior)?

Cheers

Tim


On Tue, Apr 14, 2009 at 2:44 PM, tim robertson
<ti...@gmail.com> wrote:
> Thanks Brian,
>
> This is pretty much what I was looking for.
>
> Your calculations are correct, but based on the assumption that at all
> zoom levels we will need all tiles generated.  Given the sparsity of
> the data, it actually comes to only a few hundred GB.  I'll run a
> second MR job with the map pushing to S3, to make use of parallel
> loading.
>
> Cheers,
>
> Tim
>
> [...]

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by tim robertson <ti...@gmail.com>.
Thanks Brian,

This is pretty much what I was looking for.

Your calculations are correct, but based on the assumption that at all
zoom levels we will need all tiles generated.  Given the sparsity of
the data, it actually comes to only a few hundred GB.  I'll run a
second MR job with the map pushing to S3, to make use of parallel
loading.

Cheers,

Tim


On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman <bb...@cse.unl.edu> wrote:
> Hey Tim,
>
> Why don't you put the PNGs in a SequenceFile in the output of your reduce
> task?  You could then have a post-processing step that unpacks the PNG and
> places it onto S3.  (If my numbers are correct, you're looking at around 3TB
> of data; is this right?  With that much, you might want another separate Map
> task to unpack all the files in parallel ... really depends on the
> throughput you get to Amazon)
>
> Brian
>
> [...]

Re: Generating many small PNGs to Amazon S3 with MapReduce

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Tim,

Why don't you put the PNGs in a SequenceFile in the output of your
reduce task?  You could then have a post-processing step that unpacks
the PNG and places it onto S3.  (If my numbers are correct, you're
looking at around 3TB of data; is this right?  With that much, you
might want another separate Map task to unpack all the files in
parallel ... really depends on the throughput you get to Amazon)
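
Roughly, that unpacking step could be a map-only job over the
SequenceFile; here is a sketch (old-style Hadoop API; s3Put() is a
placeholder for whatever S3 client you use, e.g. JetS3t - run it with
zero reduces and speculative execution disabled so no tile is PUT
twice):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * Map-only job over a SequenceFile of (tileKey, pngBytes): each map
 * task uploads its slice of the file to S3, so the uploads run in
 * parallel across the cluster.
 */
public class S3UploadMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, NullWritable, NullWritable> {

  public void map(Text tileKey, BytesWritable png,
                  OutputCollector<NullWritable, NullWritable> output,
                  Reporter reporter) throws IOException {
    // BytesWritable's backing array can be longer than the value, so
    // copy out exactly getLength() bytes.
    byte[] bytes = new byte[png.getLength()];
    System.arraycopy(png.getBytes(), 0, bytes, 0, png.getLength());
    // s3Put() is hypothetical: wrap whatever S3 client you use (e.g.
    // JetS3t's RestS3Service) to PUT one object per tile.
    s3Put("my-tile-bucket", tileKey.toString() + ".png", bytes);
    reporter.incrCounter("s3", "tiles-uploaded", 1);
  }

  private void s3Put(String bucket, String key, byte[] data)
      throws IOException {
    // placeholder; see your S3 library's documentation
  }
}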

Brian

On Apr 14, 2009, at 4:35 AM, tim robertson wrote:

> [...]