Posted to mapreduce-user@hadoop.apache.org by andrea zonca <an...@gmail.com> on 2013/07/12 23:43:19 UTC

Running hadoop for processing sources in full sky maps

Hi,

I have a few tens of full-sky maps in binary format (FITS), about 600 MB each.

For each sky map I already have a catalog of the positions of a few
thousand sources, i.e. stars, galaxies, and radio sources.

For each source I would like to:

- open the full sky map
- extract the relevant section, typically 20 MB or less
- run some statistics on it
- aggregate the outputs into a catalog

I would like to run hadoop, possibly using python via the streaming
interface, to process them in parallel.

I think the input to the mapper should be each record of the catalogs,
then the python mapper can open the full sky map, do the processing
and print the output to stdout.
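
Roughly, I imagine the mapper looking something like this (a sketch only;
I am assuming each catalog line is "map_path,lat,lon" and that the maps
are HEALPix files readable with healpy; both are assumptions here):

    #!/usr/bin/env python
    # mapper.py -- sketch of a Hadoop Streaming mapper (assumed input:
    # one catalog record per line, "map_path,lat,lon").
    import sys
    import numpy as np
    import healpy as hp

    for line in sys.stdin:
        map_path, lat, lon = line.strip().split(",")
        lat, lon = float(lat), float(lon)
        # The full map is expected to be node-local (this is exactly
        # the configuration question asked below).
        sky_map = hp.read_map(map_path)
        nside = hp.get_nside(sky_map)
        # Select pixels within 1 degree of the source (radius made up).
        theta, phi = np.radians(90.0 - lat), np.radians(lon)
        pixels = hp.query_disc(nside, hp.ang2vec(theta, phi), np.radians(1.0))
        section = sky_map[pixels]
        # Emit "source<TAB>statistics" for later aggregation in a reducer.
        print("%s,%.5f,%.5f\t%.6g\t%.6g" % (map_path, lat, lon,
                                            section.mean(), section.std()))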

Is this a reasonable approach?
If so, I need to be able to configure hadoop so that a full sky map is
copied locally to the nodes that are processing one of its sources.
How can I achieve that?
Also, what is the best way to feed the input data to Hadoop? For each
source I have a reference to the full sky map, a latitude, and a longitude.

Thanks,
I posted this question on StackOverflow:
http://stackoverflow.com/questions/17617654/running-hadoop-for-processing-sources-in-full-sky-maps

Regards,
Andrea Zonca

Re: Running hadoop for processing sources in full sky maps

Posted by Jens Scheidtmann <je...@gmail.com>.
Dear Andrea,

you write:
- a few tens of sky maps, 600 MB each = some tens of GB.
- a few thousand sources, lat/long for each = let's say 10k records = about 1 MB.

Sounds to me like you should turn your problem around:
- distribute your sources to every node,
- work on chunks of your sky map files,
- search for sources fully covered by the (node-local) input,
- if necessary, load some adjacent sky map files,
- perform the statistics (a rough sketch of such a mapper follows below).
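
Made concrete, the sketch could look like this in Python (every detail of
the input convention is an assumption: each record names a node-local
chunk file plus its sky bounds, and the small catalog ships with the job,
e.g. via the distributed cache):

    # inverted_mapper.py -- sketch of the "turned around" approach:
    # the map task owns a chunk of sky and scans the (small) catalog
    # for sources fully covered by that chunk.
    import sys
    import csv

    def load_catalog(path="catalog.csv"):
        # Assumed catalog format: map_name,lat,lon per line.
        with open(path) as f:
            return [(n, float(la), float(lo)) for n, la, lo in csv.reader(f)]

    CATALOG = load_catalog()

    for line in sys.stdin:
        # Assumed record format: chunk_path,lat_min,lat_max,lon_min,lon_max
        parts = line.strip().split(",")
        chunk_path = parts[0]
        lat_min, lat_max, lon_min, lon_max = map(float, parts[1:])
        for name, lat, lon in CATALOG:
            if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
                # Here you would open chunk_path, cut out the source's
                # neighbourhood and compute the statistics; emit key/value.
                print("%s,%f,%f\t%s" % (name, lat, lon, chunk_path))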

Check http://en.wikipedia.org/wiki/Spatial_database#Spatial_index
for a list of possible indexing strategies to prepare your sky maps so they
have some locality properties: use e.g. Z-order + padding to define the
areas to be fed into your statistics algorithms.
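
To make the Z-order idea concrete, a tiny sketch (plain bit interleaving
of two quantized coordinates; 16 bits per axis is an arbitrary choice):

    # Z-order (Morton) key: interleave the bits of two 16-bit coordinates.
    # Sorting records by this key keeps spatially nearby points close
    # together in the file, which is the locality property we want.
    def morton_key(x, y):
        key = 0
        for bit in range(16):
            key |= ((x >> bit) & 1) << (2 * bit)      # x on even bits
            key |= ((y >> bit) & 1) << (2 * bit + 1)  # y on odd bits
        return key

    # e.g. records.sort(key=lambda r: morton_key(r.ix, r.iy))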

You could also prepare input files that repeat some information contained
in the sky maps, such as overlapping borders around a central area, so that
each input block has enough information to calculate your statistics for
every point contained in the central area.
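
A sketch of such padded blocks, with plain 2D arrays standing in for
projected map data (tile and border sizes are arbitrary):

    import numpy as np

    def padded_tiles(sky, tile=512, pad=32):
        # Yield tiles of `sky` that overlap their neighbours by `pad`
        # pixels, so statistics inside the central tile never need to
        # reach into another file.
        h, w = sky.shape
        for i in range(0, h, tile):
            for j in range(0, w, tile):
                top, left = max(i - pad, 0), max(j - pad, 0)
                bottom, right = min(i + tile + pad, h), min(j + tile + pad, w)
                yield (i, j), sky[top:bottom, left:right]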

Whether this preprocessing step pays off depends on how often you're going
to run your statistics (e.g. on different points).

Hope this ignites your creativity,

Jens

Re: Running hadoop for processing sources in full sky maps

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Andrea,

For copying the full sky map to each node, look up the distributed cache.
It works by placing the sky map file on HDFS; each task will pull it down
when needed. For feeding the input data into Hadoop, what format is it in
currently? One simple way would be to have a text file with the reference,
latitude, and longitude separated by commas on each line, and then use
TextInputFormat.
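
For example, a streaming job along these lines would ship the map through
the distributed cache and read the catalog as plain text (all paths and
file names below are made up; the streaming jar location varies by install):

    hadoop jar hadoop-streaming.jar \
        -files hdfs:///maps/skymap1.fits#skymap1.fits,mapper.py \
        -input /catalogs/sources.txt \
        -output /results/source_stats \
        -mapper "python mapper.py" \
        -reducer NONE

    # where each line of sources.txt looks like:
    # skymap1.fits,41.2,268.3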

-Sandy


On Fri, Jul 12, 2013 at 2:43 PM, andrea zonca <an...@gmail.com> wrote:

> Hi,
>
> I have a few tens of full-sky maps in binary format (FITS), about 600 MB
> each.
>
> For each sky map I already have a catalog of the positions of a few
> thousand sources, i.e. stars, galaxies, and radio sources.
>
> For each source I would like to:
>
> - open the full sky map
> - extract the relevant section, typically 20 MB or less
> - run some statistics on it
> - aggregate the outputs into a catalog
>
> I would like to run hadoop, possibly using python via the streaming
> interface, to process them in parallel.
>
> I think the input to the mapper should be each record of the catalogs,
> then the python mapper can open the full sky map, do the processing
> and print the output to stdout.
>
> Is this a reasonable approach?
> If so, I need to be able to configure hadoop so that a full sky map is
> copied locally to the nodes that are processing one of its sources.
> How can I achieve that?
> Also, what is the best way to feed the input data to Hadoop? For each
> source I have a reference to the full sky map, a latitude, and a longitude.
>
> Thanks,
> I posted this question on StackOverflow:
>
> http://stackoverflow.com/questions/17617654/running-hadoop-for-processing-sources-in-full-sky-maps
>
> Regards,
> Andrea Zonca
>
