Posted to user@hama.apache.org by Noah Watkins <ja...@cs.ucsc.edu> on 2012/03/27 17:46:42 UTC

multi-dimensional array storage

Hi Hama list,

I'm interested in using Hama to process large multi-dimensional arrays (sparse and dense). What is the best way to store and represent this type of data for processing in Hama?

Thanks,
Noah

Re: multi-dimensional array storage

Posted by Thomas Jungblut <th...@googlemail.com>.
>
> Does that make sense, and are there any suggestions for doing this?
>

Yep, that seems fine. Just use Hama's I/O system and the ArrayWritable trick
for dense matrices, stored as a SequenceFile. I guess this would be the best
solution.
There is a little bit of overhead in SequenceFiles because they use zlib
compression by default, so text files may be just as fast, but then you have
to parse the strings.
You can turn off compression by setting "io.seqfile.compression.type" to
"NONE" in the configuration.

If you need additional tips, don't hesitate to come back and ask ;)
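
For example, here is a minimal sketch of writing dense rows that way (the
row-indexed layout, the output path, and the DoubleArrayWritable name are
just illustrative choices; giving ArrayWritable a no-arg subclass is
basically what the "ArrayWritable trick" amounts to, since the plain class
cannot be re-instantiated when the file is read back):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class DenseRowWriter {

  // ArrayWritable has no no-arg constructor, so a small subclass is needed
  // before the values can be deserialized again when the file is read.
  public static class DoubleArrayWritable extends ArrayWritable {
    public DoubleArrayWritable() {
      super(DoubleWritable.class);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("io.seqfile.compression.type", "NONE"); // disable compression
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/dense-matrix.seq"); // placeholder path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, DoubleArrayWritable.class);
    try {
      double[] row = { 1.0, 2.0, 3.0 };
      DoubleWritable[] cells = new DoubleWritable[row.length];
      for (int i = 0; i < row.length; i++) {
        cells[i] = new DoubleWritable(row[i]);
      }
      DoubleArrayWritable value = new DoubleArrayWritable();
      value.set(cells);
      writer.append(new LongWritable(0), value); // key = row index
    } finally {
      writer.close();
    }
  }
}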

On 28 March 2012 at 16:06, Noah Watkins <ja...@cs.ucsc.edu> wrote:

> Thanks for the feedback. I'll focus on the dense array challenge for now.
>
> We will be examining 1000^3 arrays; multiple such arrays will represent
> changes to a spatial environment over time. That being the case, I think
> (but could be wrong) that representing each individual coordinate value is
> overkill, and that an array should be stored in chunks. For example, rather
> than store each coordinate value as an HBase key or a value in a sequence
> file (resulting in N billion keys), an array should be decomposed and
> stored as contiguous hyperslabs. Then a key becomes, for example, the
> corner of the hyperslab.
>
> Does that make sense, and are there any suggestions for doing this? I
> think, as you said, simply using ArrayWritable as a SequenceFile value
> would work?
>
> As for our algorithms, currently we are interested only in structural
> manipulation, such as extracting hyperslabs. We will focus on analysis
> later, but the chunked solution should be OK for that, too.
>
>
> On Mar 27, 2012, at 11:20 PM, Thomas Jungblut wrote:
>
> > Hey, besides HBase you can use SequenceFiles, they have key/value pairs.
> > So normally you use some kind of <VectorWritable, NullWritable> pair;
> > VectorWritable is available, for example, in Mahout, which has a good
> > math package for sparse and dense vectors.
> >
> > If you don't want vector classes then you can use ArrayWritable for dense
> > and MapWritable for sparse data.
> > It also depends on what you're doing with your data, so if you have more
> > information about the algorithm, we can give you a better suggestion ;)
> >
> > On 28 March 2012 at 00:51, Edward J. Yoon <ed...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> I believe that HBase is the best way to store multi-dimensional
> >> arrays. HBase provides storage efficiency as the number of dimensions
> >> grows, gives you ordering capability, and also lets you record and
> >> access data corrections and updates directly via the HBase client
> >> library.
> >>
> >> Another option is the use of SequenceFile and MapFile. Once the data
> >> is loaded into the program initially, your math operations can run
> >> directly in memory and be synchronized using the standard BSP APIs.
> >>
> >> Thanks.
> >>
> >> On Wed, Mar 28, 2012 at 12:46 AM, Noah Watkins <ja...@cs.ucsc.edu>
> >> wrote:
> >>> Hi Hama list,
> >>>
> >>> I'm interested in using Hama to process large multi-dimensional arrays
> >> (sparse and dense). What is the best way to store and represent this
> type
> >> of data for processing in Hama?
> >>>
> >>> Thanks,
> >>> Noah
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
> >
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <th...@gmail.com>
>
>


-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: multi-dimensional array storage

Posted by Noah Watkins <ja...@cs.ucsc.edu>.
Thanks for the feedback. I'll focus on the dense array challenge for now.

We will be examining 1000^3 arrays; multiple such arrays will represent changes to a spatial environment over time. That being the case, I think (but could be wrong) that representing each individual coordinate value is overkill, and that an array should be stored in chunks. For example, rather than store each coordinate value as an HBase key or a value in a sequence file (resulting in N billion keys), an array should be decomposed and stored as contiguous hyperslabs. Then a key becomes, for example, the corner of the hyperslab.
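
Roughly what I have in mind, as a sketch (the HyperslabKey name and the
fixed 3-D layout with long coordinates are just placeholders):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: the corner of a hyperslab in a 3-D array.
// The value alongside it would carry the chunk's cells, e.g. a flattened
// double[] wrapped in an ArrayWritable subclass.
public class HyperslabKey implements WritableComparable<HyperslabKey> {
  private long x, y, z; // corner coordinates of the chunk

  public HyperslabKey() {}

  public HyperslabKey(long x, long y, long z) {
    this.x = x; this.y = y; this.z = z;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(x);
    out.writeLong(y);
    out.writeLong(z);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    x = in.readLong();
    y = in.readLong();
    z = in.readLong();
  }

  @Override
  public int compareTo(HyperslabKey o) {
    // order chunks by x, then y, then z
    int c = Long.compare(x, o.x);
    if (c == 0) c = Long.compare(y, o.y);
    if (c == 0) c = Long.compare(z, o.z);
    return c;
  }
}

A 1000^3 array split into 100^3 chunks would then be roughly 1,000 key/value pairs instead of a billion.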

Does that make sense, and are there any suggestions for doing this? I think, as you said, simply using ArrayWritable as a SequenceFile value would work?

As for our algorithms, currently we are interested only in structural manipulation, such as extracting hyperslabs. We will focus on analysis later, but the chunked solution should be OK for that, too.


On Mar 27, 2012, at 11:20 PM, Thomas Jungblut wrote:

> Hey, besides HBase you can use SequenceFiles, they have key/value pairs.
> So normally you use some kind of <VectorWritable, NullWritable> pair;
> VectorWritable is available, for example, in Mahout, which has a good math
> package for sparse and dense vectors.
> 
> If you don't want vector classes then you can use ArrayWritable for dense
> and MapWritable for sparse data.
> It also depends on what you're doing with your data, so if you have more
> information about the algorithm, we can give you a better suggestion ;)
> 
> On 28 March 2012 at 00:51, Edward J. Yoon <ed...@apache.org> wrote:
> 
>> Hi,
>> 
>> I believe that HBase is the best way to store multi-dimensional
>> arrays. HBase provides storage efficiency as the number of dimensions
>> grows, gives you ordering capability, and also lets you record and
>> access data corrections and updates directly via the HBase client
>> library.
>> 
>> Another option is the use of SequenceFile and MapFile. Once the data
>> is loaded into the program initially, your math operations can run
>> directly in memory and be synchronized using the standard BSP APIs.
>> 
>> Thanks.
>> 
>> On Wed, Mar 28, 2012 at 12:46 AM, Noah Watkins <ja...@cs.ucsc.edu>
>> wrote:
>>> Hi Hama list,
>>> 
>>> I'm interested in using Hama to process large multi-dimensional arrays
>> (sparse and dense). What is the best way to store and represent this type
>> of data for processing in Hama?
>>> 
>>> Thanks,
>>> Noah
>> 
>> 
>> 
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>> 
> 
> 
> 
> -- 
> Thomas Jungblut
> Berlin <th...@gmail.com>


Re: multi-dimensional array storage

Posted by Thomas Jungblut <th...@googlemail.com>.
Hey, besides HBase you can use SequenceFiles, they have key/value pairs.
So normally you use some kind of <VectorWritable, NullWritable> pair;
VectorWritable is available, for example, in Mahout, which has a good math
package for sparse and dense vectors.

If you don't want vector classes then you can use ArrayWritable for dense
and MapWritable for sparse data.
It also depends on what you're doing with your data, so if you have more
information about the algorithm, we can give you a better suggestion ;)
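
For example, a rough sketch of the sparse option, encoding a row as a
MapWritable from column index to value (the column-index convention used
here is just one possibility):

import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Writable;

public class SparseRowExample {

  // Encode a sparse row as column-index -> value, skipping zero cells.
  public static MapWritable toSparseRow(double[] denseRow) {
    MapWritable row = new MapWritable();
    for (int col = 0; col < denseRow.length; col++) {
      if (denseRow[col] != 0.0) {
        row.put(new IntWritable(col), new DoubleWritable(denseRow[col]));
      }
    }
    return row;
  }

  public static void main(String[] args) {
    MapWritable row = toSparseRow(new double[] { 0.0, 4.2, 0.0, 0.0, 7.5 });
    for (Map.Entry<Writable, Writable> e : row.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}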

On 28 March 2012 at 00:51, Edward J. Yoon <ed...@apache.org> wrote:

> Hi,
>
> I believe that HBase is the best way to store multi-dimensional
> arrays. HBase provides storage efficiency as the number of dimensions
> grows, gives you ordering capability, and also lets you record and
> access data corrections and updates directly via the HBase client
> library.
>
> Another option is the use of SequenceFile and MapFile. Once the data
> is loaded into the program initially, your math operations can run
> directly in memory and be synchronized using the standard BSP APIs.
>
> Thanks.
>
> On Wed, Mar 28, 2012 at 12:46 AM, Noah Watkins <ja...@cs.ucsc.edu>
> wrote:
> > Hi Hama list,
> >
> > I'm interested in using Hama to process large multi-dimensional arrays
> (sparse and dense). What is the best way to store and represent this type
> of data for processing in Hama?
> >
> > Thanks,
> > Noah
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>



-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: multi-dimensional array storage

Posted by "Edward J. Yoon" <ed...@apache.org>.
Hi,

I believe that HBase is the best way to store multi-dimensional
arrays. HBase provides storage efficiency as the number of dimensions
grows, gives you ordering capability, and also lets you record and
access data corrections and updates directly via the HBase client library.
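
For example, a minimal sketch of one possible cell-per-row layout (the
column family "d" and qualifier "v" are only placeholder names, and older
HBase clients spell addColumn as Put.add):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CellPutExample {

  // One HBase row per array cell, row key = (x, y, z) packed as fixed-width
  // big-endian longs, so non-negative coordinates sort in numeric order.
  public static Put cellPut(long x, long y, long z, double value) {
    byte[] rowKey = Bytes.add(Bytes.toBytes(x), Bytes.toBytes(y), Bytes.toBytes(z));
    Put put = new Put(rowKey);
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(value));
    return put;
  }
}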

Another option is the use of SequenceFile and MapFile. Once the data is
loaded into the program initially, your math operations can run directly
in memory and be synchronized using the standard BSP APIs.
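
As a rough sketch of that pattern, assuming Hama's BSPPeer exposes
readNext() and sync() as in the 0.4-era API (method signatures may differ
across versions), and reusing an ArrayWritable subclass like the one
sketched elsewhere in this thread:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Mirrors the ArrayWritable subclass used when the matrix file was written.
class DoubleArrayWritable extends ArrayWritable {
  public DoubleArrayWritable() {
    super(DoubleWritable.class);
  }
}

// Reads this task's share of the dense-matrix SequenceFile into memory,
// then enters the usual compute/sync superstep cycle.
public class MatrixLoadBSP extends
    BSP<LongWritable, DoubleArrayWritable, NullWritable, NullWritable, DoubleWritable> {

  private final List<double[]> localRows = new ArrayList<double[]>();

  @Override
  public void bsp(
      BSPPeer<LongWritable, DoubleArrayWritable, NullWritable, NullWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    LongWritable rowIndex = new LongWritable();
    DoubleArrayWritable row = new DoubleArrayWritable();
    while (peer.readNext(rowIndex, row)) {
      // keep the rows assigned to this peer in memory
      Writable[] cells = row.get();
      double[] values = new double[cells.length];
      for (int i = 0; i < cells.length; i++) {
        values[i] = ((DoubleWritable) cells[i]).get();
      }
      localRows.add(values);
    }
    // ... compute on localRows and exchange partial results via peer.send() ...
    peer.sync(); // superstep barrier
  }
}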

Thanks.

On Wed, Mar 28, 2012 at 12:46 AM, Noah Watkins <ja...@cs.ucsc.edu> wrote:
> Hi Hama list,
>
> I'm interested in using Hama to process large multi-dimensional arrays (sparse and dense). What is the best way to store and represent this type of data for processing in Hama?
>
> Thanks,
> Noah



-- 
Best Regards, Edward J. Yoon
@eddieyoon