Posted to mapreduce-user@hadoop.apache.org by Harsh J <ha...@cloudera.com> on 2013/02/13 06:29:40 UTC

Re: Anyway to load certain Key/Value pair fast?

Please do not use the general@ lists for any user-oriented questions.
Please redirect them to user@hadoop.apache.org lists, which is where
the user community and questions lie.

I've moved your post there and have added you on CC in case you
haven't subscribed there. Please reply back only to the user@
addresses. The general@ list is for Apache Hadoop project-level
management and release oriented discussions alone.

On Wed, Feb 13, 2013 at 10:54 AM, William Kang <we...@gmail.com> wrote:
> Hi All,
> I am trying to figure out a good solution for the following scenario.
>
> 1. I have a 2 TB file (let's call it A), filled with key/value pairs
> and stored in HDFS with the default 64 MB block size. In A, each key
> is less than 1 KB and each value is about 20 MB.
>
> 2. Occasionally, I will run an analysis using a different kind of data
> (usually less than 10 GB; let's call it B) and perform look-up-table-like
> operations using the values in A. B resides in HDFS as well.
>
> 3. This analysis requires loading only a small number of values from A
> (usually fewer than 1000 of them) into memory for fast look-up against
> the data in B. B finds the few values it needs in A by looking up
> their keys.
>
> Is there an efficient way to do this?
>
> I was thinking that if I could identify the locality of the blocks that
> contain those few values, I might be able to push B to the few nodes
> that hold them in A. Since I only need to do this occasionally,
> maintaining a distributed database such as HBase can't be justified.
>
> Many thanks.
>
>
> Cao



--
Harsh J

Re: Anyway to load certain Key/Value pair fast?

Posted by William Kang <we...@gmail.com>.
Hi Harsh,
Thanks for moving the post to the correct list.


William

On Wed, Feb 13, 2013 at 12:29 AM, Harsh J <ha...@cloudera.com> wrote:
> Please do not use the general@ lists for any user-oriented questions.
> Please redirect them to user@hadoop.apache.org lists, which is where
> the user community and questions lie.
>
> I've moved your post there and have added you on CC in case you
> haven't subscribed there. Please reply back only to the user@
> addresses. The general@ list is for Apache Hadoop project-level
> management and release oriented discussions alone.
>
> On Wed, Feb 13, 2013 at 10:54 AM, William Kang <we...@gmail.com> wrote:
>> Hi All,
>> I am trying to figure out a good solution for the following scenario.
>>
>> 1. I have a 2 TB file (let's call it A), filled with key/value pairs
>> and stored in HDFS with the default 64 MB block size. In A, each key
>> is less than 1 KB and each value is about 20 MB.
>>
>> 2. Occasionally, I will run an analysis using a different kind of data
>> (usually less than 10 GB; let's call it B) and perform look-up-table-like
>> operations using the values in A. B resides in HDFS as well.
>>
>> 3. This analysis requires loading only a small number of values from A
>> (usually fewer than 1000 of them) into memory for fast look-up against
>> the data in B. B finds the few values it needs in A by looking up
>> their keys.
>>
>> Is there an efficient way to do this?
>>
>> I was thinking that if I could identify the locality of the blocks that
>> contain those few values, I might be able to push B to the few nodes
>> that hold them in A. Since I only need to do this occasionally,
>> maintaining a distributed database such as HBase can't be justified.
>>
>> Many thanks.
>>
>>
>> Cao
>
>
>
> --
> Harsh J

Re: Anyway to load certain Key/Value pair fast?

Posted by William Kang <we...@gmail.com>.
Hi Harsh,
Thanks a lot for your reply and great suggestions.

In practical cases, the values usually do not reside on the same
data node. Instead, they are mostly distributed by the key range
itself. So it does require about 20 GB of memory, but spread across
different nodes.

The MapFile solution is very intriguing. I am not very familiar with
it, though. I assume it resembles the basic idea behind HBase? I will
certainly try it out and follow up if I have questions.

I agree that using HBase would be much easier, but the value size
makes me worry that it would push it to the edge. If I do this more
often, I will definitely consider using HBase.

Many thanks for the great reply.


William


On Wed, Feb 13, 2013 at 12:38 AM, Harsh J <ha...@cloudera.com> wrote:
> My reply to your questions is inline.
>
> On Wed, Feb 13, 2013 at 10:59 AM, Harsh J <ha...@cloudera.com> wrote:
>> Please do not use the general@ lists for any user-oriented questions.
>> Please redirect them to user@hadoop.apache.org lists, which is where
>> the user community and questions lie.
>>
>> I've moved your post there and have added you on CC in case you
>> haven't subscribed there. Please reply back only to the user@
>> addresses. The general@ list is for Apache Hadoop project-level
>> management and release oriented discussions alone.
>>
>> On Wed, Feb 13, 2013 at 10:54 AM, William Kang <we...@gmail.com> wrote:
>>> Hi All,
>>> I am trying to figure out a good solution for the following scenario.
>>>
>>> 1. I have a 2 TB file (let's call it A), filled with key/value pairs
>>> and stored in HDFS with the default 64 MB block size. In A, each key
>>> is less than 1 KB and each value is about 20 MB.
>>>
>>> 2. Occasionally, I will run an analysis using a different kind of data
>>> (usually less than 10 GB; let's call it B) and perform look-up-table-like
>>> operations using the values in A. B resides in HDFS as well.
>>>
>>> 3. This analysis requires loading only a small number of values from A
>>> (usually fewer than 1000 of them) into memory for fast look-up against
>>> the data in B. B finds the few values it needs in A by looking up
>>> their keys.
>
> About 1000 such rows, at roughly 20 MB per value, would amount to a
> memory expense of nearly 20 GB. The solution may need to be designed
> with this in mind if the whole lookup table is to be built in memory
> and kept until the end of processing.
>
>>> Is there an efficient way to do this?
>
> Since HBase may be too much for your simple needs, have you instead
> considered using MapFiles, which allow fast key lookups at the file
> level over HDFS/MR? You can have these files either highly replicated
> (if they are large) or distributed via the distributed cache in the
> lookup jobs (if they are infrequently used and small), and then use
> the MapFile reader API to look up keys and read the values only when
> you want them.
>
>>> I was thinking that if I could identify the locality of the blocks that
>>> contain those few values, I might be able to push B to the few nodes
>>> that hold them in A. Since I only need to do this occasionally,
>>> maintaining a distributed database such as HBase can't be justified.
>
> I agree that HBase may not be wholly suited to being run just for this
> purpose (unless A is also going to keep growing over time).
>
> Maintaining the value -> locality mapping would need to be done by you.
> The FS APIs provide locality-info calls, and your files may be
> key-partitioned well enough to identify each one's key range, so you
> can combine those two pieces of knowledge to do something along these
> lines.
>
> Using HBase may also turn out to be "easier", but that's up to you. You
> can also choose to tear it down (i.e. the services) when it is not
> needed, by the way.
>
>>> Many thanks.
>>>
>>>
>>> Cao
>>
>>
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J

Re: Anyway to load certain Key/Value pair fast?

Posted by Harsh J <ha...@cloudera.com>.
My reply to your questions is inline.

On Wed, Feb 13, 2013 at 10:59 AM, Harsh J <ha...@cloudera.com> wrote:
> Please do not use the general@ lists for any user-oriented questions.
> Please redirect them to user@hadoop.apache.org lists, which is where
> the user community and questions lie.
>
> I've moved your post there and have added you on CC in case you
> haven't subscribed there. Please reply back only to the user@
> addresses. The general@ list is for Apache Hadoop project-level
> management and release oriented discussions alone.
>
> On Wed, Feb 13, 2013 at 10:54 AM, William Kang <we...@gmail.com> wrote:
>> Hi All,
>> I am trying to figure out a good solution for the following scenario.
>>
>> 1. I have a 2 TB file (let's call it A), filled with key/value pairs
>> and stored in HDFS with the default 64 MB block size. In A, each key
>> is less than 1 KB and each value is about 20 MB.
>>
>> 2. Occasionally, I will run an analysis using a different kind of data
>> (usually less than 10 GB; let's call it B) and perform look-up-table-like
>> operations using the values in A. B resides in HDFS as well.
>>
>> 3. This analysis requires loading only a small number of values from A
>> (usually fewer than 1000 of them) into memory for fast look-up against
>> the data in B. B finds the few values it needs in A by looking up
>> their keys.

About 1000 such rows, at roughly 20 MB per value, would amount to a
memory expense of nearly 20 GB. The solution may need to be designed
with this in mind if the whole lookup table is to be built in memory
and kept until the end of processing.

>> Is there an efficient way to do this?

Since HBase may be too much for your simple needs, have you instead
considered using MapFiles, which allow fast key lookups at the file
level over HDFS/MR? You can have these files either highly replicated
(if they are large) or distributed via the distributed cache in the
lookup jobs (if they are infrequently used and small), and then use
the MapFile reader API to look up keys and read the values only when
you want them.
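
For instance, here is a minimal lookup sketch, assuming A has been
rewritten as a MapFile with Text keys and BytesWritable values (the
path and key below are made up, and the exact Reader constructor
varies a little across Hadoop versions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class MapFileLookup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Directory holding the MapFile's "data" and "index" parts (hypothetical path).
      MapFile.Reader reader = new MapFile.Reader(fs, "/data/A-as-mapfile", conf);
      try {
        Text key = new Text("some-key-from-B");   // a key taken from B
        BytesWritable value = new BytesWritable();
        // get() seeks using the MapFile index and reads just the matching
        // record, so only the ~20 MB value you asked for is materialized.
        if (reader.get(key, value) != null) {
          System.out.println(key + " -> " + value.getLength() + " bytes");
        }
      } finally {
        reader.close();
      }
    }
  }

Only the small index (keys and positions) is held in memory; the large
values stay on disk until a get() asks for them, which is what makes
this cheaper than scanning whole blocks of A.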

>> I was thinking that if I could identify the locality of the blocks that
>> contain those few values, I might be able to push B to the few nodes
>> that hold them in A. Since I only need to do this occasionally,
>> maintaining a distributed database such as HBase can't be justified.

I agree that HBase may not be wholly suited to being run just for this
purpose (unless A is also going to keep growing over time).

Maintaining the value -> locality mapping would need to be done by you.
The FS APIs provide locality-info calls, and your files may be
key-partitioned well enough to identify each one's key range, so you
can combine those two pieces of knowledge to do something along these
lines.
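
A rough sketch of the locality-info part, assuming you can map a key to
an approximate byte range in A (the path and offset below are made up;
you would compute the offset from your own key -> range bookkeeping):

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocality {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      FileStatus status = fs.getFileStatus(new Path("/data/A"));

      long offset = 0L;          // start of the region holding the wanted value
      long length = 20L << 20;   // ~20 MB, i.e. roughly one value

      // One BlockLocation per HDFS block overlapping [offset, offset + length).
      for (BlockLocation loc : fs.getFileBlockLocations(status, offset, length)) {
        System.out.println("block at offset " + loc.getOffset()
            + " lives on " + Arrays.toString(loc.getHosts()));
      }
    }
  }

Those hostnames are what you would use to schedule B's lookups close to
the blocks that hold the wanted values.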

Using HBase may also turn out to be "easier", but that's up to you. You
can also choose to tear it down (i.e. the services) when it is not
needed, by the way.

>> Many thanks.
>>
>>
>> Cao
>
>
>
> --
> Harsh J



--
Harsh J
