Posted to user@hbase.apache.org by Christopher Dorner <ch...@gmail.com> on 2011/10/01 13:19:33 UTC

question about writing to columns with lots of versions in map task

Hello,

I am reading a file containing RDF triples in a map job. The RDF triples
are then stored in a table whose columns can have lots of versions,
so I need to store many values for one rowKey in the same column.
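
For reference, this is roughly how the table gets created (just a sketch,
the table and family names are only examples); the column family needs a
high enough VERSIONS setting, otherwise older values would be pruned no
matter which timestamps I use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTriplesTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor table = new HTableDescriptor("triples");   // example table name
    HColumnDescriptor family = new HColumnDescriptor("rdf");    // example family name
    family.setMaxVersions(Integer.MAX_VALUE);                   // keep every version of a cell
    table.addFamily(family);
    admin.createTable(table);
  }
}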

I have observed that reading the file is very fast, so some values are
put into the table with the same timestamp and therefore overwrite an
existing value.

How can I avoid that? The timestamps are not needed for anything later on.

Could I simply use some sort of custom counter?

How would that work in fully distributed mode? I am working in
pseudo-distributed mode for testing purposes right now.

Thank You and Regards,
Christopher

Re: question about writing to columns with lots of versions in map task

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Regarding SPARQL and HBase, there is this thread which contains a LOT
of information:

http://search-hadoop.com/m/LLyyCSNDqm

And it refers to this jira:

https://issues.apache.org/jira/browse/HBASE-2433

Hopefully this can save you some time designing your model.

Regarding your particular problem: you will end up with fat rows, and
using the timestamps as another dimension is very error-prone. I would
be more tempted to store each triple in its own row, but judging from the
discussion I linked, it gets much more involved than that once you want
to run SPARQL on it.
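
Something like this is what I mean (completely untested sketch, the family
name and the separator byte are just placeholders): put the whole triple
into the row key so nothing can ever be overwritten, and fetch all subjects
for an (object, predicate) pair with a prefix scan:

import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleRows {
  private static final byte[] SEP = new byte[] { 0 };        // assumes 0x00 never appears in a term
  private static final byte[] FAMILY = Bytes.toBytes("t");   // placeholder family name

  // one row per triple: nothing ever overwrites anything, timestamps stay untouched
  public static Put toPut(String object, String predicate, String subject) {
    byte[] row = Bytes.add(
        Bytes.add(Bytes.toBytes(object), SEP, Bytes.toBytes(predicate)),
        SEP, Bytes.toBytes(subject));
    Put put = new Put(row);
    put.add(FAMILY, HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY);
    return put;
  }

  // all subjects for (object, predicate) become a prefix scan over the row keys
  public static Scan subjectsOf(String object, String predicate) {
    byte[] prefix = Bytes.add(Bytes.toBytes(object), SEP,
        Bytes.add(Bytes.toBytes(predicate), SEP));
    Scan scan = new Scan(prefix);                 // start at the prefix
    scan.setFilter(new PrefixFilter(prefix));     // keep only rows that match it
    return scan;
  }
}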

J-D

On Wed, Oct 5, 2011 at 2:54 AM, Christopher Dorner
<ch...@gmail.com> wrote:
> Thank you for your help. I am using different Schemas because i want to
> compare them later on to their performance of retrieving RDF SPARQL query
> results.
>
> I try to explain it a bit better. Below i give a simplified code how i end
> up overwriting.
>
> I want to store RDF triples (Subject Predicate Object).
> Each line in the input file is a triple S P O
> e.g.
> Person A knows Person B
> Person A knows Person C
> Person B knows Person X
> Person C knows Person B
> Person D knows Person B
> Person E knows Person B
>
> The schema where i discovered this behaviour looks like:
>
> Object is the rowkey
> Predicate is the columnQualifier
> Subject is the column value
>
> Many different subjects can have the same object value for the same
> predicate. So with this Schema, i can end up with potentially many column
> values for the same "rowKey->ColumnQualifier".
>
> In the Example above:
> rowkey e.g. "Person B",
> ColumnQualifer: "knows"
> Column Values: (Person A, Person C, Person D, Person E)
>
> I thought i can simply use the timestamps as a "third" dimension (if i
> simplify "the look" of HBase Tables as a sort of Excel-Sheet Layout) for the
> cells.
> It would make it very easy to retrieve all subjects for a given object and
> predicate.
>
>
> I end up with overwriting using this simplified Mapper code:
>
> void map(LongWritable offset, Text value, Context context){
>
>  triple = Parser.parse(value);
>  Put put = new Put(triple.object);
>  Put.add(family, triple.predicate, triple.subject);
>  context.write(tableName, put);
>
> }
>
> It seems that the Mapper runs very fast (which is good), but sometimes
> creates a few Puts with the same timestamp for the same rowkey/column. Then
> the one inserted last overwrites the one already in. So in my example, the
> input of "Person E" could overwrite "Person D" and kick "Person D" out of my
> result list, which is very bad.
>
> I could try to use a Reducer and generate a potentially very large value of
> concatenated Subjects instead of many small ones. But that isn't a very good
> option either.
>
>
> Christopher
>
>
>
> Am 04.10.2011 23:57, schrieb Jean-Daniel Cryans:
>>
>> Maybe try a different schema yeah (hard to help without knowing
>> exactly how you end up overwriting the same triples all the time tho).
>>
>> Setting timestamps yourself is usually bad yes.
>>
>> J-D
>>
>> On Tue, Oct 4, 2011 at 7:14 AM, Christopher Dorner
>> <ch...@gmail.com>  wrote:
>>>
>>> Why do you advise against setting timestamps by oneself? Is it generally
>>> not
>>> a good practice?
>>>
>>> If i do not want to insert anymore data later, then it shouldn't be a
>>> problem. Of course i probably will have trouble if i want to insert
>>> something later (e.g. from another file, then the byte offset could be
>>> exactly the same and again overwrite my data). I didn't think about that
>>> yet.
>>>
>>> The thing is, that i do not want to loose data while inserting and i need
>>> to
>>> insert all of them. Maybe i could consider some different schema.
>>>
>>> I will try it with a reduce step, but i am pretty sure i will again have
>>> some loss of data.
>>>
>>> Thank you,
>>>
>>> Christopher
>>>
>>>
>>> Am 03.10.2011 20:31, schrieb Jean-Daniel Cryans:
>>>>
>>>> I would advise against setting the timestamps yourself and instead
>>>> reduce in order to prune the versions you don't need to insert in
>>>> HBase.
>>>>
>>>> J-D
>>>>
>>>> On Sat, Oct 1, 2011 at 11:05 AM, Christopher Dorner
>>>> <ch...@gmail.com>    wrote:
>>>>>
>>>>> Hi again,
>>>>>
>>>>> i think i solved my issue.
>>>>>
>>>>> I simply use the byte offset of the row currently read by the Mapper as
>>>>> the
>>>>> timestamp for the Put. This is unique for my input file, which contains
>>>>> one
>>>>> triple for each row. So the timestamps are unique.
>>>>>
>>>>> Regards,
>>>>> Christopher
>>>>>
>>>>>
>>>>> Am 01.10.2011 13:19, schrieb Christopher Dorner:
>>>>>>
>>>>>> Hallo,
>>>>>>
>>>>>> I am reading a File containing RDF triples in a Map-job. the RDF
>>>>>> triples
>>>>>> then are stored in a table, where columns can have lots of versions.
>>>>>> So i need to store many values for one rowKey in the same column.
>>>>>>
>>>>>> I made the observation, that reading the file is very fast and thus
>>>>>> some
>>>>>> values are put into the table with the same timestamp and therefore
>>>>>> overriding an existing value.
>>>>>>
>>>>>> How can i avoid that? The timestamps are not necessary for later
>>>>>> usage.
>>>>>>
>>>>>> Could i simply use some sort of custom counter?
>>>>>>
>>>>>> How would that work in fully distributed mode? I am working on
>>>>>> pseudo-distributed-mode for testing purpose right now.
>>>>>>
>>>>>> Thank You and Regards,
>>>>>> Christopher
>>>>>
>>>>>
>>>
>>>
>
>

Re: question about writing to columns with lots of versions in map task

Posted by Christopher Dorner <ch...@gmail.com>.
Thank you for your help. I am using different schemas because I want to
compare their performance later on when answering RDF SPARQL queries.

Let me try to explain it a bit better. Below is simplified code showing
how I end up overwriting values.

I want to store RDF triples (subject, predicate, object). Each line in
the input file is one triple "S P O", e.g.:
Person A knows Person B
Person A knows Person C
Person B knows Person X
Person C knows Person B
Person D knows Person B
Person E knows Person B

The schema where I discovered this behaviour looks like this:

Object is the rowkey
Predicate is the columnQualifier
Subject is the column value

Many different subjects can have the same object value for the same
predicate. So with this schema I can end up with potentially many column
values for the same rowKey -> columnQualifier combination.

In the example above:
rowkey: "Person B"
columnQualifier: "knows"
column values: (Person A, Person C, Person D, Person E)

I thought I could simply use the timestamps as a "third" dimension for
the cells (if I picture HBase tables as a sort of Excel-sheet layout).
That would make it very easy to retrieve all subjects for a given object
and predicate.
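
Reading them back would then look roughly like this (sketch, table and
family names are just examples; the family of course needs a large enough
VERSIONS setting):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadAllSubjects {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "triples");             // example table name
    byte[] family = Bytes.toBytes("rdf");                   // example family name
    Get get = new Get(Bytes.toBytes("Person B"));
    get.addColumn(family, Bytes.toBytes("knows"));
    get.setMaxVersions();                                   // all stored versions, not just the latest
    Result result = table.get(get);
    for (KeyValue kv : result.getColumn(family, Bytes.toBytes("knows"))) {
      System.out.println(Bytes.toString(kv.getValue()));    // Person A, Person C, Person D, Person E
    }
    table.close();
  }
}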


I end up overwriting values with this simplified Mapper code:

void map(LongWritable offset, Text value, Context context)
    throws IOException, InterruptedException {

  Triple triple = Parser.parse(value.toString());
  Put put = new Put(Bytes.toBytes(triple.object));
  put.add(family, Bytes.toBytes(triple.predicate),       // no explicit timestamp,
      Bytes.toBytes(triple.subject));                    // so the region server assigns "now"
  context.write(new ImmutableBytesWritable(put.getRow()), put);

}

The Mapper runs very fast (which is good), but it sometimes creates a few
Puts with the same timestamp for the same rowkey/column. The one inserted
last then overwrites the one that is already there. So in my example, the
input of "Person E" could overwrite "Person D" and kick "Person D" out of
my result list, which is very bad.

I could try to use a Reducer and generate one potentially very large
value of concatenated subjects instead of many small ones, but that isn't
a very good option either.
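
Such a reducer would look roughly like this (untested sketch; it assumes
the map step emits key = object + tab + predicate and value = subject,
and the family name is just a placeholder):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class ConcatSubjectsReducer
    extends TableReducer<Text, Text, ImmutableBytesWritable> {

  private static final byte[] FAMILY = Bytes.toBytes("rdf");   // placeholder family name

  @Override
  protected void reduce(Text objectAndPredicate, Iterable<Text> subjects, Context context)
      throws IOException, InterruptedException {
    // key looks like "object\tpredicate", emitted by the map step
    String[] parts = objectAndPredicate.toString().split("\t", 2);

    // concatenate all subjects into one (possibly very large) cell value
    StringBuilder all = new StringBuilder();
    for (Text subject : subjects) {
      if (all.length() > 0) {
        all.append(',');
      }
      all.append(subject.toString());
    }

    Put put = new Put(Bytes.toBytes(parts[0]));                // object is the rowkey
    put.add(FAMILY, Bytes.toBytes(parts[1]), Bytes.toBytes(all.toString()));
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}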


Christopher



On 04.10.2011 23:57, Jean-Daniel Cryans wrote:
> Maybe try a different schema yeah (hard to help without knowing
> exactly how you end up overwriting the same triples all the time tho).
>
> Setting timestamps yourself is usually bad yes.
>
> J-D
>
> On Tue, Oct 4, 2011 at 7:14 AM, Christopher Dorner
> <ch...@gmail.com>  wrote:
>> Why do you advise against setting timestamps by oneself? Is it generally not
>> a good practice?
>>
>> If i do not want to insert anymore data later, then it shouldn't be a
>> problem. Of course i probably will have trouble if i want to insert
>> something later (e.g. from another file, then the byte offset could be
>> exactly the same and again overwrite my data). I didn't think about that
>> yet.
>>
>> The thing is, that i do not want to loose data while inserting and i need to
>> insert all of them. Maybe i could consider some different schema.
>>
>> I will try it with a reduce step, but i am pretty sure i will again have
>> some loss of data.
>>
>> Thank you,
>>
>> Christopher
>>
>>
>> Am 03.10.2011 20:31, schrieb Jean-Daniel Cryans:
>>>
>>> I would advise against setting the timestamps yourself and instead
>>> reduce in order to prune the versions you don't need to insert in
>>> HBase.
>>>
>>> J-D
>>>
>>> On Sat, Oct 1, 2011 at 11:05 AM, Christopher Dorner
>>> <ch...@gmail.com>    wrote:
>>>>
>>>> Hi again,
>>>>
>>>> i think i solved my issue.
>>>>
>>>> I simply use the byte offset of the row currently read by the Mapper as
>>>> the
>>>> timestamp for the Put. This is unique for my input file, which contains
>>>> one
>>>> triple for each row. So the timestamps are unique.
>>>>
>>>> Regards,
>>>> Christopher
>>>>
>>>>
>>>> Am 01.10.2011 13:19, schrieb Christopher Dorner:
>>>>>
>>>>> Hallo,
>>>>>
>>>>> I am reading a File containing RDF triples in a Map-job. the RDF triples
>>>>> then are stored in a table, where columns can have lots of versions.
>>>>> So i need to store many values for one rowKey in the same column.
>>>>>
>>>>> I made the observation, that reading the file is very fast and thus some
>>>>> values are put into the table with the same timestamp and therefore
>>>>> overriding an existing value.
>>>>>
>>>>> How can i avoid that? The timestamps are not necessary for later usage.
>>>>>
>>>>> Could i simply use some sort of custom counter?
>>>>>
>>>>> How would that work in fully distributed mode? I am working on
>>>>> pseudo-distributed-mode for testing purpose right now.
>>>>>
>>>>> Thank You and Regards,
>>>>> Christopher
>>>>
>>>>
>>
>>


Re: question about writing to columns with lots of versions in map task

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Yeah, maybe try a different schema (it's hard to help without knowing
exactly how you end up overwriting the same triples all the time, though).

And yes, setting timestamps yourself is usually a bad idea.

J-D

On Tue, Oct 4, 2011 at 7:14 AM, Christopher Dorner
<ch...@gmail.com> wrote:
> Why do you advise against setting timestamps by oneself? Is it generally not
> a good practice?
>
> If i do not want to insert anymore data later, then it shouldn't be a
> problem. Of course i probably will have trouble if i want to insert
> something later (e.g. from another file, then the byte offset could be
> exactly the same and again overwrite my data). I didn't think about that
> yet.
>
> The thing is, that i do not want to loose data while inserting and i need to
> insert all of them. Maybe i could consider some different schema.
>
> I will try it with a reduce step, but i am pretty sure i will again have
> some loss of data.
>
> Thank you,
>
> Christopher
>
>
> Am 03.10.2011 20:31, schrieb Jean-Daniel Cryans:
>>
>> I would advise against setting the timestamps yourself and instead
>> reduce in order to prune the versions you don't need to insert in
>> HBase.
>>
>> J-D
>>
>> On Sat, Oct 1, 2011 at 11:05 AM, Christopher Dorner
>> <ch...@gmail.com>  wrote:
>>>
>>> Hi again,
>>>
>>> i think i solved my issue.
>>>
>>> I simply use the byte offset of the row currently read by the Mapper as
>>> the
>>> timestamp for the Put. This is unique for my input file, which contains
>>> one
>>> triple for each row. So the timestamps are unique.
>>>
>>> Regards,
>>> Christopher
>>>
>>>
>>> Am 01.10.2011 13:19, schrieb Christopher Dorner:
>>>>
>>>> Hallo,
>>>>
>>>> I am reading a File containing RDF triples in a Map-job. the RDF triples
>>>> then are stored in a table, where columns can have lots of versions.
>>>> So i need to store many values for one rowKey in the same column.
>>>>
>>>> I made the observation, that reading the file is very fast and thus some
>>>> values are put into the table with the same timestamp and therefore
>>>> overriding an existing value.
>>>>
>>>> How can i avoid that? The timestamps are not necessary for later usage.
>>>>
>>>> Could i simply use some sort of custom counter?
>>>>
>>>> How would that work in fully distributed mode? I am working on
>>>> pseudo-distributed-mode for testing purpose right now.
>>>>
>>>> Thank You and Regards,
>>>> Christopher
>>>
>>>
>
>

Re: question about writing to columns with lots of versions in map task

Posted by Christopher Dorner <ch...@gmail.com>.
Why do you advise against setting timestamps oneself? Is it generally
not good practice?

If I do not want to insert any more data later, then it shouldn't be a
problem. Of course I will probably run into trouble if I do want to insert
something later (e.g. from another file, where the byte offset could be
exactly the same and would again overwrite my data). I hadn't thought
about that yet.

The thing is that I do not want to lose data while inserting, and I need
to insert all of it. Maybe I should consider a different schema.

I will try it with a reduce step, but I am pretty sure I will again lose
some data.

Thank you,

Christopher


On 03.10.2011 20:31, Jean-Daniel Cryans wrote:
> I would advise against setting the timestamps yourself and instead
> reduce in order to prune the versions you don't need to insert in
> HBase.
>
> J-D
>
> On Sat, Oct 1, 2011 at 11:05 AM, Christopher Dorner
> <ch...@gmail.com>  wrote:
>> Hi again,
>>
>> i think i solved my issue.
>>
>> I simply use the byte offset of the row currently read by the Mapper as the
>> timestamp for the Put. This is unique for my input file, which contains one
>> triple for each row. So the timestamps are unique.
>>
>> Regards,
>> Christopher
>>
>>
>> Am 01.10.2011 13:19, schrieb Christopher Dorner:
>>>
>>> Hallo,
>>>
>>> I am reading a File containing RDF triples in a Map-job. the RDF triples
>>> then are stored in a table, where columns can have lots of versions.
>>> So i need to store many values for one rowKey in the same column.
>>>
>>> I made the observation, that reading the file is very fast and thus some
>>> values are put into the table with the same timestamp and therefore
>>> overriding an existing value.
>>>
>>> How can i avoid that? The timestamps are not necessary for later usage.
>>>
>>> Could i simply use some sort of custom counter?
>>>
>>> How would that work in fully distributed mode? I am working on
>>> pseudo-distributed-mode for testing purpose right now.
>>>
>>> Thank You and Regards,
>>> Christopher
>>
>>


Re: question about writing to columns with lots of versions in map task

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I would advise against setting the timestamps yourself; instead, use a
reduce step to prune the versions you don't need before inserting into
HBase.

J-D

On Sat, Oct 1, 2011 at 11:05 AM, Christopher Dorner
<ch...@gmail.com> wrote:
> Hi again,
>
> i think i solved my issue.
>
> I simply use the byte offset of the row currently read by the Mapper as the
> timestamp for the Put. This is unique for my input file, which contains one
> triple for each row. So the timestamps are unique.
>
> Regards,
> Christopher
>
>
> Am 01.10.2011 13:19, schrieb Christopher Dorner:
>>
>> Hallo,
>>
>> I am reading a File containing RDF triples in a Map-job. the RDF triples
>> then are stored in a table, where columns can have lots of versions.
>> So i need to store many values for one rowKey in the same column.
>>
>> I made the observation, that reading the file is very fast and thus some
>> values are put into the table with the same timestamp and therefore
>> overriding an existing value.
>>
>> How can i avoid that? The timestamps are not necessary for later usage.
>>
>> Could i simply use some sort of custom counter?
>>
>> How would that work in fully distributed mode? I am working on
>> pseudo-distributed-mode for testing purpose right now.
>>
>> Thank You and Regards,
>> Christopher
>
>

Re: question about writing to columns with lots of versions in map task

Posted by Christopher Dorner <ch...@gmail.com>.
Hi again,

I think I solved my issue.

I simply use the byte offset of the line currently being read by the
Mapper as the timestamp for the Put. The offset is unique within my input
file, which contains one triple per line, so the timestamps are unique.
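
In the Mapper this looks roughly like this (simplified; Triple and Parser
are my own classes from the earlier snippet, and the family name is just
a placeholder):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TripleMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("rdf");   // placeholder family name

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    Triple triple = Parser.parse(line.toString());             // my own parser class
    Put put = new Put(Bytes.toBytes(triple.object));
    put.add(FAMILY, Bytes.toBytes(triple.predicate),
        offset.get(),                                          // byte offset of the line as the version timestamp
        Bytes.toBytes(triple.subject));
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}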

Regards,
Christopher


On 01.10.2011 13:19, Christopher Dorner wrote:
> Hallo,
>
> I am reading a File containing RDF triples in a Map-job. the RDF triples
> then are stored in a table, where columns can have lots of versions.
> So i need to store many values for one rowKey in the same column.
>
> I made the observation, that reading the file is very fast and thus some
> values are put into the table with the same timestamp and therefore
> overriding an existing value.
>
> How can i avoid that? The timestamps are not necessary for later usage.
>
> Could i simply use some sort of custom counter?
>
> How would that work in fully distributed mode? I am working on
> pseudo-distributed-mode for testing purpose right now.
>
> Thank You and Regards,
> Christopher