Posted to user@accumulo.apache.org by Mario Pastorelli <ma...@teralytics.ch> on 2016/10/28 13:37:13 UTC

Bulk ingestion of different locality groups at different times

Hi,

I have a question about using bulk ingestion for a rather special case.
Let's say that I have the locality groups A and B. The values of each
locality group are written to Accumulo at different times, which means
that first we ingest all the cells of the group A and then of B. We use
Spark to ingest those records. Right now we write all the values with a
custom writer but we would like to create the rfiles directly with Spark.
In the case above, we would have two jobs creating the rfiles for the two
distinct locality groups. Is Accumulo able to import these files,
considering that they are two different locality groups, without triggering
a huge major compaction?  If not, what strategy would you suggest for the
above use case?

Thanks,
Mario

-- 
Mario Pastorelli | TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41794381682
email: mario.pastorelli@teralytics.ch
www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton
Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
de Vries

This e-mail message contains confidential information which is for the sole
attention and use of the intended recipient. Please notify us at once if
you think that it may not be intended for you and delete it immediately.

Re: Bulk ingestion of different locality groups at different times

Posted by Keith Turner <ke...@deenlo.com>.

On Fri, Oct 28, 2016 at 10:03 AM, Mario Pastorelli
<ma...@teralytics.ch> wrote:
> Thanks for the answers. About the huge major compaction, the question is not
> about when the major compaction will happen but about how big the major
> compaction of two bulk-loaded files will be. The rfiles will already be
> sorted and they will contain two different locality groups, and Accumulo
> stores locality groups separately on disk. The compaction should not do
> much here, just reuse the created groups, right?

Accumulo stores multiple locality groups in a single file.
Compactions make a pass over the input for each locality group.  The
following is a sketch of what compactions do.

inputIter = // an iterator over the files being compacted
outputRFile = // the tmp file the compaction is writing to

for (localityGroup : localityGroups) {
    // inclusive = true: read only the families in this locality group
    inputIter.seek(new Range(), localityGroup.getFamilies(), true)
    outputRFile.startLocalityGroup(localityGroup.getFamilies())

    // write inputIter to outputRFile
}

// write the default locality group
// inclusive = false: read all families not in a configured LG
inputIter.seek(new Range(), localityGroups.getAllFamilies(), false)
outputRFile.startDefaultLocalityGroup()

// write inputIter to outputRFile
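
For readers who want to run something, the sketch above can be mimicked with a toy model in plain Java. This is only an illustration under stated assumptions: CompactionSketch, Cell and compact are hypothetical names, not Accumulo classes, and the in-memory lists stand in for the input iterator and the output RFile. It shows one filtered pass per configured locality group, followed by a final pass for every family that belongs to no configured group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the per-locality-group passes; none of these types are Accumulo classes. */
final class CompactionSketch {

    /** A cell reduced to the two fields that matter for this illustration. */
    static final class Cell {
        final String row;
        final String family;

        Cell(String row, String family) {
            this.row = row;
            this.family = family;
        }

        @Override
        public String toString() {
            return row + ":" + family;
        }
    }

    /**
     * Returns the cells written to each section of the output, keyed by
     * locality group name. Families not listed in any configured group
     * end up under "default".
     */
    static Map<String, List<Cell>> compact(List<Cell> sortedInput,
                                           Map<String, Set<String>> localityGroups) {
        Map<String, List<Cell>> output = new LinkedHashMap<>();
        Set<String> allConfigured = new TreeSet<>();

        // One pass over the input per configured locality group.
        for (Map.Entry<String, Set<String>> lg : localityGroups.entrySet()) {
            allConfigured.addAll(lg.getValue());
            List<Cell> section = new ArrayList<>();
            for (Cell c : sortedInput) {
                if (lg.getValue().contains(c.family)) {   // inclusive filter
                    section.add(c);
                }
            }
            output.put(lg.getKey(), section);
        }

        // Final pass: every cell whose family is in no configured group.
        List<Cell> rest = new ArrayList<>();
        for (Cell c : sortedInput) {
            if (!allConfigured.contains(c.family)) {      // exclusive filter
                rest.add(c);
            }
        }
        output.put("default", rest);
        return output;
    }
}
```

Because each pass filters by family, the output sections keep the sort order of the input within every group.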


Re: Bulk ingestion of different locality groups at different times

Posted by Mario Pastorelli <ma...@teralytics.ch>.

Thanks for the answers. About the huge major compaction, the question is
not about when the major compaction will happen but about how big the
major compaction of two bulk-loaded files will be. The rfiles will
already be sorted and they will contain two different locality groups, and
Accumulo stores locality groups separately on disk. The compaction should
not do much here, just reuse the created groups, right?


Re: Bulk ingestion of different locality groups at different times

Posted by dl...@comcast.net.

>>> Is Accumulo able to import these files, considering that they are two different locality groups 

Yes. 

>>> without triggering a huge major compaction? 

Depends on your table.compaction.major.ratio and table.file.max settings. 


Sorry, not a real answer, but I think the answer is "it depends".
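
To make "it depends" a little more concrete, here is a rough sketch of how the compaction ratio is usually described in the Accumulo user manual: a set of files is compacted when the sum of their sizes is larger than the ratio multiplied by the size of the largest file in the set; otherwise the largest file is dropped from consideration and the check is repeated. RatioSketch and filesToCompact are hypothetical names, the sizes and the ratio of 3.0 are example values only, and the sketch ignores table.file.max (which caps the number of files per tablet) and everything else a real tablet server takes into account.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Simplified model of the compaction ratio check; not Accumulo code. */
final class RatioSketch {

    /**
     * Returns the sizes of the files that would be compacted together,
     * largest first, or an empty list if no set satisfies the ratio.
     */
    static List<Long> filesToCompact(List<Long> fileSizes, double ratio) {
        List<Long> candidates = new ArrayList<>(fileSizes);
        candidates.sort(Collections.reverseOrder());   // largest file first

        while (candidates.size() > 1) {
            long sum = 0;
            for (long size : candidates) {
                sum += size;
            }
            long largest = candidates.get(0);
            if (sum > ratio * largest) {
                return candidates;                     // this set qualifies
            }
            candidates.remove(0);                      // drop the largest, retry
        }
        return new ArrayList<>();                      // nothing to compact
    }
}
```

Under this model, two bulk-imported files of similar size do not qualify on their own with a ratio of 3.0, since their combined size is only about twice the size of the larger one. Whether that holds for a real table depends on the other files already in the tablet and on the actual settings.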


Re: Bulk ingestion of different locality groups at different times

Posted by Keith Turner <ke...@deenlo.com>.

Currently, locality groups cannot be configured for
AccumuloFileOutputFormat; it can only write to the default locality
group.  However, things can still work out nicely, because the default
locality group will track up to 1000 column families.  It will use
these tracked column families at scan time to determine if the file
should be used. The following is an example of this.

 * Locality group A has families 3, 4, 5.
 * Locality group B has families x, y, z.
 * Spark job 1 writes families y, z to RFile RF1.  This data will end
up in the default LG.  RF1 is imported to tablet T1.
 * Spark job 2 writes families 3, 5 to RFile RF2.  This data will end
up in the default LG.  RF2 is imported to tablet T1.
 * A scan comes in for tablet T1 for family 3.  The scan will examine
the family metadata of RF1 and RF2 and only use RF2.
 * If T1 compacts RF1 and RF2 into a single file RF3, that file will
have two locality groups.  When the compaction reads LG B it will only
read data from RF1.  When it reads LG A it will only read data from
RF2.
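
The file selection in the example above can be illustrated with a small stand-alone Java sketch. It is an assumption-laden toy, not how Accumulo is implemented: FamilyPruningSketch and filesForScan are hypothetical names, and a plain map from file name to family set stands in for the column family metadata that each RFile's default locality group tracks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of skipping files based on tracked column families; not Accumulo code. */
final class FamilyPruningSketch {

    /**
     * Returns the names of the files that a scan over scanFamilies has to
     * read: those whose tracked families overlap the requested ones.
     */
    static List<String> filesForScan(Map<String, Set<String>> familiesByFile,
                                     Set<String> scanFamilies) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<String>> file : familiesByFile.entrySet()) {
            // A file with no family in common with the scan can be skipped.
            if (!Collections.disjoint(file.getValue(), scanFamilies)) {
                selected.add(file.getKey());
            }
        }
        return selected;
    }
}
```

With RF1 tracking families y and z and RF2 tracking families 3 and 5, a scan for family 3 selects only RF2, which matches the example. The same overlap test explains why each compaction pass touches only one of the two input files.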
