Posted to user@accumulo.apache.org by Yamini Joshi <ya...@gmail.com> on 2016/10/11 21:24:52 UTC

Bulk import

Hello

I am trying to import data from a BSON file to a 3-node Accumulo cluster
using pyaccumulo. The BSON file is 4G and has a lot of records, all to be
stored in one table. I tried a very naive approach and used the pyaccumulo
batch writer to write to the table. After parsing some records, my master
became unresponsive and shut down, with the tserver threads stuck on a low
memory error. I am assuming that the records are created faster than the
proxy/master can handle. Is there any other way to go about it? I am
thinking of using bulk ingest, but I am not sure how exactly.
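
For reference, this is roughly what my naive write path looks like; a minimal
sketch assuming pymongo's bson module for parsing and the stock pyaccumulo
batch writer calls, with the host, table name, and column layout as placeholders:

    import bson  # from pymongo; decode_file_iter streams documents out of a .bson file
    from pyaccumulo import Accumulo, Mutation

    # Placeholder connection details for the Accumulo Thrift proxy
    conn = Accumulo(host="proxy-host", port=42424, user="root", password="secret")

    table = "records"  # hypothetical table name
    if not conn.table_exists(table):
        conn.create_table(table)

    writer = conn.create_batch_writer(table)

    with open("dump.bson", "rb") as f:
        for doc in bson.decode_file_iter(f):          # stream one document at a time
            m = Mutation(str(doc["_id"]))             # row id; assumes each document has _id
            m.put(cf="doc", cq="json", val=str(doc))  # column layout is illustrative only
            writer.add_mutation(m)

    writer.close()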

Best regards,
Yamini Joshi

Re: Bulk import

Posted by Yamini Joshi <ya...@gmail.com>.
Alright. I'll keep that in mind. The next step for me will be to import
data from 90G BSON files. I think that'll be a good start for bulk import.
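
Noting the rough shape for later: once the RFiles exist (generated offline, e.g.
by a MapReduce job using AccumuloFileOutputFormat), the load step through the
proxy looks like a single call. This is only a sketch, under the assumption that
pyaccumulo exposes the raw Thrift proxy client and login token as conn.client /
conn.login; the directory names are placeholders:

    from pyaccumulo import Accumulo

    conn = Accumulo(host="proxy-host", port=42424, user="root", password="secret")

    # The RFiles must already sit in HDFS (e.g. written by a MapReduce job with
    # AccumuloFileOutputFormat); bulk import only hands finished files to the tservers.
    import_dir = "/bulk/files"      # HDFS directory holding the RFiles (placeholder)
    failure_dir = "/bulk/failures"  # must exist and be empty; rejected files land here

    # Assumption: pyaccumulo exposes the underlying proxy client and login token; the
    # proxy API defines importDirectory(login, table, importDir, failureDir, setTime).
    conn.client.importDirectory(conn.login, "records", import_dir, failure_dir, False)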

Best regards,
Yamini Joshi

On Tue, Oct 11, 2016 at 10:14 PM, Josh Elser <jo...@gmail.com> wrote:

> Even 10G is a rather small amount of data. Setting up a bulk loading
> framework is a bit more complicated than it appears at first glance. Take
> your pick of course, but I probably wouldn't consider bulk loading unless
> you were regularly processing 10-100x that amount of data :)
>
>
> yamini.1691@gmail.com wrote:
>
>> The bulk import seemed to be a good option since the BSON file generated
>> about 10G of data. The problem with my code was that I wasn't releasing
>> memory, which eventually became the bottleneck.
>>
>> Sent from my iPhone
>>
>> On Oct 11, 2016, at 9:39 PM, Josh Elser<jo...@gmail.com>  wrote:
>>>
>>> For only 4GB of data, you don't need to do bulk ingest. That is serious
>>> overkill.
>>>
>>> I don't know why the master would have died/become unresponsive. It is
>>> minimally involved with the write-pipeline.
>>>
>>> Can you share your current accumulo-env.sh/accumulo-site.xml? Have you
>>> followed the Accumulo user manual to change the configuration to match the
>>> available resources you have on your 3 nodes where Accumulo is running?
>>>
>>> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_pre_splitting_new_tables
>>>
>>> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_native_map
>>>
>>> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_troubleshooting
>>>
>>> Yamini Joshi wrote:
>>>
>>>> Hello
>>>>
>>>> I am trying to import data from a BSON file to a 3-node Accumulo cluster
>>>> using pyaccumulo. The BSON file is 4G and has a lot of records, all to
>>>> be stored in one table. I tried a very naive approach and used the
>>>> pyaccumulo batch writer to write to the table. After parsing some
>>>> records, my master became unresponsive and shut down, with the tserver
>>>> threads stuck on a low memory error. I am assuming that the records are
>>>> created faster than the proxy/master can handle. Is there any other
>>>> way to go about it? I am thinking of using bulk ingest, but I am not
>>>> sure how exactly.
>>>>
>>>> Best regards,
>>>> Yamini Joshi
>>>>
>>>

Re: Bulk import

Posted by Josh Elser <jo...@gmail.com>.
Even 10G is a rather small amount of data. Setting up a bulk loading 
framework is a bit more complicated than it appears at first glance. 
Take your pick of course, but I probably wouldn't consider bulk loading 
unless you were regularly processing 10-100x that amount of data :)

yamini.1691@gmail.com wrote:
> The bulk import seemed to be a good option since the BSON file generated about 10G of data. The problem with my code was that I wasn't releasing memory, which eventually became the bottleneck.
>
> Sent from my iPhone
>
>> On Oct 11, 2016, at 9:39 PM, Josh Elser<jo...@gmail.com>  wrote:
>>
>> For only 4GB of data, you don't need to do bulk ingest. That is serious overkill.
>>
>> I don't know why the master would have died/become unresponsive. It is minimally involved with the write-pipeline.
>>
>> Can you share your current accumulo-env.sh/accumulo-site.xml? Have you followed the Accumulo user manual to change the configuration to match the available resources you have on your 3 nodes where Accumulo is running?
>>
>> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_pre_splitting_new_tables
>>
>> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_native_map
>>
>> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_troubleshooting
>>
>> Yamini Joshi wrote:
>>> Hello
>>>
>>> I am trying to import data from a BSON file to a 3-node Accumulo cluster
>>> using pyaccumulo. The BSON file is 4G and has a lot of records, all to
>>> be stored in one table. I tried a very naive approach and used the
>>> pyaccumulo batch writer to write to the table. After parsing some
>>> records, my master became unresponsive and shut down, with the tserver
>>> threads stuck on a low memory error. I am assuming that the records are
>>> created faster than the proxy/master can handle. Is there any other
>>> way to go about it? I am thinking of using bulk ingest, but I am not
>>> sure how exactly.
>>>
>>> Best regards,
>>> Yamini Joshi

Re: Bulk import

Posted by ya...@gmail.com.
The bulk import seemed to be a good option since the BSON file generated about 10G of data. The problem with my code was that I wasn't releasing memory, which eventually became the bottleneck.
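
Roughly, the fix was to stream the BSON file and push mutations out in bounded
batches instead of holding everything in the client. A sketch of that pattern,
where the max_memory/latency_ms/threads keyword names and the writer's flush()
are my assumptions about pyaccumulo's API, and everything else is a placeholder:

    import bson
    from pyaccumulo import Accumulo, Mutation

    conn = Accumulo(host="proxy-host", port=42424, user="root", password="secret")

    # Assumed keyword arguments; they cap how much the batch writer buffers client-side.
    writer = conn.create_batch_writer("records", max_memory=10 * 1024 * 1024,
                                      latency_ms=30000, threads=4)

    BATCH = 10000  # push mutations out every N documents instead of holding them all
    count = 0

    with open("dump.bson", "rb") as f:
        for doc in bson.decode_file_iter(f):  # stream documents; never load the whole file
            m = Mutation(str(doc["_id"]))
            m.put(cf="doc", cq="json", val=str(doc))
            writer.add_mutation(m)
            count += 1
            if count % BATCH == 0:
                writer.flush()  # assumption: the writer exposes flush(); else close and reopen

    writer.close()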

Sent from my iPhone

> On Oct 11, 2016, at 9:39 PM, Josh Elser <jo...@gmail.com> wrote:
> 
> For only 4GB of data, you don't need to do bulk ingest. That is serious overkill.
> 
> I don't know why the master would have died/become unresponsive. It is minimally involved with the write-pipeline.
> 
> Can you share your current accumulo-env.sh/accumulo-site.xml? Have you followed the Accumulo user manual to change the configuration to match the available resources you have on your 3 nodes where Accumulo is running?
> 
> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_pre_splitting_new_tables
> 
> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_native_map
> 
> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_troubleshooting
> 
> Yamini Joshi wrote:
>> Hello
>> 
>> I am trying to import data from a BSON file to a 3-node Accumulo cluster
>> using pyaccumulo. The BSON file is 4G and has a lot of records, all to
>> be stored in one table. I tried a very naive approach and used the
>> pyaccumulo batch writer to write to the table. After parsing some
>> records, my master became unresponsive and shut down, with the tserver
>> threads stuck on a low memory error. I am assuming that the records are
>> created faster than the proxy/master can handle. Is there any other
>> way to go about it? I am thinking of using bulk ingest, but I am not
>> sure how exactly.
>> 
>> Best regards,
>> Yamini Joshi

Re: Bulk import

Posted by Josh Elser <jo...@gmail.com>.
For only 4GB of data, you don't need to do bulk ingest. That is serious 
overkill.

I don't know why the master would have died/become unresponsive. It is 
minimally involved with the write-pipeline.

Can you share your current accumulo-env.sh/accumulo-site.xml? Have you 
followed the Accumulo user manual to change the configuration to match 
the available resources you have on your 3 nodes where Accumulo is running?

http://accumulo.apache.org/1.7/accumulo_user_manual.html#_pre_splitting_new_tables

http://accumulo.apache.org/1.7/accumulo_user_manual.html#_native_map

http://accumulo.apache.org/1.7/accumulo_user_manual.html#_troubleshooting
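
On the pre-splitting link, doing it from the client side looks roughly like the
following. Just a sketch, assuming pyaccumulo wraps the proxy's addSplits call
as add_splits; the split points are purely illustrative and should come from
your actual row-key distribution:

    from pyaccumulo import Accumulo

    conn = Accumulo(host="proxy-host", port=42424, user="root", password="secret")

    table = "records"  # placeholder table name
    if not conn.table_exists(table):
        conn.create_table(table)

    # Illustrative split points; in practice derive them from your row-key distribution
    # so ingest spreads across all three tservers instead of hammering a single tablet.
    splits = ["b", "d", "f", "h", "j", "l", "n", "p", "r", "t", "v", "x"]

    # Assumption: pyaccumulo exposes the proxy's addSplits RPC as add_splits(table, splits)
    conn.add_splits(table, splits)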

Yamini Joshi wrote:
> Hello
>
> I am trying to import data from a BSON file to a 3-node Accumulo cluster
> using pyaccumulo. The BSON file is 4G and has a lot of records, all to
> be stored in one table. I tried a very naive approach and used the
> pyaccumulo batch writer to write to the table. After parsing some
> records, my master became unresponsive and shut down, with the tserver
> threads stuck on a low memory error. I am assuming that the records are
> created faster than the proxy/master can handle. Is there any other
> way to go about it? I am thinking of using bulk ingest, but I am not
> sure how exactly.
>
> Best regards,
> Yamini Joshi