Posted to user@geode.apache.org by Amit Pandey <am...@gmail.com> on 2017/03/03 17:13:47 UTC

fastest way to bulk insert in geode

Hey Guys,

What's the fastest way to do a bulk insert into a region?

I am using region.putAll; is there an alternative or faster API?

regards

Re: fastest way to bulk insert in geode

Posted by Lyndon Adams <ly...@gmail.com>.
Or you could hire me onto your team, as I specialise in this area, i.e. high-performance ingest and stream processing. To give you an idea, the platform I am chief architect of consumes north of 2 billion events per day.


> On 6 Mar 2017, at 16:35, Michael Stolz <ms...@pivotal.io> wrote:
> 
> Of course if you're REALLY in need of speed you can write your own custom implementations of toData and fromData for the DataSerializable Interface. 
> 
> I haven't seen anyone need that much speed in a long time though.
> 
> 
> 
> --
> 
> Mike Stolz
> Principal Engineer - Gemfire Product Manager
> Mobile: 631-835-4771 <tel:(631)%20835-4771>
> 
> On Mar 3, 2017 11:16 PM, "Real Wes" <therealwes@outlook.com> wrote:
> Amit,
> 
>  
> 
> John and Mike’s advice about tradeoffs is worth heeding. You’ll find that your speed is probably just fine with putAll, but if you just have to have NOS in your tank, you might consider - since you’re inside a function - doing the putAll from the function into your region but changing the region scope to distributed-no-ack.  See: https://geode.apache.org/docs/guide/developing/distributed_regions/choosing_level_of_dist.html
>  
> 
> Wes
> 
>  
> 
> From: Amit Pandey [mailto:amit.pandey2103@gmail.com]
> Sent: Friday, March 3, 2017 12:26 PM
> To: user@geode.apache.org
> Subject: Re: fastest way to bulk insert in geode
> 
>  
> 
> Hey John ,
> 
>  
> 
> Thanks, I am planning to use Spring XD. But my current use case is that I am aggregating and doing some computation in a Function, and then I want to populate the region with the values. I have a map; is region.putAll the fastest?
> 
>  
> 
> Regards
> 
>  
> 
> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jblum@pivotal.io> wrote:
> 
> You might consider using the Snapshot service <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1] if you previously had data in a Region of another Cluster (for instance).
> 
>  
> 
> If the data is coming externally, then Spring XD <http://projects.spring.io/spring-xd/> [2] is a great tool for moving (streaming) data from a source <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3] to a sink <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].  It also allows you to perform all manner of transformations/conversions, trigger events, and so on and so forth.
> 
>  
> 
> -j
> 
>  
> 
>  
> 
> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html
> [2] http://projects.spring.io/spring-xd/
> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources
> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks
>  
> 
>  
> 
> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <amit.pandey2103@gmail.com> wrote:
> 
> Hey Guys,
> 
>  
> 
> Whats the fastest way to do bulk insert in a region?
> 
>  
> 
> I am using region.putAll , is there any alternative/faster API?
> 
>  
> 
> regards
> 
> 
> 
> 
>  
> 
> --
> 
> -John
> 
> john.blum10101 (skype)
> 
>  
> 
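To make the toData/fromData suggestion quoted above concrete, here is a minimal sketch. The Trade class and its fields are hypothetical, not code from this thread. In GemFire/Geode the class would additionally declare `implements com.gemstone.gemfire.DataSerializable`; the two methods have exactly these `DataOutput`/`DataInput` signatures, so the sketch uses plain java.io to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical domain class. In GemFire/Geode it would implement
// DataSerializable and be registered with an Instantiator; the method
// signatures below are the same either way.
public class Trade {
    public String symbol;
    public int quantity;
    public double price;

    public Trade() {} // no-arg constructor required for deserialization

    public Trade(String symbol, int quantity, double price) {
        this.symbol = symbol;
        this.quantity = quantity;
        this.price = price;
    }

    // hand-written serialization: write each field in a fixed order
    public void toData(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeInt(quantity);
        out.writeDouble(price);
    }

    // read the fields back in exactly the same order they were written
    public void fromData(DataInput in) throws IOException {
        symbol = in.readUTF();
        quantity = in.readInt();
        price = in.readDouble();
    }

    public static void main(String[] args) throws IOException {
        Trade t = new Trade("AAPL", 100, 187.50);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        t.toData(new DataOutputStream(bytes));

        Trade copy = new Trade();
        copy.fromData(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.symbol + " " + copy.quantity + " " + copy.price);
    }
}
```

The speedup comes from skipping reflection and per-field metadata: only the raw field bytes go over the wire, at the cost of keeping write and read order manually in sync.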


Re: fastest way to bulk insert in geode

Posted by Amit Pandey <am...@gmail.com>.
Thanks, this looks interesting.

On Tue, Mar 7, 2017 at 2:28 AM, Luke Shannon <ls...@pivotal.io> wrote:

> [snip: quoted copy of Luke Shannon's message, reproduced in full in the next post]

Re: fastest way to bulk insert in geode

Posted by Luke Shannon <ls...@pivotal.io>.
I did something similar to Jake and Lyndon's suggestions. It was very fast
and scaled well as members were added. It returned a summary object that
the client app could use to report on the results of the ingest (basically
ensuring all files were ingested by some member). This required the names of
the files to be the keys of the objects for it to work.

Here is the code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import com.gemstone.gemfire.LogWriter;
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.Declarable;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.execute.FunctionAdapter;
import com.gemstone.gemfire.cache.execute.FunctionContext;
import com.gemstone.gemfire.cache.execute.RegionFunctionContext;
import com.gemstone.gemfire.cache.partition.PartitionRegionHelper;
import com.gemstone.gemfire.distributed.DistributedMember;
import com.google.gson.Gson;

/**
 * Reads bulk JSON files into GemFire in parallel across the cluster.
 * Each member loads only the files whose keys it hosts as primary.
 *
 * @author lshannon
 */
public class DataLoadFunction extends FunctionAdapter implements Declarable {

    public static final String ID = DataLoadFunction.class.getSimpleName();

    private static final long serialVersionUID = -7759261808685094980L;

    private LogWriter logger;
    private DistributedMember member;
    private String backUpDirectory;

    @Override
    public void execute(FunctionContext context) {
        if (context.getArguments() == null) {
            context.getResultSender().lastResult(
                "Must provide the location of the data folder when executing the function.");
            return; // without this return, lastResult would be sent a second time below
        }
        Cache cache = CacheFactory.getAnyInstance();
        this.member = cache.getDistributedSystem().getDistributedMember();
        logger = cache.getDistributedSystem().getLogWriter();
        Object[] arg = (Object[]) context.getArguments();
        backUpDirectory = (String) arg[0];
        RegionFunctionContext rfc = (RegionFunctionContext) context;
        try {
            context.getResultSender().lastResult(loadSegments(rfc.getDataSet()));
        } catch (Exception e) {
            context.getResultSender().lastResult(e);
        }
    }

    /**
     * Passes through the folder of JSON files. If the key, which is the name
     * of the file, would be primary on this node, the file is loaded by this
     * member into the cluster. Otherwise it is ignored and will be picked up
     * by the correct member.
     */
    @SuppressWarnings("unchecked")
    private String loadSegments(@SuppressWarnings("rawtypes") Region region) {
        logger.info("Started loading segments from: " + backUpDirectory);
        // summary of the loading process
        int totalSegments = 0, loadedSegments = 0, skippedSegments = 0;
        long startTime = System.currentTimeMillis();
        File segments = new File(backUpDirectory);
        String[] files = segments.list();
        logger.info("Loading from: " + backUpDirectory + ", " + files.length
            + " files to process");
        Gson gson = new Gson();
        for (int i = 0; i < files.length; i++) {
            if (files[i].endsWith(".json")) {
                BufferedReader br = null;
                try {
                    // the name of the file is the key
                    String key = files[i].substring(0, files[i].indexOf("."));
                    // this is an entry, but may not be one for this server
                    totalSegments++;
                    // getPrimaryMemberForKey returns the member that would hold the
                    // primary copy for this key; if that is the member running the
                    // function, we do the put, otherwise the file is skipped
                    if (this.member.equals(
                            PartitionRegionHelper.getPrimaryMemberForKey(region, key))) {
                        // read the file
                        br = new BufferedReader(new FileReader(new File(segments, files[i])));
                        // deserialize an array of Segment objects (the domain class for
                        // the JSON payload, defined elsewhere), see:
                        // http://stackoverflow.com/questions/3763937/gson-and-deserializing-an-array-of-objects-with-arrays-in-it
                        Segment[] segmentValue = gson.fromJson(br, Segment[].class);
                        region.put(key, segmentValue);
                        loadedSegments++;
                    } else {
                        skippedSegments++;
                    }
                } catch (IOException e) {
                    this.logger.error(e);
                } finally {
                    // clean up
                    if (br != null) {
                        try {
                            br.close();
                        } catch (IOException e) {
                            this.logger.error(e);
                        }
                    }
                }
            }
        }
        long endTime = System.currentTimeMillis();
        // return the summary
        LoadingSummary loadingSummary = new LoadingSummary(member.toString(), startTime,
            endTime, totalSegments, skippedSegments, loadedSegments);
        logger.info("Loading complete: " + loadingSummary);
        return loadingSummary.toString();
    }

    @Override
    public String getId() {
        return ID;
    }

    @Override
    public boolean hasResult() {
        return true;
    }

    @Override
    public boolean isHA() {
        return true;
    }

    @Override
    public boolean optimizeForWrite() {
        return true;
    }

    @Override
    public void init(Properties arg0) {
    }

    /**
     * Convenience class for storing the results of a segment load operation.
     *
     * @author lshannon
     */
    class LoadingSummary {

        private String memberName;
        private long startTime;
        private long endTime;
        private int totalSegments;
        private int segmentsSkipped;
        private int segmentsLoaded;

        public LoadingSummary(String memberName, long startTime, long endTime,
                int totalSegments, int segmentsSkipped, int segmentsLoaded) {
            this.memberName = memberName;
            this.startTime = startTime;
            this.endTime = endTime;
            this.totalSegments = totalSegments;
            this.segmentsSkipped = segmentsSkipped;
            this.segmentsLoaded = segmentsLoaded;
        }

        public String getMemberName() {
            return memberName;
        }

        public long getStartTime() {
            return startTime;
        }

        public long getEndTime() {
            return endTime;
        }

        public int getTotalSegments() {
            return totalSegments;
        }

        public int getSegmentsSkipped() {
            return segmentsSkipped;
        }

        public int getSegmentsLoaded() {
            return segmentsLoaded;
        }

        @Override
        public String toString() {
            return "LoadingSummary [memberName=" + memberName + ", startTime=" + startTime
                + ", endTime=" + endTime + ", totalSegments=" + totalSegments
                + ", segmentsSkipped=" + segmentsSkipped
                + ", segmentsLoaded=" + segmentsLoaded + "]";
        }
    }
}
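The trick that makes the function above scale - every member walks the same file list but loads only the keys it would own as primary - can be illustrated with a plain-Java stand-in. The modulo hash below is only an illustration; Geode's bucket assignment is more involved, and PartitionRegionHelper.getPrimaryMemberForKey is the real API used above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for "each member loads only the keys it owns as primary".
// Ownership is decided by a deterministic hash, so every member can run
// the same loop yet each file is loaded by exactly one member.
public class PrimaryFilterSketch {

    // deterministic, non-negative key -> member mapping (illustrative only)
    static int ownerOf(String key, int memberCount) {
        return Math.floorMod(key.hashCode(), memberCount);
    }

    public static void main(String[] args) {
        List<String> files = List.of("seg1.json", "seg2.json", "seg3.json", "seg4.json");
        int members = 3;
        Map<Integer, List<String>> loadedBy = new HashMap<>();

        // every "member" runs the same loop, skipping keys it does not own
        for (int me = 0; me < members; me++) {
            for (String file : files) {
                String key = file.substring(0, file.indexOf('.'));
                if (ownerOf(key, members) == me) {
                    loadedBy.computeIfAbsent(me, k -> new ArrayList<>()).add(file);
                }
            }
        }

        // every file was loaded by exactly one member
        int total = loadedBy.values().stream().mapToInt(List::size).sum();
        System.out.println("files loaded across members: " + total); // 4
    }
}
```

Because the put happens on the member that already hosts the primary bucket, no extra network hop is needed to route each entry, which is where the speedup over a client-side putAll comes from.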

On Mon, Mar 6, 2017 at 3:42 PM, Amit Pandey <am...@gmail.com>
wrote:

> Hey Lyndon,
>
> Poor dev here, can't hire you. Not in that kind of position :)
>
> Hey Jake,
>
> Makes sense. Will try your approach, with DataSerializable.
>
> Hi Charlie,
>
> Okay, yes, I understand the GC needs to be tuned. Also, I currently use
> bulk sizes: I put 500 items, then clear the bulk data, then fill up 500
> again and retry. Using DataSerializable with this approach should be
> helpful, I guess.
>
> Thanks everyone, I will be trying things out and update you guys
>
> On Tue, Mar 7, 2017 at 12:48 AM, Lyndon Adams <ly...@gmail.com>
> wrote:
>
>> Oh my god Charlie, you are taking my money-making opportunities away from
>> me. Basically he is right, plus you've got to add some black GC magic into
>> the mix to optimise pauses.
>>
>>
>> On 6 Mar 2017, at 18:57, Charlie Black <cb...@pivotal.io> wrote:
>>
>> putAll() is the bulk operation for Geode. Plain and simple.
>>
>> The other techniques outlined in this thread are all efforts to go really
>> fast by separating concerns at multiple levels, or by taking advantage of
>> the fact that there are other systems and CPUs in the physical
>> architecture.
>>
>> Example: the GC comment - creating the domain objects sometimes causes GC
>> pressure, which reduces throughput. I typically look at bulk sizes to
>> reduce that concern.
>>
>> Consider all suggestions then profile your options and choose the right
>> pattern for your app.
>>
>> Regards,
>> Charlie
>>
>> ---
>> Charlie Black
>> 858.480.9722 <(858)%20480-9722> | cblack@pivotal.io
>>
>> On Mar 6, 2017, at 10:42 AM, Amit Pandey <am...@gmail.com>
>> wrote:
>>
>> Hey Jake,
>>
>> Thanks. I am a bit confused - so a put should be faster than putAll?
>>
>> John,
>>
>> I need to set up all the data so that it can be queried, so I don't think a
>> CacheLoader works for me. The data is the result of very large and
>> expensive computations, and doing them dynamically would be costly.
>>
>> We have a time window to set up the system because after that some other
>> jobs will start. Currently it's taking 2.4 seconds to insert 30,000 entries,
>> which is great. But I am just trying to see if it can be made faster.
>>
>> Regards
>>
>> On Tue, Mar 7, 2017 at 12:01 AM, John Blum <jb...@pivotal.io> wrote:
>>
>>> Amit-
>>>
>>> Note, a CacheLoader does not necessarily imply "loading data from a
>>> database"; it can load data from any [external] data source and does so on
>>> demand (i.e. lazily, on a cache miss).  However, as Mike points out, this
>>> might not work for your Use Case in situations where you are querying, for
>>> instance.
>>>
>>> I guess the real question here is, what is the requirement to pre-load
>>> this data quickly?  What is the driving requirement here?
>>>
>>> For instance, is the need to be able to bring another system online
>>> quickly in case of "failover".  If so, perhaps an architectural change is
>>> more appropriate such as an Active/Passive arch (using WAN).
>>>
>>> -j
>>>
>>>
>>>
>>> On Mon, Mar 6, 2017 at 9:45 AM, Amit Pandey <am...@gmail.com>
>>> wrote:
>>>
>>>> We might need that, actually. The problem is we can't use a CacheLoader
>>>> because we are not loading from a database, so we have to use putAll. It's
>>>> taking 2 seconds for over 30,000 entries. If implementing it brings that
>>>> down, that will be helpful.
>>>> On 06-Mar-2017 10:05 pm, "Michael Stolz" <ms...@pivotal.io> wrote:
>>>>
>>>>> Of course if you're REALLY in need of speed you can write your own
>>>>> custom implementations of toData and fromData for the DataSerializable
>>>>> Interface.
>>>>>
>>>>> I haven't seen anyone need that much speed in a long time though.
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Mike Stolz
>>>>> Principal Engineer - Gemfire Product Manager
>>>>> Mobile: 631-835-4771 <(631)%20835-4771>
>>>>>
>>>>> On Mar 3, 2017 11:16 PM, "Real Wes" <th...@outlook.com> wrote:
>>>>>
>>>>>> Amit,
>>>>>>
>>>>>>
>>>>>>
>>>>>> John and Mike’s advice about tradeoffs is worth heeding. You’ll find
>>>>>> that your speed is probably just fine with putAll but if you just have to
>>>>>> have NOS in your tank, you might consider - since you’re inside a function
>>>>>> - do the putAll from the function into your region but change the region
>>>>>> scope to distributed-no-ack.  See: https://geode.apache.org/docs/
>>>>>> guide/developing/distributed_regions/choosing_level_of_dist.html
>>>>>>
>>>>>>
>>>>>>
>>>>>> Wes
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Amit Pandey [mailto:amit.pandey2103@gmail.com]
>>>>>> *Sent:* Friday, March 3, 2017 12:26 PM
>>>>>> *To:* user@geode.apache.org
>>>>>> *Subject:* Re: fastest way to bulk insert in geode
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hey John ,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks I am planning to use Spring XD. But my current usecase is that
>>>>>> I am aggregating and doing some computes in a Function and then I want to
>>>>>> populate it with the values I have a map , is region.putAll the fastest?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>>>>>>
>>>>>> You might consider using the Snapshot service
>>>>>> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
>>>>>> if you previously had data in a Region of another Cluster (for instance).
>>>>>>
>>>>>>
>>>>>>
>>>>>> If the data is coming externally, then *Spring XD
>>>>>> <http://projects.spring.io/spring-xd/> *[2] is a great tool for
>>>>>> moving (streaming) data from a source
>>>>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
>>>>>> to a sink
>>>>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
>>>>>> It also allows you to perform all manners of transformations/conversions,
>>>>>> trigger events, and so and so forth.
>>>>>>
>>>>>>
>>>>>>
>>>>>> -j
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_sn
>>>>>> apshots/chapter_overview.html
>>>>>>
>>>>>> [2] http://projects.spring.io/spring-xd/
>>>>>>
>>>>>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>>>>> ence/html/#sources
>>>>>>
>>>>>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>>>>> ence/html/#sinks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <
>>>>>> amit.pandey2103@gmail.com> wrote:
>>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Whats the fastest way to do bulk insert in a region?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am using region.putAll , is there any alternative/faster API?
>>>>>>
>>>>>>
>>>>>>
>>>>>> regards
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> -John
>>>>>>
>>>>>> john.blum10101 (skype)
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>> --
>>> -John
>>> john.blum10101 (skype)
>>>
>>
>>
>>
>>
>


-- 
Luke Shannon | Platform Engineering | Pivotal
-------------------------------------------------------------------------

Mobile:416-571-9495
Join the Toronto Pivotal Usergroup:
http://www.meetup.com/Toronto-Pivotal-User-Group/

Re: fastest way to bulk insert in geode

Posted by Amit Pandey <am...@gmail.com>.
Hey Lyndon,

Poor dev here, can't hire you. Not in that kind of position :)

Hey Jake,

Makes sense. Will try your approach, with DataSerializable.

Hi Charlie,

Okay, yes, I understand the GC needs to be tuned. Also, I currently use
bulk sizes: I put 500 items, then clear the bulk data, then fill up 500
again and retry. Using DataSerializable with this approach should be
helpful, I guess.

Thanks everyone, I will be trying things out and update you guys
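The 500-item batching described above can be sketched like this. A HashMap stands in for the Region so the snippet is self-contained; with Geode you would call the same putAll on a Region, and the batch size of 500 is just the figure mentioned in this thread, not a recommendation.

```java
import java.util.HashMap;
import java.util.Map;

// Chunked bulk load: buffer entries and flush every BATCH_SIZE puts,
// so each putAll is one bounded batch rather than one huge map.
public class ChunkedPutAll {

    static final int BATCH_SIZE = 500;

    public static void main(String[] args) {
        Map<String, String> region = new HashMap<>(); // stand-in for a Geode Region
        Map<String, String> buffer = new HashMap<>();

        for (int i = 0; i < 1250; i++) {
            buffer.put("key-" + i, "value-" + i);
            if (buffer.size() >= BATCH_SIZE) {
                region.putAll(buffer); // in Geode: one bulk operation per batch
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) { // flush the final partial batch
            region.putAll(buffer);
            buffer.clear();
        }

        System.out.println(region.size()); // 1250
    }
}
```

Reusing and clearing one buffer map also keeps allocation steady, which ties in with Charlie's point about GC pressure from object creation.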

On Tue, Mar 7, 2017 at 12:48 AM, Lyndon Adams <ly...@gmail.com>
wrote:

> Oh my god Charlie, you are taking my money-making opportunities away from
> me. Basically he is right, plus you've got to add some black GC magic into
> the mix to optimise pauses.
>
>
> On 6 Mar 2017, at 18:57, Charlie Black <cb...@pivotal.io> wrote:
>
> putAll() is the bulk operation for Geode. Plain and simple.
>
> The other techniques outlined in this thread are all efforts to go really
> fast by separating concerns at multiple levels, or by taking advantage of
> the fact that there are other systems and CPUs in the physical
> architecture.
>
> Example: the GC comment - creating the domain objects sometimes causes GC
> pressure, which reduces throughput. I typically look at bulk sizes to
> reduce that concern.
>
> Consider all suggestions then profile your options and choose the right
> pattern for your app.
>
> Regards,
> Charlie
>
> ---
> Charlie Black
> 858.480.9722 | cblack@pivotal.io
>
> On Mar 6, 2017, at 10:42 AM, Amit Pandey <am...@gmail.com>
> wrote:
>
> Hey Jake,
>
> Thanks. I am a bot confused so a put should be faster than putAll ?
>
> John,
>
> I need to setup all data so that they can be queried.  So I don't think
> CacheLoader works for me. Those data are the results of a very large and
> expensive computations and doing them dynamically will be costly.
>
> We have a time window to setup the system because after that some other
> jobs will start. Currently its taking 2.4 seconds to insert 30,000 data and
> its great.  But I am just trying to optimize if it can be made faster.
>
> Regards
>
> On Tue, Mar 7, 2017 at 12:01 AM, John Blum <jb...@pivotal.io> wrote:
>
>> Amit-
>>
>> Note, a CacheLoader does not necessarily imply "loading data from a
>> database"; it can load data from any [external] data source and does so on
>> demand (i.e. lazily, on a cache miss).  However, as Mike points out, this
>> might not work for your Use Case in situations where you are querying, for
>> instance.
>>
>> I guess the real question here is, what is the requirement to pre-load
>> this data quickly?  What is the driving requirement here?
>>
>> For instance, is the need to be able to bring another system online
>> quickly in case of "failover".  If so, perhaps an architectural change is
>> more appropriate such as an Active/Passive arch (using WAN).
>>
>> -j
>>
>>
>>
>> On Mon, Mar 6, 2017 at 9:45 AM, Amit Pandey <am...@gmail.com>
>> wrote:
>>
>>> We might need that actually...problem is we cant use dataloader because
>>> we are not loading from database. So we have to use putall. Its taking 2
>>> seconds for over 30000 data. If implenting it will bring it down that will
>>> be helpful.
>>> On 06-Mar-2017 10:05 pm, "Michael Stolz" <ms...@pivotal.io> wrote:
>>>
>>>> Of course if you're REALLY in need of speed you can write your own
>>>> custom implementations of toData and fromData for the DataSerializable
>>>> Interface.
>>>>
>>>> I haven't seen anyone need that much speed in a long time though.
>>>>
>>>>
>>>> --
>>>>
>>>> Mike Stolz
>>>> Principal Engineer - Gemfire Product Manager
>>>> Mobile: 631-835-4771 <(631)%20835-4771>
>>>>
>>>> On Mar 3, 2017 11:16 PM, "Real Wes" <th...@outlook.com> wrote:
>>>>
>>>>> Amit,
>>>>>
>>>>>
>>>>>
>>>>> John and Mike’s advice about tradeoffs is worth heeding. You’ll find
>>>>> that your speed is probably just fine with putAll but if you just have to
>>>>> have NOS in your tank, you might consider - since you’re inside a function
>>>>> - do the putAll from the function into your region but change the region
>>>>> scope to distributed-no-ack.  See: https://geode.apache.org/docs/
>>>>> guide/developing/distributed_regions/choosing_level_of_dist.html
>>>>>
>>>>>
>>>>>
>>>>> Wes
>>>>>
>>>>>
>>>>>
>>>>> *From:* Amit Pandey [mailto:amit.pandey2103@gmail.com]
>>>>> *Sent:* Friday, March 3, 2017 12:26 PM
>>>>> *To:* user@geode.apache.org
>>>>> *Subject:* Re: fastest way to bulk insert in geode
>>>>>
>>>>>
>>>>>
>>>>> Hey John ,
>>>>>
>>>>>
>>>>>
>>>>> Thanks I am planning to use Spring XD. But my current usecase is that
>>>>> I am aggregating and doing some computes in a Function and then I want to
>>>>> populate it with the values I have a map , is region.putAll the fastest?
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>>>>>
>>>>> You might consider using the Snapshot service
>>>>> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
>>>>> if you previously had data in a Region of another Cluster (for instance).
>>>>>
>>>>>
>>>>>
>>>>> If the data is coming externally, then *Spring XD
>>>>> <http://projects.spring.io/spring-xd/> *[2] is a great tool for
>>>>> moving (streaming) data from a source
>>>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
>>>>> to a sink
>>>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
>>>>> It also allows you to perform all manners of transformations/conversions,
>>>>> trigger events, and so and so forth.
>>>>>
>>>>>
>>>>>
>>>>> -j
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_sn
>>>>> apshots/chapter_overview.html
>>>>>
>>>>> [2] http://projects.spring.io/spring-xd/
>>>>>
>>>>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>>>> ence/html/#sources
>>>>>
>>>>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>>>> ence/html/#sinks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hey Guys,
>>>>>
>>>>>
>>>>>
>>>>> Whats the fastest way to do bulk insert in a region?
>>>>>
>>>>>
>>>>>
>>>>> I am using region.putAll , is there any alternative/faster API?
>>>>>
>>>>>
>>>>>
>>>>> regards
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> -John
>>>>>
>>>>> john.blum10101 (skype)
>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>> --
>> -John
>> john.blum10101 (skype)
>>
>
>
>
>

Re: fastest way to bulk insert in geode

Posted by Lyndon Adams <ly...@gmail.com>.
Oh my god Charlie, you are taking my money-making opportunities away from me. Basically he is right, plus you have to add some black GC magic into the mix to optimise pauses.


> On 6 Mar 2017, at 18:57, Charlie Black <cb...@pivotal.io> wrote:
> 
> putAll() is the bulk operation for geode.   Plain and simple.
> 
> The other techniques outlined in this thread are all efforts to go really fast by separating concerns at multiple levels.   Or taking advantage of the fact there are other system and CPUs that are in the physical architecture.  
> 
> Example: The GC comment - when creating the domain objects sometimes that causes GC pressure which reduces throughput.   I typically look at bulk sizes to reduce that concern.   
> 
> Consider all suggestions then profile your options and choose the right pattern for your app.  
> 
> Regards,
> Charlie
> 
> ---
> Charlie Black
> 858.480.9722 | cblack@pivotal.io <ma...@pivotal.io>
> 
>> On Mar 6, 2017, at 10:42 AM, Amit Pandey <amit.pandey2103@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hey Jake,
>> 
>> Thanks. I am a bot confused so a put should be faster than putAll ?
>> 
>> John,
>> 
>> I need to setup all data so that they can be queried.  So I don't think CacheLoader works for me. Those data are the results of a very large and expensive computations and doing them dynamically will be costly.
>> 
>> We have a time window to setup the system because after that some other jobs will start. Currently its taking 2.4 seconds to insert 30,000 data and its great.  But I am just trying to optimize if it can be made faster.
>> 
>> Regards
>> 
>> On Tue, Mar 7, 2017 at 12:01 AM, John Blum <jblum@pivotal.io <ma...@pivotal.io>> wrote:
>> Amit-
>> 
>> Note, a CacheLoader does not necessarily imply "loading data from a database"; it can load data from any [external] data source and does so on demand (i.e. lazily, on a cache miss).  However, as Mike points out, this might not work for your Use Case in situations where you are querying, for instance.
>> 
>> I guess the real question here is, what is the requirement to pre-load this data quickly?  What is the driving requirement here?
>> 
>> For instance, is the need to be able to bring another system online quickly in case of "failover".  If so, perhaps an architectural change is more appropriate such as an Active/Passive arch (using WAN).
>> 
>> -j
>> 
>> 
>> 
>> On Mon, Mar 6, 2017 at 9:45 AM, Amit Pandey <amit.pandey2103@gmail.com <ma...@gmail.com>> wrote:
>> We might need that actually...problem is we cant use dataloader because we are not loading from database. So we have to use putall. Its taking 2 seconds for over 30000 data. If implenting it will bring it down that will be helpful.
>> 
>> On 06-Mar-2017 10:05 pm, "Michael Stolz" <mstolz@pivotal.io <ma...@pivotal.io>> wrote:
>> Of course if you're REALLY in need of speed you can write your own custom implementations of toData and fromData for the DataSerializable Interface. 
>> 
>> I haven't seen anyone need that much speed in a long time though.
>> 
>> 
>> 
>> --
>> 
>> Mike Stolz
>> Principal Engineer - Gemfire Product Manager
>> Mobile: 631-835-4771 <tel:(631)%20835-4771>
>> 
>> On Mar 3, 2017 11:16 PM, "Real Wes" <therealwes@outlook.com <ma...@outlook.com>> wrote:
>> Amit,
>> 
>>  
>> 
>> John and Mike’s advice about tradeoffs is worth heeding. You’ll find that your speed is probably just fine with putAll but if you just have to have NOS in your tank, you might consider - since you’re inside a function - do the putAll from the function into your region but change the region scope to distributed-no-ack.  See:  <>https://geode.apache.org/docs/guide/developing/distributed_regions/choosing_level_of_dist.html <https://geode.apache.org/docs/guide/developing/distributed_regions/choosing_level_of_dist.html>
>>  
>> 
>> Wes
>> 
>>  
>> 
>> From: Amit Pandey [mailto:amit.pandey2103@gmail.com <ma...@gmail.com>] 
>> Sent: Friday, March 3, 2017 12:26 PM
>> To: user@geode.apache.org <ma...@geode.apache.org>
>> Subject: Re: fastest way to bulk insert in geode
>> 
>>  
>> 
>> Hey John ,
>> 
>>  
>> 
>> Thanks I am planning to use Spring XD. But my current usecase is that I am aggregating and doing some computes in a Function and then I want to populate it with the values I have a map , is region.putAll the fastest?
>> 
>>  
>> 
>> Regards
>> 
>>  
>> 
>> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jblum@pivotal.io <ma...@pivotal.io>> wrote:
>> 
>> You might consider using the Snapshot service <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1] if you previously had data in a Region of another Cluster (for instance).
>> 
>>  
>> 
>> If the data is coming externally, then Spring XD <http://projects.spring.io/spring-xd/> [2] is a great tool for moving (streaming) data from a source <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3] to a sink <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].  It also allows you to perform all manners of transformations/conversions, trigger events, and so and so forth.
>> 
>>  
>> 
>> -j
>> 
>>  
>> 
>>  
>> 
>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html>
>> [2] http://projects.spring.io/spring-xd/ <http://projects.spring.io/spring-xd/>
>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources>
>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks>
>>  
>> 
>>  
>> 
>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <amit.pandey2103@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hey Guys,
>> 
>>  
>> 
>> Whats the fastest way to do bulk insert in a region?
>> 
>>  
>> 
>> I am using region.putAll , is there any alternative/faster API?
>> 
>>  
>> 
>> regards
>> 
>> 
>> 
>> 
>>  
>> 
>> --
>> 
>> -John
>> 
>> john.blum10101 (skype)
>> 
>>  
>> 
>> 
>> 
>> 
>> -- 
>> -John
>> john.blum10101 (skype)
>> 
> 


Re: fastest way to bulk insert in geode

Posted by Charlie Black <cb...@pivotal.io>.
putAll() is the bulk operation for Geode. Plain and simple.

The other techniques outlined in this thread are all efforts to go really fast by separating concerns at multiple levels, or by taking advantage of the fact that there are other systems and CPUs in the physical architecture.

Example: the GC comment. Creating the domain objects sometimes causes GC pressure, which reduces throughput. I typically look at bulk sizes to reduce that concern.

Consider all suggestions, then profile your options and choose the right pattern for your app.

Regards,
Charlie

---
Charlie Black
858.480.9722 | cblack@pivotal.io

> On Mar 6, 2017, at 10:42 AM, Amit Pandey <am...@gmail.com> wrote:
> 
> Hey Jake,
> 
> Thanks. I am a bot confused so a put should be faster than putAll ?
> 
> John,
> 
> I need to setup all data so that they can be queried.  So I don't think CacheLoader works for me. Those data are the results of a very large and expensive computations and doing them dynamically will be costly.
> 
> We have a time window to setup the system because after that some other jobs will start. Currently its taking 2.4 seconds to insert 30,000 data and its great.  But I am just trying to optimize if it can be made faster.
> 
> Regards
> 
> On Tue, Mar 7, 2017 at 12:01 AM, John Blum <jblum@pivotal.io <ma...@pivotal.io>> wrote:
> Amit-
> 
> Note, a CacheLoader does not necessarily imply "loading data from a database"; it can load data from any [external] data source and does so on demand (i.e. lazily, on a cache miss).  However, as Mike points out, this might not work for your Use Case in situations where you are querying, for instance.
> 
> I guess the real question here is, what is the requirement to pre-load this data quickly?  What is the driving requirement here?
> 
> For instance, is the need to be able to bring another system online quickly in case of "failover".  If so, perhaps an architectural change is more appropriate such as an Active/Passive arch (using WAN).
> 
> -j
> 
> 
> 
> On Mon, Mar 6, 2017 at 9:45 AM, Amit Pandey <amit.pandey2103@gmail.com <ma...@gmail.com>> wrote:
> We might need that actually...problem is we cant use dataloader because we are not loading from database. So we have to use putall. Its taking 2 seconds for over 30000 data. If implenting it will bring it down that will be helpful.
> 
> On 06-Mar-2017 10:05 pm, "Michael Stolz" <mstolz@pivotal.io <ma...@pivotal.io>> wrote:
> Of course if you're REALLY in need of speed you can write your own custom implementations of toData and fromData for the DataSerializable Interface. 
> 
> I haven't seen anyone need that much speed in a long time though.
> 
> 
> 
> --
> 
> Mike Stolz
> Principal Engineer - Gemfire Product Manager
> Mobile: 631-835-4771 <tel:(631)%20835-4771>
> 
> On Mar 3, 2017 11:16 PM, "Real Wes" <therealwes@outlook.com <ma...@outlook.com>> wrote:
> Amit,
> 
>  
> 
> John and Mike’s advice about tradeoffs is worth heeding. You’ll find that your speed is probably just fine with putAll but if you just have to have NOS in your tank, you might consider - since you’re inside a function - do the putAll from the function into your region but change the region scope to distributed-no-ack.  See:  <>https://geode.apache.org/docs/guide/developing/distributed_regions/choosing_level_of_dist.html <https://geode.apache.org/docs/guide/developing/distributed_regions/choosing_level_of_dist.html>
>  
> 
> Wes
> 
>  
> 
> From: Amit Pandey [mailto:amit.pandey2103@gmail.com <ma...@gmail.com>] 
> Sent: Friday, March 3, 2017 12:26 PM
> To: user@geode.apache.org <ma...@geode.apache.org>
> Subject: Re: fastest way to bulk insert in geode
> 
>  
> 
> Hey John ,
> 
>  
> 
> Thanks I am planning to use Spring XD. But my current usecase is that I am aggregating and doing some computes in a Function and then I want to populate it with the values I have a map , is region.putAll the fastest?
> 
>  
> 
> Regards
> 
>  
> 
> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jblum@pivotal.io <ma...@pivotal.io>> wrote:
> 
> You might consider using the Snapshot service <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1] if you previously had data in a Region of another Cluster (for instance).
> 
>  
> 
> If the data is coming externally, then Spring XD <http://projects.spring.io/spring-xd/> [2] is a great tool for moving (streaming) data from a source <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3] to a sink <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].  It also allows you to perform all manners of transformations/conversions, trigger events, and so and so forth.
> 
>  
> 
> -j
> 
>  
> 
>  
> 
> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html>
> [2] http://projects.spring.io/spring-xd/ <http://projects.spring.io/spring-xd/>
> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources>
> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks>
>  
> 
>  
> 
> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <amit.pandey2103@gmail.com <ma...@gmail.com>> wrote:
> 
> Hey Guys,
> 
>  
> 
> Whats the fastest way to do bulk insert in a region?
> 
>  
> 
> I am using region.putAll , is there any alternative/faster API?
> 
>  
> 
> regards
> 
> 
> 
> 
>  
> 
> --
> 
> -John
> 
> john.blum10101 (skype)
> 
>  
> 
> 
> 
> 
> -- 
> -John
> john.blum10101 (skype)
> 


Re: fastest way to bulk insert in geode

Posted by Jacob Barrett <jb...@pivotal.io>.
On Mon, Mar 6, 2017 at 10:42 AM Amit Pandey <am...@gmail.com>
wrote:

> Thanks. I am a bot confused so a put should be faster than putAll ?
>

In the scenario where you are processing the import on each server, there is
very little saving from doing putAll, and more overhead from keeping an
internal buffer of objects to putAll in batches.

-Jake

Re: fastest way to bulk insert in geode

Posted by Amit Pandey <am...@gmail.com>.
Hey Jake,

Thanks. I am a bit confused: so a put should be faster than putAll?

John,

I need to set up all the data so that it can be queried, so I don't think
CacheLoader works for me. The data are the results of very large and
expensive computations, and producing them dynamically would be costly.

We have a time window to set up the system because after that some other
jobs will start. Currently it takes 2.4 seconds to insert 30,000 entries,
which is great, but I am just trying to see if it can be made faster.

Regards

On Tue, Mar 7, 2017 at 12:01 AM, John Blum <jb...@pivotal.io> wrote:

> Amit-
>
> Note, a CacheLoader does not necessarily imply "loading data from a
> database"; it can load data from any [external] data source and does so on
> demand (i.e. lazily, on a cache miss).  However, as Mike points out, this
> might not work for your Use Case in situations where you are querying, for
> instance.
>
> I guess the real question here is, what is the requirement to pre-load
> this data quickly?  What is the driving requirement here?
>
> For instance, is the need to be able to bring another system online
> quickly in case of "failover".  If so, perhaps an architectural change is
> more appropriate such as an Active/Passive arch (using WAN).
>
> -j
>
>
>
> On Mon, Mar 6, 2017 at 9:45 AM, Amit Pandey <am...@gmail.com>
> wrote:
>
>> We might need that actually...problem is we cant use dataloader because
>> we are not loading from database. So we have to use putall. Its taking 2
>> seconds for over 30000 data. If implenting it will bring it down that will
>> be helpful.
>> On 06-Mar-2017 10:05 pm, "Michael Stolz" <ms...@pivotal.io> wrote:
>>
>>> Of course if you're REALLY in need of speed you can write your own
>>> custom implementations of toData and fromData for the DataSerializable
>>> Interface.
>>>
>>> I haven't seen anyone need that much speed in a long time though.
>>>
>>>
>>> --
>>>
>>> Mike Stolz
>>> Principal Engineer - Gemfire Product Manager
>>> Mobile: 631-835-4771 <(631)%20835-4771>
>>>
>>> On Mar 3, 2017 11:16 PM, "Real Wes" <th...@outlook.com> wrote:
>>>
>>>> Amit,
>>>>
>>>>
>>>>
>>>> John and Mike’s advice about tradeoffs is worth heeding. You’ll find
>>>> that your speed is probably just fine with putAll but if you just have to
>>>> have NOS in your tank, you might consider - since you’re inside a function
>>>> - do the putAll from the function into your region but change the region
>>>> scope to distributed-no-ack.  See: https://geode.apache.org/docs/
>>>> guide/developing/distributed_regions/choosing_level_of_dist.html
>>>>
>>>>
>>>>
>>>> Wes
>>>>
>>>>
>>>>
>>>> *From:* Amit Pandey [mailto:amit.pandey2103@gmail.com]
>>>> *Sent:* Friday, March 3, 2017 12:26 PM
>>>> *To:* user@geode.apache.org
>>>> *Subject:* Re: fastest way to bulk insert in geode
>>>>
>>>>
>>>>
>>>> Hey John ,
>>>>
>>>>
>>>>
>>>> Thanks I am planning to use Spring XD. But my current usecase is that I
>>>> am aggregating and doing some computes in a Function and then I want to
>>>> populate it with the values I have a map , is region.putAll the fastest?
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>>
>>>>
>>>> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>>>>
>>>> You might consider using the Snapshot service
>>>> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
>>>> if you previously had data in a Region of another Cluster (for instance).
>>>>
>>>>
>>>>
>>>> If the data is coming externally, then *Spring XD
>>>> <http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
>>>> (streaming) data from a source
>>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
>>>> to a sink
>>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
>>>> It also allows you to perform all manners of transformations/conversions,
>>>> trigger events, and so and so forth.
>>>>
>>>>
>>>>
>>>> -j
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_sn
>>>> apshots/chapter_overview.html
>>>>
>>>> [2] http://projects.spring.io/spring-xd/
>>>>
>>>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>>> ence/html/#sources
>>>>
>>>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>>> ence/html/#sinks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
>>>> wrote:
>>>>
>>>> Hey Guys,
>>>>
>>>>
>>>>
>>>> Whats the fastest way to do bulk insert in a region?
>>>>
>>>>
>>>>
>>>> I am using region.putAll , is there any alternative/faster API?
>>>>
>>>>
>>>>
>>>> regards
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> -John
>>>>
>>>> john.blum10101 (skype)
>>>>
>>>>
>>>>
>>>
>
>
> --
> -John
> john.blum10101 (skype)
>

Re: fastest way to bulk insert in geode

Posted by John Blum <jb...@pivotal.io>.
Amit-

Note, a CacheLoader does not necessarily imply "loading data from a
database"; it can load data from any [external] data source and does so on
demand (i.e. lazily, on a cache miss).  However, as Mike points out, this
might not work for your Use Case in situations where you are querying, for
instance.
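A minimal sketch of such a computing CacheLoader (the value computation here is a stand-in, not anything from this thread):

```java
import org.apache.geode.cache.CacheLoader;
import org.apache.geode.cache.CacheLoaderException;
import org.apache.geode.cache.LoaderHelper;

// Invoked by Geode only on a cache miss; the returned value is stored in
// the region before being handed back to the caller of region.get(key).
public class ComputingLoader implements CacheLoader<String, Double> {

    @Override
    public Double load(LoaderHelper<String, Double> helper)
            throws CacheLoaderException {
        return compute(helper.getKey());
    }

    // Stand-in for the real (expensive) computation.
    static Double compute(String key) {
        return (double) key.hashCode();
    }

    public void close() {} // release any resources held by the loader
}
```

The loader is registered on the region's attributes, so it fills entries lazily rather than pre-loading them.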

I guess the real question here is, what is the requirement to pre-load this
data quickly?  What is the driving requirement here?

For instance, is the need to be able to bring another system online quickly
in case of failover? If so, perhaps an architectural change is more
appropriate, such as an Active/Passive architecture (using WAN).

-j



On Mon, Mar 6, 2017 at 9:45 AM, Amit Pandey <am...@gmail.com>
wrote:

> We might need that actually...problem is we cant use dataloader because we
> are not loading from database. So we have to use putall. Its taking 2
> seconds for over 30000 data. If implenting it will bring it down that will
> be helpful.
> On 06-Mar-2017 10:05 pm, "Michael Stolz" <ms...@pivotal.io> wrote:
>
>> Of course if you're REALLY in need of speed you can write your own custom
>> implementations of toData and fromData for the DataSerializable Interface.
>>
>> I haven't seen anyone need that much speed in a long time though.
>>
>>
>> --
>>
>> Mike Stolz
>> Principal Engineer - Gemfire Product Manager
>> Mobile: 631-835-4771 <(631)%20835-4771>
>>
>> On Mar 3, 2017 11:16 PM, "Real Wes" <th...@outlook.com> wrote:
>>
>>> Amit,
>>>
>>>
>>>
>>> John and Mike’s advice about tradeoffs is worth heeding. You’ll find
>>> that your speed is probably just fine with putAll but if you just have to
>>> have NOS in your tank, you might consider - since you’re inside a function
>>> - do the putAll from the function into your region but change the region
>>> scope to distributed-no-ack.  See: https://geode.apache.org/docs/
>>> guide/developing/distributed_regions/choosing_level_of_dist.html
>>>
>>>
>>>
>>> Wes
>>>
>>>
>>>
>>> *From:* Amit Pandey [mailto:amit.pandey2103@gmail.com]
>>> *Sent:* Friday, March 3, 2017 12:26 PM
>>> *To:* user@geode.apache.org
>>> *Subject:* Re: fastest way to bulk insert in geode
>>>
>>>
>>>
>>> Hey John ,
>>>
>>>
>>>
>>> Thanks I am planning to use Spring XD. But my current usecase is that I
>>> am aggregating and doing some computes in a Function and then I want to
>>> populate it with the values I have a map , is region.putAll the fastest?
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>>
>>> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>>>
>>> You might consider using the Snapshot service
>>> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
>>> if you previously had data in a Region of another Cluster (for instance).
>>>
>>>
>>>
>>> If the data is coming externally, then *Spring XD
>>> <http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
>>> (streaming) data from a source
>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
>>> to a sink
>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
>>> It also allows you to perform all manners of transformations/conversions,
>>> trigger events, and so and so forth.
>>>
>>>
>>>
>>> -j
>>>
>>>
>>>
>>>
>>>
>>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_sn
>>> apshots/chapter_overview.html
>>>
>>> [2] http://projects.spring.io/spring-xd/
>>>
>>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>> ence/html/#sources
>>>
>>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>> ence/html/#sinks
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
>>> wrote:
>>>
>>> Hey Guys,
>>>
>>>
>>>
>>> Whats the fastest way to do bulk insert in a region?
>>>
>>>
>>>
>>> I am using region.putAll , is there any alternative/faster API?
>>>
>>>
>>>
>>> regards
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> -John
>>>
>>> john.blum10101 (skype)
>>>
>>>
>>>
>>


-- 
-John
john.blum10101 (skype)

Re: fastest way to bulk insert in geode

Posted by Jacob Barrett <jb...@pivotal.io>.
Split your data into multiple files, one for each Geode server. Write a
distributed function that executes on each server. This function would open
the file, parse it, and do local put()s (not putAll). You will have some
data shuffle, but the ingest is fast. I have done something similar for a
POC that had millions of rows with thousands of columns each. The key will
be to avoid producing lots of garbage when parsing text files into native
types, such as string-to-int conversions; the pressure on the GC will kill
your throughput. If you can export your data in an efficient binary form
and write a custom importer, you can avoid this, or else use highly
optimized text-to-native-type parsers.
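A rough sketch of such a function, assuming a "key,value" text format and a region named "imported" (both made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;

// Deployed to every server and executed on all members; each member
// parses its own local file and issues plain put()s.
public class LocalFileImporter implements Function {

    // "key,value" per line; hypothetical format for this sketch.
    static String[] parseLine(String line) {
        int comma = line.indexOf(',');
        return new String[] { line.substring(0, comma), line.substring(comma + 1) };
    }

    @Override
    public void execute(FunctionContext context) {
        String path = (String) context.getArguments(); // server-local file path
        Region<String, String> region =
            CacheFactory.getAnyInstance().getRegion("imported");
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] kv = parseLine(line);
                region.put(kv[0], kv[1]); // each entry routes to its owning member
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        context.getResultSender().lastResult("done");
    }

    @Override public String getId() { return "LocalFileImporter"; }
    @Override public boolean hasResult() { return true; }
    @Override public boolean optimizeForWrite() { return true; }
    @Override public boolean isHA() { return false; }
}
```

The put() on a partitioned region still routes each entry to its owning member, which is the "data shuffle" mentioned above.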

-Jake


On Mon, Mar 6, 2017 at 9:52 AM Amit Pandey <am...@gmail.com>
wrote:

> We might need that actually...problem is we cant use dataloader because we
> are not loading from database. So we have to use putall. Its taking 2
> seconds for over 30000 data. If implenting it will bring it down that will
> be helpful.
> On 06-Mar-2017 10:05 pm, "Michael Stolz" <ms...@pivotal.io> wrote:
>
> Of course if you're REALLY in need of speed you can write your own custom
> implementations of toData and fromData for the DataSerializable Interface.
>
> I haven't seen anyone need that much speed in a long time though.
>
>
> --
>
> Mike Stolz
> Principal Engineer - Gemfire Product Manager
> Mobile: 631-835-4771 <(631)%20835-4771>
>
> On Mar 3, 2017 11:16 PM, "Real Wes" <th...@outlook.com> wrote:
>
> Amit,
>
>
>
> John and Mike’s advice about tradeoffs is worth heeding. You’ll find that
> your speed is probably just fine with putAll but if you just have to have
> NOS in your tank, you might consider - since you’re inside a function - do
> the putAll from the function into your region but change the region scope
> to distributed-no-ack.  See:
> https://geode.apache.org/docs/guide/developing/distributed_regions/choosing_level_of_dist.html
>
>
>
> Wes
>
>
>
> *From:* Amit Pandey [mailto:amit.pandey2103@gmail.com]
> *Sent:* Friday, March 3, 2017 12:26 PM
> *To:* user@geode.apache.org
> *Subject:* Re: fastest way to bulk insert in geode
>
>
>
> Hey John ,
>
>
>
> Thanks I am planning to use Spring XD. But my current usecase is that I am
> aggregating and doing some computes in a Function and then I want to
> populate it with the values I have a map , is region.putAll the fastest?
>
>
>
> Regards
>
>
>
> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>
> You might consider using the Snapshot service
> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
> if you previously had data in a Region of another Cluster (for instance).
>
>
>
> If the data is coming externally, then *Spring XD
> <http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
> (streaming) data from a source
> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
> to a sink
> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
> It also allows you to perform all manners of transformations/conversions,
> trigger events, and so and so forth.
>
>
>
> -j
>
>
>
>
>
> [1]
> http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html
>
> [2] http://projects.spring.io/spring-xd/
>
> [3]
> http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources
>
> [4]
> http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks
>
>
>
>
>
> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
> wrote:
>
> Hey Guys,
>
>
>
> Whats the fastest way to do bulk insert in a region?
>
>
>
> I am using region.putAll , is there any alternative/faster API?
>
>
>
> regards
>
>
>
>
>
> --
>
> -John
>
> john.blum10101 (skype)
>
>
>
>

RE: fastest way to bulk insert in geode

Posted by Amit Pandey <am...@gmail.com>.
We might need that, actually. The problem is we can't use a CacheLoader
because we are not loading from a database, so we have to use putAll. It's
taking 2 seconds for over 30,000 entries. If implementing it will bring
that down, that will be helpful.
On 06-Mar-2017 10:05 pm, "Michael Stolz" <ms...@pivotal.io> wrote:

> Of course if you're REALLY in need of speed you can write your own custom
> implementations of toData and fromData for the DataSerializable Interface.
>
> I haven't seen anyone need that much speed in a long time though.
>
>
> --
>
> Mike Stolz
> Principal Engineer - Gemfire Product Manager
> Mobile: 631-835-4771 <(631)%20835-4771>
>
> On Mar 3, 2017 11:16 PM, "Real Wes" <th...@outlook.com> wrote:
>
>> Amit,
>>
>>
>>
>> John and Mike’s advice about tradeoffs is worth heeding. You’ll find that
>> your speed is probably just fine with putAll but if you just have to have
>> NOS in your tank, you might consider - since you’re inside a function - do
>> the putAll from the function into your region but change the region scope
>> to distributed-no-ack.  See: https://geode.apache.org/docs/
>> guide/developing/distributed_regions/choosing_level_of_dist.html
>>
>>
>>
>> Wes
>>
>>
>>
>> *From:* Amit Pandey [mailto:amit.pandey2103@gmail.com]
>> *Sent:* Friday, March 3, 2017 12:26 PM
>> *To:* user@geode.apache.org
>> *Subject:* Re: fastest way to bulk insert in geode
>>
>>
>>
>> Hey John ,
>>
>>
>>
>> Thanks I am planning to use Spring XD. But my current usecase is that I
>> am aggregating and doing some computes in a Function and then I want to
>> populate it with the values I have a map , is region.putAll the fastest?
>>
>>
>>
>> Regards
>>
>>
>>
>> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>>
>> You might consider using the Snapshot service
>> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
>> if you previously had data in a Region of another Cluster (for instance).
>>
>>
>>
>> If the data is coming externally, then *Spring XD
>> <http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
>> (streaming) data from a source
>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
>> to a sink
>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
>> It also allows you to perform all manners of transformations/conversions,
>> trigger events, and so and so forth.
>>
>>
>>
>> -j
>>
>>
>>
>>
>>
>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_sn
>> apshots/chapter_overview.html
>>
>> [2] http://projects.spring.io/spring-xd/
>>
>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>> ence/html/#sources
>>
>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>> ence/html/#sinks
>>
>>
>>
>>
>>
>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
>> wrote:
>>
>> Hey Guys,
>>
>>
>>
>> Whats the fastest way to do bulk insert in a region?
>>
>>
>>
>> I am using region.putAll , is there any alternative/faster API?
>>
>>
>>
>> regards
>>
>>
>>
>>
>>
>> --
>>
>> -John
>>
>> john.blum10101 (skype)
>>
>>
>>
>

RE: fastest way to bulk insert in geode

Posted by Michael Stolz <ms...@pivotal.io>.
Of course if you're REALLY in need of speed you can write your own custom
implementations of toData and fromData for the DataSerializable Interface.

I haven't seen anyone need that much speed in a long time though.
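As a rough sketch of what that hand-rolled serialization looks like: Geode's DataSerializable contract is toData(DataOutput) / fromData(DataInput), where you write and read each field explicitly instead of relying on reflection-based Java serialization. The example below uses only java.io so it stands alone; the Trade class and its fields are hypothetical, not from this thread, and in real Geode code the class would implement org.apache.geode.DataSerializable.

```java
import java.io.*;

// Hypothetical domain class showing the shape of DataSerializable's
// toData/fromData methods: fields are written and read explicitly,
// in the same order, which is what makes it fast.
class Trade {
    String symbol;
    int quantity;
    double price;

    Trade() {}
    Trade(String symbol, int quantity, double price) {
        this.symbol = symbol; this.quantity = quantity; this.price = price;
    }

    // Equivalent of DataSerializable.toData(DataOutput)
    void toData(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeInt(quantity);
        out.writeDouble(price);
    }

    // Equivalent of DataSerializable.fromData(DataInput)
    void fromData(DataInput in) throws IOException {
        symbol = in.readUTF();
        quantity = in.readInt();
        price = in.readDouble();
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        Trade original = new Trade("AAPL", 100, 139.78);

        // Round-trip through a byte array, the way Geode would over the wire
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.toData(new DataOutputStream(bytes));

        Trade copy = new Trade();
        copy.fromData(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.symbol + " " + copy.quantity + " " + copy.price);
    }
}
```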


--

Mike Stolz
Principal Engineer - Gemfire Product Manager
Mobile: 631-835-4771 <(631)%20835-4771>

On Mar 3, 2017 11:16 PM, "Real Wes" <th...@outlook.com> wrote:

> Amit,
>
>
>
> John and Mike’s advice about tradeoffs is worth heeding. You’ll find that
> your speed is probably just fine with putAll but if you just have to have
> NOS in your tank, you might consider - since you’re inside a function - do
> the putAll from the function into your region but change the region scope
> to distributed-no-ack.  See: https://geode.apache.org/docs/
> guide/developing/distributed_regions/choosing_level_of_dist.html
>
>
>
> Wes
>
>
>
> *From:* Amit Pandey [mailto:amit.pandey2103@gmail.com]
> *Sent:* Friday, March 3, 2017 12:26 PM
> *To:* user@geode.apache.org
> *Subject:* Re: fastest way to bulk insert in geode
>
>
>
> Hey John ,
>
>
>
> Thanks I am planning to use Spring XD. But my current usecase is that I am
> aggregating and doing some computes in a Function and then I want to
> populate it with the values I have a map , is region.putAll the fastest?
>
>
>
> Regards
>
>
>
> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>
> You might consider using the Snapshot service
> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
> if you previously had data in a Region of another Cluster (for instance).
>
>
>
> If the data is coming externally, then *Spring XD
> <http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
> (streaming) data from a source
> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
> to a sink
> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
> It also allows you to perform all manners of transformations/conversions,
> trigger events, and so and so forth.
>
>
>
> -j
>
>
>
>
>
> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_
> snapshots/chapter_overview.html
>
> [2] http://projects.spring.io/spring-xd/
>
> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
> ence/html/#sources
>
> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
> ence/html/#sinks
>
>
>
>
>
> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
> wrote:
>
> Hey Guys,
>
>
>
> Whats the fastest way to do bulk insert in a region?
>
>
>
> I am using region.putAll , is there any alternative/faster API?
>
>
>
> regards
>
>
>
>
>
> --
>
> -John
>
> john.blum10101 (skype)
>
>
>

RE: fastest way to bulk insert in geode

Posted by Real Wes <th...@outlook.com>.
Amit,

John and Mike’s advice about tradeoffs is worth heeding. You’ll find that your speed is probably just fine with putAll, but if you just have to have NOS in your tank, you might consider (since you’re inside a function) doing the putAll from the function into your region but changing the region scope to distributed-no-ack.  See: https://geode.apache.org/docs/guide/developing/distributed_regions/choosing_level_of_dist.html
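For reference, that scope can be set declaratively. A minimal cache.xml sketch, assuming a replicated region (the region name "trades" is illustrative, and note the scope attribute applies to replicated regions only):

```xml
<!-- Illustrative cache.xml fragment, not from this thread: a replicated
     region whose puts do not wait for acknowledgement from other members,
     trading consistency guarantees for lower put latency. -->
<region name="trades">
  <region-attributes refid="REPLICATE" scope="distributed-no-ack"/>
</region>
```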

Wes

From: Amit Pandey [mailto:amit.pandey2103@gmail.com]
Sent: Friday, March 3, 2017 12:26 PM
To: user@geode.apache.org
Subject: Re: fastest way to bulk insert in geode

Hey John ,

Thanks I am planning to use Spring XD. But my current usecase is that I am aggregating and doing some computes in a Function and then I want to populate it with the values I have a map , is region.putAll the fastest?

Regards

On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io>> wrote:
You might consider using the Snapshot service<http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1] if you previously had data in a Region of another Cluster (for instance).

If the data is coming externally, then Spring XD<http://projects.spring.io/spring-xd/> [2] is a great tool for moving (streaming) data from a source<http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3] to a sink<http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].  It also allows you to perform all manners of transformations/conversions, trigger events, and so and so forth.

-j


[1] http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html
[2] http://projects.spring.io/spring-xd/
[3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources
[4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks


On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>> wrote:
Hey Guys,

Whats the fastest way to do bulk insert in a region?

I am using region.putAll , is there any alternative/faster API?

regards



--
-John
john.blum10101 (skype)


Re: fastest way to bulk insert in geode

Posted by Michael Stolz <ms...@pivotal.io>.
And of course, it depends on your access patterns.
If all access is by primary key, then CacheLoaders are a viable option.
If access is by query on non-primary key fields, then ALL data needs to be
pre-loaded, otherwise you won't know if you got the right query result.

So for situations where pre-loading is either required or desirable, putAll
is probably the best tool BUT don't try to put too much all at once because
that will bog down at the network layer. Keep yourself down to a couple of
hundred objects per call to putAll, and tune that number to get the best
overall throughput.
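Mike's batching advice can be sketched with plain java.util collections. The helper below splits a large map into fixed-size batches and hands each one to a sink; in Geode the sink would be region::putAll, and the batch size of a couple hundred is the tuning assumption from the advice above, not a Geode constant.

```java
import java.util.*;
import java.util.function.Consumer;

public class Main {
    // Split a large map into batches of at most batchSize entries and pass
    // each batch to the sink (in Geode: region::putAll). Returns the number
    // of batches sent, for tuning.
    static <K, V> int putAllInBatches(Map<K, V> data, int batchSize,
                                      Consumer<Map<K, V>> sink) {
        Map<K, V> batch = new HashMap<>();
        int batches = 0;
        for (Map.Entry<K, V> e : data.entrySet()) {
            batch.put(e.getKey(), e.getValue());
            if (batch.size() == batchSize) {
                sink.accept(batch);
                batches++;
                batch = new HashMap<>();
            }
        }
        if (!batch.isEmpty()) {          // flush the final partial batch
            sink.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        Map<Integer, String> data = new HashMap<>();
        for (int i = 0; i < 1000; i++) data.put(i, "value-" + i);

        Map<Integer, String> target = new HashMap<>();  // stand-in for a Region
        int batches = putAllInBatches(data, 200, target::putAll);

        System.out.println(batches + " batches, " + target.size() + " entries");
    }
}
```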

--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: +1-631-835-4771

On Fri, Mar 3, 2017 at 1:10 PM, John Blum <jb...@pivotal.io> wrote:

> SIMPLE ANSWER:
>
> Well, I am not certain about "fastest", but it is convenient, and maybe 1
> of the few ways (perhaps the only way, other than individual Region puts,
> which I gather would be slower).
>
> If we are talking about a simple Map of data that is relatively small,
> then Region.putAll(:Map) is your best option.
>
> However...
>
>
> DETAILED ANSWER*:*
>
> I.e. don't equate loading a simple Map with bulk data loads in general.
>
> It really depends on many factors, like distribution factors in
> particular... Region type (e.g. REPLICATE vs. PARTITION), Scope (as in
> DISTRIBUTED_ACK, DISTRIBUTED_NO_ACK (only applicable for REPLICATE Regions;
> i.e. PARTITION Regions are DISTRIBUTED_ACK only), number of redundant
> copies (for PARTITION Regions), number of nodes in cluster hosting the
> "target" Region, etc, etc.  All these can affect speed.
>
> But typically, bulk loading data (batch) is not so much about speed as it
> is consistency/accuracy, or data availability.
>
> A more sophisticated approach in a distributed scenario, say if you were
> using PARTITION Regions with a fixed partitioning strategy would be to load
> the data in parallel from a Function, where the Function handles the data
> set for the individual nodes based on the partitioning strategy.  Of course
> redundant copies (along with Redundancy Zones) are still going to affect
> perf, even in this approach.
>
> So, again, it is a factor of your consistency and availability guarantees.
>
> See here
> <http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/how_pr_ha_works.html> [1]
> and here
> <http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/configuring_ha_for_pr.html> [2]
> for more details.
>
> I think the more pertinent question is where do you want to make your data
> available to best serve the needs of your application in a reliable fashion
> at runtime, rather than how it gets there.  You must be mindful of how much
> memory your data takes up in the first place.  Additionally, using a
> CacheLoader to lazily load the data in certain cases might make more
> sense.  I.e. w.r.t. to bulk load, it is not about having all your data in
> memory, but having the right data in-memory at the right time.  That is
> going give your application the best responsiveness.
>
> Food for thought,
>
> -j
>
> [1] http://gemfire90.docs.pivotal.io/geode/developing/
> partitioned_regions/how_pr_ha_works.html
> [2] http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/
> configuring_ha_for_pr.html
>
>
> On Fri, Mar 3, 2017 at 9:26 AM, Amit Pandey <am...@gmail.com>
> wrote:
>
>> Hey John ,
>>
>> Thanks I am planning to use Spring XD. But my current usecase is that I
>> am aggregating and doing some computes in a Function and then I want to
>> populate it with the values I have a map , is region.putAll the fastest?
>>
>> Regards
>>
>> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>>
>>> You might consider using the Snapshot service
>>> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
>>> if you previously had data in a Region of another Cluster (for instance).
>>>
>>> If the data is coming externally, then *Spring XD
>>> <http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
>>> (streaming) data from a source
>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
>>> to a sink
>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
>>> It also allows you to perform all manners of transformations/conversions,
>>> trigger events, and so and so forth.
>>>
>>> -j
>>>
>>>
>>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_sn
>>> apshots/chapter_overview.html
>>> [2] http://projects.spring.io/spring-xd/
>>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>> ence/html/#sources
>>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>>> ence/html/#sinks
>>>
>>>
>>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
>>> wrote:
>>>
>>>> Hey Guys,
>>>>
>>>> Whats the fastest way to do bulk insert in a region?
>>>>
>>>> I am using region.putAll , is there any alternative/faster API?
>>>>
>>>> regards
>>>>
>>>
>>>
>>>
>>> --
>>> -John
>>> john.blum10101 (skype)
>>>
>>
>>
>
>
> --
> -John
> john.blum10101 (skype)
>

Re: fastest way to bulk insert in geode

Posted by John Blum <jb...@pivotal.io>.
SIMPLE ANSWER:

Well, I am not certain about "fastest", but it is convenient, and maybe one
of the few ways (perhaps the only way, other than individual Region puts,
which I gather would be slower).

If we are talking about a simple Map of data that is relatively small, then
Region.putAll(:Map) is your best option.

However...


DETAILED ANSWER:

I.e. don't equate loading a simple Map with bulk data loads in general.

It really depends on many factors, distribution factors in particular...
Region type (e.g. REPLICATE vs. PARTITION), Scope (as in DISTRIBUTED_ACK
vs. DISTRIBUTED_NO_ACK; the latter is only applicable to REPLICATE Regions,
since PARTITION Regions are DISTRIBUTED_ACK only), number of redundant
copies (for PARTITION Regions), number of nodes in the cluster hosting the
"target" Region, etc.  All of these can affect speed.

But typically, bulk loading data (batch) is not so much about speed as it
is consistency/accuracy, or data availability.

A more sophisticated approach in a distributed scenario, say if you were
using PARTITION Regions with a fixed partitioning strategy, would be to load
the data in parallel from a Function, where the Function handles the data
set for the individual nodes based on the partitioning strategy.  Of course
redundant copies (along with Redundancy Zones) are still going to affect
perf, even in this approach.
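As a rough stdlib analogy to that parallel approach (this is not Geode's FunctionService API; the hash-based routing and four-way split are assumptions for illustration), each worker below loads only the slice of the data set that hashes to its partition, the way a Function executed onRegion would run on the member hosting each bucket:

```java
import java.util.*;
import java.util.concurrent.*;

public class Main {
    public static void main(String[] args) throws Exception {
        final int partitions = 4;
        final Map<Integer, String> data = new HashMap<>();
        for (int i = 0; i < 1000; i++) data.put(i, "v" + i);

        // One target per partition, standing in for the slice of the Region
        // each member owns under a fixed partitioning strategy.
        final List<ConcurrentMap<Integer, String>> targets = new ArrayList<>();
        for (int p = 0; p < partitions; p++) targets.add(new ConcurrentHashMap<>());

        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<?>> futures = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            final int part = p;
            futures.add(pool.submit(() -> {
                for (Map.Entry<Integer, String> e : data.entrySet()) {
                    // route each key to its partition by hash
                    if (Math.floorMod(e.getKey().hashCode(), partitions) == part) {
                        targets.get(part).put(e.getKey(), e.getValue());
                    }
                }
            }));
        }
        for (Future<?> f : futures) f.get();   // wait for all loaders
        pool.shutdown();

        int total = targets.stream().mapToInt(Map::size).sum();
        System.out.println(total);
    }
}
```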

So, again, it is a factor of your consistency and availability guarantees.

See here
<http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/how_pr_ha_works.html>
[1]
and here
<http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/configuring_ha_for_pr.html>
[2]
for more details.

I think the more pertinent question is where do you want to make your data
available to best serve the needs of your application in a reliable fashion
at runtime, rather than how it gets there.  You must be mindful of how much
memory your data takes up in the first place.  Additionally, using a
CacheLoader to lazily load the data in certain cases might make more
sense.  I.e., w.r.t. bulk loading, it is not about having all your data in
memory, but having the right data in memory at the right time.  That is
going to give your application the best responsiveness.
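The lazy-loading pattern a CacheLoader provides has the same read-through shape as Map.computeIfAbsent in plain Java. A minimal sketch under that analogy (loadFromBackend is a hypothetical stand-in for what CacheLoader.load() would fetch; this is not the Geode API itself):

```java
import java.util.*;

public class Main {
    // Stand-in for the expensive fetch a CacheLoader.load() would perform
    // (database query, service call, etc.).
    static String loadFromBackend(int key) {
        return "loaded-" + key;
    }

    public static void main(String[] args) {
        Map<Integer, String> cache = new HashMap<>();

        // First access misses and triggers the loader, like a Region
        // configured with a CacheLoader; second access is served from memory.
        String first  = cache.computeIfAbsent(42, Main::loadFromBackend);
        String second = cache.computeIfAbsent(42, Main::loadFromBackend);

        System.out.println(first + " " + (first == second) + " " + cache.size());
    }
}
```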

Food for thought,

-j

[1]
http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/how_pr_ha_works.html
[2]
http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/configuring_ha_for_pr.html


On Fri, Mar 3, 2017 at 9:26 AM, Amit Pandey <am...@gmail.com>
wrote:

> Hey John ,
>
> Thanks I am planning to use Spring XD. But my current usecase is that I am
> aggregating and doing some computes in a Function and then I want to
> populate it with the values I have a map , is region.putAll the fastest?
>
> Regards
>
> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:
>
>> You might consider using the Snapshot service
>> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
>> if you previously had data in a Region of another Cluster (for instance).
>>
>> If the data is coming externally, then *Spring XD
>> <http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
>> (streaming) data from a source
>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
>> to a sink
>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
>> It also allows you to perform all manners of transformations/conversions,
>> trigger events, and so and so forth.
>>
>> -j
>>
>>
>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_
>> snapshots/chapter_overview.html
>> [2] http://projects.spring.io/spring-xd/
>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>> ence/html/#sources
>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/refer
>> ence/html/#sinks
>>
>>
>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
>> wrote:
>>
>>> Hey Guys,
>>>
>>> Whats the fastest way to do bulk insert in a region?
>>>
>>> I am using region.putAll , is there any alternative/faster API?
>>>
>>> regards
>>>
>>
>>
>>
>> --
>> -John
>> john.blum10101 (skype)
>>
>
>


-- 
-John
john.blum10101 (skype)

Re: fastest way to bulk insert in geode

Posted by Amit Pandey <am...@gmail.com>.
Hey John ,

Thanks I am planning to use Spring XD. But my current usecase is that I am
aggregating and doing some computes in a Function and then I want to
populate it with the values I have a map , is region.putAll the fastest?

Regards

On Fri, Mar 3, 2017 at 10:52 PM, John Blum <jb...@pivotal.io> wrote:

> You might consider using the Snapshot service
> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html> [1]
> if you previously had data in a Region of another Cluster (for instance).
>
> If the data is coming externally, then *Spring XD
> <http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
> (streaming) data from a source
> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources> [3]
> to a sink
> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks> [4].
> It also allows you to perform all manners of transformations/conversions,
> trigger events, and so and so forth.
>
> -j
>
>
> [1] http://gemfire90.docs.pivotal.io/geode/managing/
> cache_snapshots/chapter_overview.html
> [2] http://projects.spring.io/spring-xd/
> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/
> reference/html/#sources
> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/
> reference/html/#sinks
>
>
> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
> wrote:
>
>> Hey Guys,
>>
>> Whats the fastest way to do bulk insert in a region?
>>
>> I am using region.putAll , is there any alternative/faster API?
>>
>> regards
>>
>
>
>
> --
> -John
> john.blum10101 (skype)
>

Re: fastest way to bulk insert in geode

Posted by John Blum <jb...@pivotal.io>.
You might consider using the Snapshot service
<http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html>
[1]
if you previously had data in a Region of another Cluster (for instance).

If the data is coming externally, then *Spring XD
<http://projects.spring.io/spring-xd/> *[2] is a great tool for moving
(streaming) data from a source
<http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources>
[3]
to a sink
<http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks>
[4].
It also allows you to perform all manner of transformations/conversions,
trigger events, and so on and so forth.

-j


[1]
http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html
[2] http://projects.spring.io/spring-xd/
[3]
http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources
[4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks


On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <am...@gmail.com>
wrote:

> Hey Guys,
>
> Whats the fastest way to do bulk insert in a region?
>
> I am using region.putAll , is there any alternative/faster API?
>
> regards
>



-- 
-John
john.blum10101 (skype)