Posted to user@ignite.apache.org by vt...@gmail.com on 2021/02/19 06:06:10 UTC

very fast loading of very big table

In Sep 2020 I published the paper about Loading Large Datasets into Apache
Ignite by Using a Key-Value API (English [1] and Russian [2] versions). The
approach described works in production, but shows unacceptable performance
for very large tables.

  

The story continues, and yesterday I finished the proof of concept for very
fast loading of a very big table. A partitioned MSSQL table of about 295
million rows was loaded into a 4-node Ignite cluster in 3 min 35 sec. Each
node executed its own SQL queries in parallel and then distributed the loaded
values across the other cluster nodes.

  

Probably that result will be of interest to the community.

  

Regards,

Vladimir Chernyi

  

[1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api

[2] https://m.habr.com/ru/post/526708/

  


Re: very fast loading of very big table

Posted by Pavel Tupitsyn <pt...@apache.org>.
> 295 million rows
> 3 min 35 sec

Agree with Ilya, DataStreamer should do this much faster, have you tried it?

> 3.7Gb

I would not call this "big" by any means today, when even the cheapest
laptops have 8GB of RAM.
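
In case a starting point helps, here is a minimal streamer sketch; the cache
name, key/value types and the rows() source are placeholders, not taken from
your setup:

    import java.util.List;
    import java.util.Map;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.Ignition;

    public class StreamerLoadSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                // The target cache must already exist on the cluster. The streamer
                // batches entries and routes each batch to its primary node.
                try (IgniteDataStreamer<Long, String> streamer =
                         ignite.dataStreamer("personCache")) {
                    streamer.allowOverwrite(true); // update keys that already exist

                    // rows() stands for whatever produces key/value pairs,
                    // e.g. a JDBC ResultSet read from MSSQL.
                    for (Map.Entry<Long, String> row : rows())
                        streamer.addData(row.getKey(), row.getValue());
                } // close() flushes the remaining buffered entries
            }
        }

        private static List<Map.Entry<Long, String>> rows() {
            return List.of(Map.entry(1L, "value-1"), Map.entry(2L, "value-2"));
        }
    }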


On Fri, Feb 19, 2021 at 1:33 PM Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> Is there a chance that you have tried enabling streaming (data streamer)
> on the clients?
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пт, 19 февр. 2021 г. в 10:10, <vt...@gmail.com>:
>
>> Hi Denis,
>>
>> Data space is 3.7Gb according to MSSQL table properries
>>
>> Vladimir
>>
>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>:
>>
>> Hello Vladimir,
>>
>> Good to hear from you! How much is that in gigabytes?
>>
>> -
>> Denis
>>
>>
>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>
>> Sep 2020 I've published the paper about Loading Large Datasets into
>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>> version). The approach described works in production, but shows
>> inacceptable perfomance for very large tables.
>>
>> The story continues, and yesterday I've finished the proof of concept for
>> very fast loading of very big table. The partitioned MSSQL table about 295
>> million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec. Each
>> node had executed its own SQL queries in parallel and then distributed the
>> loaded values across the other cluster nodes.
>>
>> Probably that result will be of interest for the community.
>>
>> Regards,
>> Vladimir Chernyi
>>
>> [1]
>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>> [2] https://m.habr.com/ru/post/526708/
>>
>>
>>
>> --
>> Отправлено из мобильного приложения Яндекс.Почты
>>
>

Re: very fast loading of very big table

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Is there a chance that you have tried enabling streaming (data streamer) on
the clients?

Regards,
-- 
Ilya Kasnacheev


Fri, 19 Feb 2021 at 10:10, <vt...@gmail.com>:

> Hi Denis,
>
> Data space is 3.7Gb according to MSSQL table properries
>
> Vladimir
>
> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>:
>
> Hello Vladimir,
>
> Good to hear from you! How much is that in gigabytes?
>
> -
> Denis
>
>
> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>
> Sep 2020 I've published the paper about Loading Large Datasets into Apache
> Ignite by Using a Key-Value API (English [1] and Russian [2] version). The
> approach described works in production, but shows inacceptable perfomance
> for very large tables.
>
> The story continues, and yesterday I've finished the proof of concept for
> very fast loading of very big table. The partitioned MSSQL table about 295
> million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec. Each
> node had executed its own SQL queries in parallel and then distributed the
> loaded values across the other cluster nodes.
>
> Probably that result will be of interest for the community.
>
> Regards,
> Vladimir Chernyi
>
> [1]
> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
> [2] https://m.habr.com/ru/post/526708/
>
>
>
> --
> Отправлено из мобильного приложения Яндекс.Почты
>

Re: very fast loading of very big table

Posted by Vladimir Tchernyi <vt...@gmail.com>.
Pavel,

thanks for mentioning the patterns. Of course, I spent a lot of time
reading the documentation, [2] at the very beginning and [1] a couple of
months ago. Here is the origin of my pain in the neck about a complete
GitHub example - neither [1] nor [2] answers my problem. The keyword in my
case is ASAP, so there should be a multithreaded example. Of course, a
real-world example must not use primitive types as cache values; I tried to
illustrate that in [3].

I had built a loader with the data streamer [1]; it seems I was limited by
the network adapter performance (see my previous post in this thread). That
is the reason I decided to move the SQL queries to the data nodes.
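
Roughly, the idea looks like the sketch below. The table name, partition
function and JDBC details are made up for illustration; the real code uses a
ComputeTask with map()/reduce() and invokeAll() with an EntryProcessor (see
my earlier post in this thread), while the sketch uses the simpler
compute().call() API just to show the shape.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.lang.IgniteCallable;
    import org.apache.ignite.resources.IgniteInstanceResource;

    public class PartitionLoadSketch {
        // One job per MSSQL table partition; jobs are spread over the data nodes,
        // so every node reads its own part of the table and writes it locally.
        static class LoadPartitionJob implements IgniteCallable<Integer> {
            @IgniteInstanceResource
            private Ignite ignite; // injected on the node that runs the job

            private final int partitionId;

            LoadPartitionJob(int partitionId) {
                this.partitionId = partitionId;
            }

            @Override public Integer call() throws Exception {
                Map<Long, String> batch = new HashMap<>();

                // "pf" is a made-up MSSQL partition function name.
                try (Connection conn = DriverManager.getConnection(
                         "jdbc:sqlserver://mssql-host;databaseName=db");
                     PreparedStatement st = conn.prepareStatement(
                         "SELECT id, payload FROM bigTable" +
                         " WHERE $PARTITION.pf(partCol) = ?")) {
                    st.setInt(1, partitionId);

                    try (ResultSet rs = st.executeQuery()) {
                        while (rs.next())
                            batch.put(rs.getLong(1), rs.getString(2));
                    }
                }

                IgniteCache<Long, String> cache = ignite.cache("targetCache");
                cache.putAll(batch); // real code: invokeAll() with an EntryProcessor

                return batch.size();
            }
        }

        public static void main(String[] args) {
            try (Ignite client = Ignition.start("client-config.xml")) {
                Collection<IgniteCallable<Integer>> jobs = new ArrayList<>();

                for (int part = 0; part < 16; part++) // 16 partitions, for example
                    jobs.add(new LoadPartitionJob(part));

                // Distribute the jobs over the server nodes and sum the row counts.
                int total = client.compute().call(jobs).stream()
                    .mapToInt(Integer::intValue).sum();

                System.out.println("Rows loaded: " + total);
            }
        }
    }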

Vladimir

Fri, 26 Feb 2021 at 10:54, Pavel Tupitsyn <pt...@apache.org>:

> Vladimir,
>
> I think all real-world use cases are very valuable to the community.
> However, we should be careful to avoid misleading conclusions.
>
> We have well-known patterns for loading data from other systems:
> DataStreamer [1] and CacheStore [2].
> The article [3] seems a bit confusing to me, since none of those two
> patterns are mentioned there.
> When proposing a custom approach, it would be great to compare it to the
> standard alternatives.
>
> [1] https://ignite.apache.org/docs/latest/data-streaming
> [2] https://ignite.apache.org/docs/latest/persistence/custom-cache-store
> [3]
> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>
> On Fri, Feb 26, 2021 at 9:19 AM Vladimir Tchernyi <vt...@gmail.com>
> wrote:
>
>> Hi Pavel,
>>
>> the code [1] you shared is a kind of in-memory experiment with all the
>> processes inside a single JVM. My work differs - it is from the big retail
>> business, and hence it is 100% practice-oriented. True to say, it's
>> oriented to the state of things inside my company, and that is my question
>> - will my results be interesting to the community? I have seen a lot of
>> questions on the user list regarding data loading and difficulties here
>> seem to be a blocker in extending Ignite's popularity.
>>
>> Please let me know if my case is not common in the industry. We have a
>> big bare-metal Windows MSSQL server and a number of bare metal hosts, each
>> with the virtualization software and a single CentOs virtual server inside.
>> These CentOs hosts currently form an Ignite cluster with 4 data nodes and 1
>> client node. The example [2] I published last year is intended to solve the
>> business problem we have out here:
>> 1) the data currently present in the cluster have zero value;
>> 2) actual data is in the database and must be loaded in the cluster ASAP.
>> We use BinaryObject as cache key and value;
>> 3) cluster performs some data processing and writes the result to the
>> database.
>>
>> Unfortunately, the code [2] does not 100% OK in my case, it tends to say
>> "is node still alive" and to drop the client node off the cluster. The
>> performance of the MSSQL and network is what it is, I consider it as a
>> given restriction. It seems I got some progress when managed to move the
>> data loading process from a single client node to multiple data nodes. When
>> the extra data nodes will be added, I expect the load performance will be
>> better. Of course, until my big MSSQL will be able to hold the load. So I
>> want to know how interesting my results will be if it will be published.
>>
>> WDYT?
>>
>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>> [2] https://github.com/vtchernyi/FastDataLoad
>>
>> чт, 25 февр. 2021 г. в 11:01, Pavel Tupitsyn <pt...@apache.org>:
>>
>>> Vladimir,
>>>
>>> Thanks for getting back to us. A full example that clarifies the
>>> situation will be great!
>>>
>>> > Can you share your code as a GitHub project? Maybe with the script to
>>> reproduce 6 GB of data.
>>>
>>> It is super trivial, I just wanted to get a sense of the throughput and
>>> check if we have some kind of a regression in recent versions (we don't) [1]
>>> Also I realised that the data size can be counted very differently - do
>>> we account for DB overhead and how?
>>>
>>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>>>
>>> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <vt...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd spent some time thinking about the community comments on my post.
>>>> It seems that Ignite is really not a bottleneck here. The performance of my
>>>> production MSSQL is a given restriction and the problem is to ensure fast
>>>> loading by executing multiple parallel queries. I'll test my code in
>>>> production for a couple of months for possible problems. If it will be OK,
>>>> probably the complete/downloadable/compilable GitHub example will be useful
>>>> for the community.
>>>>
>>>> WDYT?
>>>>
>>>> пт, 19 февр. 2021 г. в 21:47, Vladimir Tchernyi <vt...@gmail.com>:
>>>>
>>>>> Pavel,
>>>>>
>>>>> maybe it's time to put your five-cent in. Can you share your code as a
>>>>> GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>>>
>>>>> As for MSSQL data retrieval being the bottleneck - don't think so, I
>>>>> got 15 min load time for 1 node and 3.5 min time for 4 nodes. Looks like a
>>>>> linear dependency (the table and the RDBMS server were the same).
>>>>> --
>>>>> Vladimir
>>>>>
>>>>> пт, 19 февр. 2021 г. в 19:47, Pavel Tupitsyn <pt...@apache.org>:
>>>>>
>>>>>> > First of all, I tried to select the whole table as once
>>>>>>
>>>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here,
>>>>>> not Ignite.
>>>>>>
>>>>>> Can you run a test where some dummy data of the same size as real
>>>>>> data is generated and inserted into Ignite,
>>>>>> so that we test Ignite perf only, excluding MSSQL from the equation?
>>>>>> For example, streaming 300 million entries (total size 6 GB) takes
>>>>>> around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <
>>>>>> vtchernyi@gmail.com> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>> thanks for your interest in my work.
>>>>>>>
>>>>>>> I didn't try COPY FROM since I've tried to work with Ignite SQL a
>>>>>>> couple of years ago and didn't succeed. Probably because examples available
>>>>>>> aren't complete/downloadable/compilable (the paper [1] contains GitHub
>>>>>>> repo, that is my five cents in changing the status quo). My interest is in
>>>>>>> KV API.
>>>>>>>
>>>>>>> I did try a data streamer, and that was my first try. I did not
>>>>>>> notice a significant time reduction in using code from my paper [1] versus
>>>>>>> data streamer/receiver. There was some memory economy with the streamer,
>>>>>>> though. I must say my experiment was made on a heavily loaded production
>>>>>>> mssql server. Filtered query with 300K rows resultset takes about 15 sec.
>>>>>>> The story follows.
>>>>>>>
>>>>>>> First of all, I tried to select the whole table as once, I got the
>>>>>>> network timeout and the client node was dropped off the cluster (is node
>>>>>>> still alive?).
>>>>>>> So I'd partitioned the table and executed a number of queries
>>>>>>> one-by-one on the client node, each query for the specific table partition.
>>>>>>> That process took about 90 min. Inacceptable time.
>>>>>>>
>>>>>>> Then I tried to execute my queries in parallel on the client node,
>>>>>>> each query executing dataStreamer.addData() for a single dataStreamer. The
>>>>>>> timing was not less than 15 min. All the attempts were the same, probably
>>>>>>> that was the network throughput limit on the client node (same interface
>>>>>>> used for the resultset and for cluster intercom). Say it again - that was
>>>>>>> the production environment.
>>>>>>>
>>>>>>> Final schema:
>>>>>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one
>>>>>>> job for one table partition;
>>>>>>> * each job executes SQL query, constructs a map with binary object
>>>>>>> key and value. Then the job executes targetCache.invokeAll() specifying the
>>>>>>> constructed map and the static EntryProcessor class. The EntryProcessor
>>>>>>> contains the logic for cache binary entry update;
>>>>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>>>>
>>>>>>> The schema described proved to be network error-free in my
>>>>>>> production network and gives acceptable timing.
>>>>>>>
>>>>>>> Vladimir
>>>>>>>
>>>>>>> [1]
>>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>
>>>>>>> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
>>>>>>> stephen.darlington@gridgain.com>:
>>>>>>>
>>>>>>>> I think it’s more that that putAll is mostly atomic, so the more
>>>>>>>> records you save in one chunk, the more locking, etc. happens. Distributing
>>>>>>>> as compute jobs means all the putAlls will be local which is beneficial,
>>>>>>>> and the size of each put is going to be smaller (also beneficial).
>>>>>>>>
>>>>>>>> But that’s a lot of work that the data streamer already does for
>>>>>>>> you and the data streamer also batches updates so would still be faster.
>>>>>>>>
>>>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <
>>>>>>>> maximiliano.628@gmail.com> wrote:
>>>>>>>>
>>>>>>>> What would be the difference between doing cache.putAll(all rows)
>>>>>>>> and separating them by affinity key+executing putAll inside a compute job.
>>>>>>>> If I'm not mistaken, doing putAll should end up splitting those
>>>>>>>> rows by affinity key in one of the servers, right?
>>>>>>>> Is there a comparison of that?
>>>>>>>>
>>>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Vladimir,
>>>>>>>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin
>>>>>>>>> JDBC?
>>>>>>>>> This command uses 'IgniteDataStreamer' to write data into cluster
>>>>>>>>> and parse CSV on the server node.
>>>>>>>>>
>>>>>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load
>>>>>>>>> data.
>>>>>>>>>
>>>>>>>>> Hi Denis,
>>>>>>>>>
>>>>>>>>> Data space is 3.7Gb according to MSSQL table properries
>>>>>>>>>
>>>>>>>>> Vladimir
>>>>>>>>>
>>>>>>>>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>
>>>>>>>>> <dm...@apache.org>:
>>>>>>>>>
>>>>>>>>> Hello Vladimir,
>>>>>>>>>
>>>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Sep 2020 I've published the paper about Loading Large Datasets
>>>>>>>>> into Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>>>>>> version). The approach described works in production, but shows
>>>>>>>>> inacceptable perfomance for very large tables.
>>>>>>>>>
>>>>>>>>> The story continues, and yesterday I've finished the proof of
>>>>>>>>> concept for very fast loading of very big table. The partitioned MSSQL
>>>>>>>>> table about 295 million rows was loaded by the 4-node Ignite cluster in 3
>>>>>>>>> min 35 sec. Each node had executed its own SQL queries in parallel and then
>>>>>>>>> distributed the loaded values across the other cluster nodes.
>>>>>>>>>
>>>>>>>>> Probably that result will be of interest for the community.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Vladimir Chernyi
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Отправлено из мобильного приложения Яндекс.Почты
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Taras Ledkov
>>>>>>>>> Mail-To: tledkov@gridgain.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>

Re: very fast loading of very big table

Posted by Pavel Tupitsyn <pt...@apache.org>.
Vladimir,

I think all real-world use cases are very valuable to the community.
However, we should be careful to avoid misleading conclusions.

We have well-known patterns for loading data from other systems:
DataStreamer [1] and CacheStore [2].
The article [3] seems a bit confusing to me, since neither of those two
patterns is mentioned there.
When proposing a custom approach, it would be great to compare it to the
standard alternatives.

[1] https://ignite.apache.org/docs/latest/data-streaming
[2] https://ignite.apache.org/docs/latest/persistence/custom-cache-store
[3]
https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
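
For illustration, a skeleton of the CacheStore pattern [2]; the JDBC URL,
table name and cache name below are placeholders, not a recommendation:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import javax.cache.Cache;
    import javax.cache.configuration.FactoryBuilder;
    import javax.cache.integration.CacheLoaderException;
    import org.apache.ignite.cache.store.CacheStoreAdapter;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.lang.IgniteBiInClosure;

    // Fills the cache when IgniteCache#loadCache(...) is called; loadCache()
    // runs on every server node, so the data does not pass through a client.
    public class PersonStore extends CacheStoreAdapter<Long, String> {
        @Override public void loadCache(IgniteBiInClosure<Long, String> clo,
            Object... args) {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://mssql-host;databaseName=db");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, payload FROM bigTable")) {
                while (rs.next())
                    clo.apply(rs.getLong(1), rs.getString(2)); // hand each row to Ignite
            }
            catch (Exception e) {
                throw new CacheLoaderException(e);
            }
        }

        // Read/write-through is not needed for one-off bulk preloading.
        @Override public String load(Long key) { return null; }
        @Override public void write(Cache.Entry<? extends Long, ? extends String> e) { }
        @Override public void delete(Object key) { }

        static CacheConfiguration<Long, String> cacheConfig() {
            CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("personCache");
            cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(PersonStore.class));
            return cfg;
        }
    }
    // Usage: ignite.getOrCreateCache(PersonStore.cacheConfig()).loadCache(null);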

On Fri, Feb 26, 2021 at 9:19 AM Vladimir Tchernyi <vt...@gmail.com>
wrote:

> Hi Pavel,
>
> the code [1] you shared is a kind of in-memory experiment with all the
> processes inside a single JVM. My work differs - it is from the big retail
> business, and hence it is 100% practice-oriented. True to say, it's
> oriented to the state of things inside my company, and that is my question
> - will my results be interesting to the community? I have seen a lot of
> questions on the user list regarding data loading and difficulties here
> seem to be a blocker in extending Ignite's popularity.
>
> Please let me know if my case is not common in the industry. We have a big
> bare-metal Windows MSSQL server and a number of bare metal hosts, each with
> the virtualization software and a single CentOs virtual server inside.
> These CentOs hosts currently form an Ignite cluster with 4 data nodes and 1
> client node. The example [2] I published last year is intended to solve the
> business problem we have out here:
> 1) the data currently present in the cluster have zero value;
> 2) actual data is in the database and must be loaded in the cluster ASAP.
> We use BinaryObject as cache key and value;
> 3) cluster performs some data processing and writes the result to the
> database.
>
> Unfortunately, the code [2] does not 100% OK in my case, it tends to say
> "is node still alive" and to drop the client node off the cluster. The
> performance of the MSSQL and network is what it is, I consider it as a
> given restriction. It seems I got some progress when managed to move the
> data loading process from a single client node to multiple data nodes. When
> the extra data nodes will be added, I expect the load performance will be
> better. Of course, until my big MSSQL will be able to hold the load. So I
> want to know how interesting my results will be if it will be published.
>
> WDYT?
>
> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
> [2] https://github.com/vtchernyi/FastDataLoad
>
> чт, 25 февр. 2021 г. в 11:01, Pavel Tupitsyn <pt...@apache.org>:
>
>> Vladimir,
>>
>> Thanks for getting back to us. A full example that clarifies the
>> situation will be great!
>>
>> > Can you share your code as a GitHub project? Maybe with the script to
>> reproduce 6 GB of data.
>>
>> It is super trivial, I just wanted to get a sense of the throughput and
>> check if we have some kind of a regression in recent versions (we don't) [1]
>> Also I realised that the data size can be counted very differently - do
>> we account for DB overhead and how?
>>
>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>>
>> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <vt...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'd spent some time thinking about the community comments on my post. It
>>> seems that Ignite is really not a bottleneck here. The performance of my
>>> production MSSQL is a given restriction and the problem is to ensure fast
>>> loading by executing multiple parallel queries. I'll test my code in
>>> production for a couple of months for possible problems. If it will be OK,
>>> probably the complete/downloadable/compilable GitHub example will be useful
>>> for the community.
>>>
>>> WDYT?
>>>
>>> пт, 19 февр. 2021 г. в 21:47, Vladimir Tchernyi <vt...@gmail.com>:
>>>
>>>> Pavel,
>>>>
>>>> maybe it's time to put your five-cent in. Can you share your code as a
>>>> GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>>
>>>> As for MSSQL data retrieval being the bottleneck - don't think so, I
>>>> got 15 min load time for 1 node and 3.5 min time for 4 nodes. Looks like a
>>>> linear dependency (the table and the RDBMS server were the same).
>>>> --
>>>> Vladimir
>>>>
>>>> пт, 19 февр. 2021 г. в 19:47, Pavel Tupitsyn <pt...@apache.org>:
>>>>
>>>>> > First of all, I tried to select the whole table as once
>>>>>
>>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here,
>>>>> not Ignite.
>>>>>
>>>>> Can you run a test where some dummy data of the same size as real data
>>>>> is generated and inserted into Ignite,
>>>>> so that we test Ignite perf only, excluding MSSQL from the equation?
>>>>> For example, streaming 300 million entries (total size 6 GB) takes
>>>>> around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>>>
>>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>> thanks for your interest in my work.
>>>>>>
>>>>>> I didn't try COPY FROM since I've tried to work with Ignite SQL a
>>>>>> couple of years ago and didn't succeed. Probably because examples available
>>>>>> aren't complete/downloadable/compilable (the paper [1] contains GitHub
>>>>>> repo, that is my five cents in changing the status quo). My interest is in
>>>>>> KV API.
>>>>>>
>>>>>> I did try a data streamer, and that was my first try. I did not
>>>>>> notice a significant time reduction in using code from my paper [1] versus
>>>>>> data streamer/receiver. There was some memory economy with the streamer,
>>>>>> though. I must say my experiment was made on a heavily loaded production
>>>>>> mssql server. Filtered query with 300K rows resultset takes about 15 sec.
>>>>>> The story follows.
>>>>>>
>>>>>> First of all, I tried to select the whole table as once, I got the
>>>>>> network timeout and the client node was dropped off the cluster (is node
>>>>>> still alive?).
>>>>>> So I'd partitioned the table and executed a number of queries
>>>>>> one-by-one on the client node, each query for the specific table partition.
>>>>>> That process took about 90 min. Inacceptable time.
>>>>>>
>>>>>> Then I tried to execute my queries in parallel on the client node,
>>>>>> each query executing dataStreamer.addData() for a single dataStreamer. The
>>>>>> timing was not less than 15 min. All the attempts were the same, probably
>>>>>> that was the network throughput limit on the client node (same interface
>>>>>> used for the resultset and for cluster intercom). Say it again - that was
>>>>>> the production environment.
>>>>>>
>>>>>> Final schema:
>>>>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one
>>>>>> job for one table partition;
>>>>>> * each job executes SQL query, constructs a map with binary object
>>>>>> key and value. Then the job executes targetCache.invokeAll() specifying the
>>>>>> constructed map and the static EntryProcessor class. The EntryProcessor
>>>>>> contains the logic for cache binary entry update;
>>>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>>>
>>>>>> The schema described proved to be network error-free in my production
>>>>>> network and gives acceptable timing.
>>>>>>
>>>>>> Vladimir
>>>>>>
>>>>>> [1]
>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>
>>>>>> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
>>>>>> stephen.darlington@gridgain.com>:
>>>>>>
>>>>>>> I think it’s more that that putAll is mostly atomic, so the more
>>>>>>> records you save in one chunk, the more locking, etc. happens. Distributing
>>>>>>> as compute jobs means all the putAlls will be local which is beneficial,
>>>>>>> and the size of each put is going to be smaller (also beneficial).
>>>>>>>
>>>>>>> But that’s a lot of work that the data streamer already does for you
>>>>>>> and the data streamer also batches updates so would still be faster.
>>>>>>>
>>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <
>>>>>>> maximiliano.628@gmail.com> wrote:
>>>>>>>
>>>>>>> What would be the difference between doing cache.putAll(all rows)
>>>>>>> and separating them by affinity key+executing putAll inside a compute job.
>>>>>>> If I'm not mistaken, doing putAll should end up splitting those rows
>>>>>>> by affinity key in one of the servers, right?
>>>>>>> Is there a comparison of that?
>>>>>>>
>>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Vladimir,
>>>>>>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>>>>>>> This command uses 'IgniteDataStreamer' to write data into cluster
>>>>>>>> and parse CSV on the server node.
>>>>>>>>
>>>>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load
>>>>>>>> data.
>>>>>>>>
>>>>>>>> Hi Denis,
>>>>>>>>
>>>>>>>> Data space is 3.7Gb according to MSSQL table properries
>>>>>>>>
>>>>>>>> Vladimir
>>>>>>>>
>>>>>>>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>
>>>>>>>> <dm...@apache.org>:
>>>>>>>>
>>>>>>>> Hello Vladimir,
>>>>>>>>
>>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>>
>>>>>>>> -
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Sep 2020 I've published the paper about Loading Large Datasets into
>>>>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>>>>> version). The approach described works in production, but shows
>>>>>>>> inacceptable perfomance for very large tables.
>>>>>>>>
>>>>>>>> The story continues, and yesterday I've finished the proof of
>>>>>>>> concept for very fast loading of very big table. The partitioned MSSQL
>>>>>>>> table about 295 million rows was loaded by the 4-node Ignite cluster in 3
>>>>>>>> min 35 sec. Each node had executed its own SQL queries in parallel and then
>>>>>>>> distributed the loaded values across the other cluster nodes.
>>>>>>>>
>>>>>>>> Probably that result will be of interest for the community.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vladimir Chernyi
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Отправлено из мобильного приложения Яндекс.Почты
>>>>>>>>
>>>>>>>> --
>>>>>>>> Taras Ledkov
>>>>>>>> Mail-To: tledkov@gridgain.com
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>

Re: very fast loading of very big table

Posted by Vladimir Tchernyi <vt...@gmail.com>.
Hi Pavel,

the code [1] you shared is a kind of in-memory experiment with all the
processes inside a single JVM. My work differs - it comes from a big retail
business, and hence it is 100% practice-oriented. To be fair, it is
oriented to the state of things inside my company, and that is my question
- will my results be interesting to the community? I have seen a lot of
questions on the user list regarding data loading, and the difficulties
here seem to be a blocker to extending Ignite's popularity.

Please let me know if my case is not common in the industry. We have a big
bare-metal Windows MSSQL server and a number of bare-metal hosts, each with
virtualization software and a single CentOS virtual server inside. These
CentOS hosts currently form an Ignite cluster with 4 data nodes and 1 client
node. The example [2] I published last year is intended to solve the
business problem we have here:
1) the data currently present in the cluster has zero value;
2) the actual data is in the database and must be loaded into the cluster
ASAP. We use BinaryObject as cache key and value (a small sketch follows at
the end of this message);
3) the cluster performs some data processing and writes the result to the
database.

Unfortunately, the code [2] does not work 100% OK in my case: it tends to
say "is node still alive" and to drop the client node off the cluster. The
performance of the MSSQL server and the network is what it is; I consider it
a given restriction. It seems I made some progress when I managed to move
the data loading process from a single client node to multiple data nodes.
When extra data nodes are added, I expect the load performance to improve,
at least as long as my big MSSQL server can hold the load. So I want to know
how interesting my results will be if they are published.

WDYT?

[1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
[2] https://github.com/vtchernyi/FastDataLoad
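
Since BinaryObject keys and values were mentioned above, here is a tiny
sketch of building them without POJO classes. The type and field names are
invented for illustration and are not from our real model.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.binary.BinaryObject;

    public class BinaryKvSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                // Work with the cache in binary form, no deserialization into classes.
                IgniteCache<BinaryObject, BinaryObject> cache = ignite
                    .getOrCreateCache("targetCache")
                    .<BinaryObject, BinaryObject>withKeepBinary();

                BinaryObject key = ignite.binary().builder("PersonKey")
                    .setField("id", 42L)
                    .build();

                BinaryObject val = ignite.binary().builder("Person")
                    .setField("name", "John")
                    .setField("salary", 1000.0)
                    .build();

                cache.put(key, val);
            }
        }
    }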

Thu, 25 Feb 2021 at 11:01, Pavel Tupitsyn <pt...@apache.org>:

> Vladimir,
>
> Thanks for getting back to us. A full example that clarifies the situation
> will be great!
>
> > Can you share your code as a GitHub project? Maybe with the script to
> reproduce 6 GB of data.
>
> It is super trivial, I just wanted to get a sense of the throughput and
> check if we have some kind of a regression in recent versions (we don't) [1]
> Also I realised that the data size can be counted very differently - do we
> account for DB overhead and how?
>
> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>
> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <vt...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'd spent some time thinking about the community comments on my post. It
>> seems that Ignite is really not a bottleneck here. The performance of my
>> production MSSQL is a given restriction and the problem is to ensure fast
>> loading by executing multiple parallel queries. I'll test my code in
>> production for a couple of months for possible problems. If it will be OK,
>> probably the complete/downloadable/compilable GitHub example will be useful
>> for the community.
>>
>> WDYT?
>>
>> пт, 19 февр. 2021 г. в 21:47, Vladimir Tchernyi <vt...@gmail.com>:
>>
>>> Pavel,
>>>
>>> maybe it's time to put your five-cent in. Can you share your code as a
>>> GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>
>>> As for MSSQL data retrieval being the bottleneck - don't think so, I got
>>> 15 min load time for 1 node and 3.5 min time for 4 nodes. Looks like a
>>> linear dependency (the table and the RDBMS server were the same).
>>> --
>>> Vladimir
>>>
>>> пт, 19 февр. 2021 г. в 19:47, Pavel Tupitsyn <pt...@apache.org>:
>>>
>>>> > First of all, I tried to select the whole table as once
>>>>
>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
>>>> Ignite.
>>>>
>>>> Can you run a test where some dummy data of the same size as real data
>>>> is generated and inserted into Ignite,
>>>> so that we test Ignite perf only, excluding MSSQL from the equation?
>>>> For example, streaming 300 million entries (total size 6 GB) takes
>>>> around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>>
>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vt...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>> thanks for your interest in my work.
>>>>>
>>>>> I didn't try COPY FROM since I've tried to work with Ignite SQL a
>>>>> couple of years ago and didn't succeed. Probably because examples available
>>>>> aren't complete/downloadable/compilable (the paper [1] contains GitHub
>>>>> repo, that is my five cents in changing the status quo). My interest is in
>>>>> KV API.
>>>>>
>>>>> I did try a data streamer, and that was my first try. I did not notice
>>>>> a significant time reduction in using code from my paper [1] versus data
>>>>> streamer/receiver. There was some memory economy with the streamer, though.
>>>>> I must say my experiment was made on a heavily loaded production mssql
>>>>> server. Filtered query with 300K rows resultset takes about 15 sec. The
>>>>> story follows.
>>>>>
>>>>> First of all, I tried to select the whole table as once, I got the
>>>>> network timeout and the client node was dropped off the cluster (is node
>>>>> still alive?).
>>>>> So I'd partitioned the table and executed a number of queries
>>>>> one-by-one on the client node, each query for the specific table partition.
>>>>> That process took about 90 min. Inacceptable time.
>>>>>
>>>>> Then I tried to execute my queries in parallel on the client node,
>>>>> each query executing dataStreamer.addData() for a single dataStreamer. The
>>>>> timing was not less than 15 min. All the attempts were the same, probably
>>>>> that was the network throughput limit on the client node (same interface
>>>>> used for the resultset and for cluster intercom). Say it again - that was
>>>>> the production environment.
>>>>>
>>>>> Final schema:
>>>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one
>>>>> job for one table partition;
>>>>> * each job executes SQL query, constructs a map with binary object key
>>>>> and value. Then the job executes targetCache.invokeAll() specifying the
>>>>> constructed map and the static EntryProcessor class. The EntryProcessor
>>>>> contains the logic for cache binary entry update;
>>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>>
>>>>> The schema described proved to be network error-free in my production
>>>>> network and gives acceptable timing.
>>>>>
>>>>> Vladimir
>>>>>
>>>>> [1]
>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>
>>>>> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
>>>>> stephen.darlington@gridgain.com>:
>>>>>
>>>>>> I think it’s more that that putAll is mostly atomic, so the more
>>>>>> records you save in one chunk, the more locking, etc. happens. Distributing
>>>>>> as compute jobs means all the putAlls will be local which is beneficial,
>>>>>> and the size of each put is going to be smaller (also beneficial).
>>>>>>
>>>>>> But that’s a lot of work that the data streamer already does for you
>>>>>> and the data streamer also batches updates so would still be faster.
>>>>>>
>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <
>>>>>> maximiliano.628@gmail.com> wrote:
>>>>>>
>>>>>> What would be the difference between doing cache.putAll(all rows) and
>>>>>> separating them by affinity key+executing putAll inside a compute job.
>>>>>> If I'm not mistaken, doing putAll should end up splitting those rows
>>>>>> by affinity key in one of the servers, right?
>>>>>> Is there a comparison of that?
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vladimir,
>>>>>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>>>>>> This command uses 'IgniteDataStreamer' to write data into cluster
>>>>>>> and parse CSV on the server node.
>>>>>>>
>>>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load
>>>>>>> data.
>>>>>>>
>>>>>>> Hi Denis,
>>>>>>>
>>>>>>> Data space is 3.7Gb according to MSSQL table properries
>>>>>>>
>>>>>>> Vladimir
>>>>>>>
>>>>>>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>
>>>>>>> <dm...@apache.org>:
>>>>>>>
>>>>>>> Hello Vladimir,
>>>>>>>
>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>>>>>>
>>>>>>> Sep 2020 I've published the paper about Loading Large Datasets into
>>>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>>>> version). The approach described works in production, but shows
>>>>>>> inacceptable perfomance for very large tables.
>>>>>>>
>>>>>>> The story continues, and yesterday I've finished the proof of
>>>>>>> concept for very fast loading of very big table. The partitioned MSSQL
>>>>>>> table about 295 million rows was loaded by the 4-node Ignite cluster in 3
>>>>>>> min 35 sec. Each node had executed its own SQL queries in parallel and then
>>>>>>> distributed the loaded values across the other cluster nodes.
>>>>>>>
>>>>>>> Probably that result will be of interest for the community.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vladimir Chernyi
>>>>>>>
>>>>>>> [1]
>>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Отправлено из мобильного приложения Яндекс.Почты
>>>>>>>
>>>>>>> --
>>>>>>> Taras Ledkov
>>>>>>> Mail-To: tledkov@gridgain.com
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>

Re: very fast loading of very big table

Posted by Pavel Tupitsyn <pt...@apache.org>.
Vladimir,

Thanks for getting back to us. A full example that clarifies the situation
will be great!

> Can you share your code as a GitHub project? Maybe with the script to
reproduce 6 GB of data.

It is super trivial; I just wanted to get a sense of the throughput and
check whether we have some kind of regression in recent versions (we don't)
[1]. Also, I realised that the data size can be counted very differently -
do we account for DB overhead, and how?

[1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15

On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <vt...@gmail.com>
wrote:

> Hi,
>
> I'd spent some time thinking about the community comments on my post. It
> seems that Ignite is really not a bottleneck here. The performance of my
> production MSSQL is a given restriction and the problem is to ensure fast
> loading by executing multiple parallel queries. I'll test my code in
> production for a couple of months for possible problems. If it will be OK,
> probably the complete/downloadable/compilable GitHub example will be useful
> for the community.
>
> WDYT?
>
> пт, 19 февр. 2021 г. в 21:47, Vladimir Tchernyi <vt...@gmail.com>:
>
>> Pavel,
>>
>> maybe it's time to put your five-cent in. Can you share your code as a
>> GitHub project? Maybe with the script to reproduce 6 GB of data.
>>
>> As for MSSQL data retrieval being the bottleneck - don't think so, I got
>> 15 min load time for 1 node and 3.5 min time for 4 nodes. Looks like a
>> linear dependency (the table and the RDBMS server were the same).
>> --
>> Vladimir
>>
>> пт, 19 февр. 2021 г. в 19:47, Pavel Tupitsyn <pt...@apache.org>:
>>
>>> > First of all, I tried to select the whole table as once
>>>
>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
>>> Ignite.
>>>
>>> Can you run a test where some dummy data of the same size as real data
>>> is generated and inserted into Ignite,
>>> so that we test Ignite perf only, excluding MSSQL from the equation?
>>> For example, streaming 300 million entries (total size 6 GB) takes
>>> around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>
>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vt...@gmail.com>
>>> wrote:
>>>
>>>> Hi folks,
>>>> thanks for your interest in my work.
>>>>
>>>> I didn't try COPY FROM since I've tried to work with Ignite SQL a
>>>> couple of years ago and didn't succeed. Probably because examples available
>>>> aren't complete/downloadable/compilable (the paper [1] contains GitHub
>>>> repo, that is my five cents in changing the status quo). My interest is in
>>>> KV API.
>>>>
>>>> I did try a data streamer, and that was my first try. I did not notice
>>>> a significant time reduction in using code from my paper [1] versus data
>>>> streamer/receiver. There was some memory economy with the streamer, though.
>>>> I must say my experiment was made on a heavily loaded production mssql
>>>> server. Filtered query with 300K rows resultset takes about 15 sec. The
>>>> story follows.
>>>>
>>>> First of all, I tried to select the whole table as once, I got the
>>>> network timeout and the client node was dropped off the cluster (is node
>>>> still alive?).
>>>> So I'd partitioned the table and executed a number of queries
>>>> one-by-one on the client node, each query for the specific table partition.
>>>> That process took about 90 min. Inacceptable time.
>>>>
>>>> Then I tried to execute my queries in parallel on the client node, each
>>>> query executing dataStreamer.addData() for a single dataStreamer. The
>>>> timing was not less than 15 min. All the attempts were the same, probably
>>>> that was the network throughput limit on the client node (same interface
>>>> used for the resultset and for cluster intercom). Say it again - that was
>>>> the production environment.
>>>>
>>>> Final schema:
>>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one
>>>> job for one table partition;
>>>> * each job executes SQL query, constructs a map with binary object key
>>>> and value. Then the job executes targetCache.invokeAll() specifying the
>>>> constructed map and the static EntryProcessor class. The EntryProcessor
>>>> contains the logic for cache binary entry update;
>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>
>>>> The schema described proved to be network error-free in my production
>>>> network and gives acceptable timing.
>>>>
>>>> Vladimir
>>>>
>>>> [1]
>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>
>>>> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
>>>> stephen.darlington@gridgain.com>:
>>>>
>>>>> I think it’s more that that putAll is mostly atomic, so the more
>>>>> records you save in one chunk, the more locking, etc. happens. Distributing
>>>>> as compute jobs means all the putAlls will be local which is beneficial,
>>>>> and the size of each put is going to be smaller (also beneficial).
>>>>>
>>>>> But that’s a lot of work that the data streamer already does for you
>>>>> and the data streamer also batches updates so would still be faster.
>>>>>
>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <
>>>>> maximiliano.628@gmail.com> wrote:
>>>>>
>>>>> What would be the difference between doing cache.putAll(all rows) and
>>>>> separating them by affinity key+executing putAll inside a compute job.
>>>>> If I'm not mistaken, doing putAll should end up splitting those rows
>>>>> by affinity key in one of the servers, right?
>>>>> Is there a comparison of that?
>>>>>
>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vladimir,
>>>>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>>>>> This command uses 'IgniteDataStreamer' to write data into cluster and
>>>>>> parse CSV on the server node.
>>>>>>
>>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>>>>>
>>>>>> Hi Denis,
>>>>>>
>>>>>> Data space is 3.7Gb according to MSSQL table properries
>>>>>>
>>>>>> Vladimir
>>>>>>
>>>>>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>
>>>>>> <dm...@apache.org>:
>>>>>>
>>>>>> Hello Vladimir,
>>>>>>
>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>
>>>>>> -
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>>>>>
>>>>>> Sep 2020 I've published the paper about Loading Large Datasets into
>>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>>> version). The approach described works in production, but shows
>>>>>> inacceptable perfomance for very large tables.
>>>>>>
>>>>>> The story continues, and yesterday I've finished the proof of concept
>>>>>> for very fast loading of very big table. The partitioned MSSQL table about
>>>>>> 295 million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec.
>>>>>> Each node had executed its own SQL queries in parallel and then distributed
>>>>>> the loaded values across the other cluster nodes.
>>>>>>
>>>>>> Probably that result will be of interest for the community.
>>>>>>
>>>>>> Regards,
>>>>>> Vladimir Chernyi
>>>>>>
>>>>>> [1]
>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Отправлено из мобильного приложения Яндекс.Почты
>>>>>>
>>>>>> --
>>>>>> Taras Ledkov
>>>>>> Mail-To: tledkov@gridgain.com
>>>>>>
>>>>>>
>>>>>
>>>>>

Re: very fast loading of very big table

Posted by Vladimir Tchernyi <vt...@gmail.com>.
Hi,

I've spent some time thinking about the community comments on my post. It
seems that Ignite is really not the bottleneck here. The performance of my
production MSSQL server is a given restriction, and the problem is to ensure
fast loading by executing multiple parallel queries. I'll test my code in
production for a couple of months to catch possible problems. If it works
OK, a complete/downloadable/compilable GitHub example will probably be
useful for the community.

WDYT?

Fri, 19 Feb 2021 at 21:47, Vladimir Tchernyi <vt...@gmail.com>:

> Pavel,
>
> maybe it's time to put your five-cent in. Can you share your code as a
> GitHub project? Maybe with the script to reproduce 6 GB of data.
>
> As for MSSQL data retrieval being the bottleneck - don't think so, I got
> 15 min load time for 1 node and 3.5 min time for 4 nodes. Looks like a
> linear dependency (the table and the RDBMS server were the same).
> --
> Vladimir
>
> пт, 19 февр. 2021 г. в 19:47, Pavel Tupitsyn <pt...@apache.org>:
>
>> > First of all, I tried to select the whole table as once
>>
>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
>> Ignite.
>>
>> Can you run a test where some dummy data of the same size as real data is
>> generated and inserted into Ignite,
>> so that we test Ignite perf only, excluding MSSQL from the equation?
>> For example, streaming 300 million entries (total size 6 GB) takes around
>> 1 minute on my machine, with a simple single-threaded DataStreamer.
>>
>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vt...@gmail.com>
>> wrote:
>>
>>> Hi folks,
>>> thanks for your interest in my work.
>>>
>>> I didn't try COPY FROM since I've tried to work with Ignite SQL a couple
>>> of years ago and didn't succeed. Probably because examples available aren't
>>> complete/downloadable/compilable (the paper [1] contains GitHub repo, that
>>> is my five cents in changing the status quo). My interest is in KV API.
>>>
>>> I did try a data streamer, and that was my first try. I did not notice a
>>> significant time reduction in using code from my paper [1] versus data
>>> streamer/receiver. There was some memory economy with the streamer, though.
>>> I must say my experiment was made on a heavily loaded production mssql
>>> server. Filtered query with 300K rows resultset takes about 15 sec. The
>>> story follows.
>>>
>>> First of all, I tried to select the whole table as once, I got the
>>> network timeout and the client node was dropped off the cluster (is node
>>> still alive?).
>>> So I'd partitioned the table and executed a number of queries one-by-one
>>> on the client node, each query for the specific table partition. That
>>> process took about 90 min. Inacceptable time.
>>>
>>> Then I tried to execute my queries in parallel on the client node, each
>>> query executing dataStreamer.addData() for a single dataStreamer. The
>>> timing was not less than 15 min. All the attempts were the same, probably
>>> that was the network throughput limit on the client node (same interface
>>> used for the resultset and for cluster intercom). Say it again - that was
>>> the production environment.
>>>
>>> Final schema:
>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job
>>> for one table partition;
>>> * each job executes SQL query, constructs a map with binary object key
>>> and value. Then the job executes targetCache.invokeAll() specifying the
>>> constructed map and the static EntryProcessor class. The EntryProcessor
>>> contains the logic for cache binary entry update;
>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>
>>> The schema described proved to be network error-free in my production
>>> network and gives acceptable timing.
>>>
>>> Vladimir
>>>
>>> [1]
>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>
>>> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
>>> stephen.darlington@gridgain.com>:
>>>
>>>> I think it’s more that that putAll is mostly atomic, so the more
>>>> records you save in one chunk, the more locking, etc. happens. Distributing
>>>> as compute jobs means all the putAlls will be local which is beneficial,
>>>> and the size of each put is going to be smaller (also beneficial).
>>>>
>>>> But that’s a lot of work that the data streamer already does for you
>>>> and the data streamer also batches updates so would still be faster.
>>>>
>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <
>>>> maximiliano.628@gmail.com> wrote:
>>>>
>>>> What would be the difference between doing cache.putAll(all rows) and
>>>> separating them by affinity key+executing putAll inside a compute job.
>>>> If I'm not mistaken, doing putAll should end up splitting those rows by
>>>> affinity key in one of the servers, right?
>>>> Is there a comparison of that?
>>>>
>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com>
>>>> wrote:
>>>>
>>>>> Hi Vladimir,
>>>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>>>> This command uses 'IgniteDataStreamer' to write data into cluster and
>>>>> parse CSV on the server node.
>>>>>
>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>>>>
>>>>> Hi Denis,
>>>>>
>>>>> Data space is 3.7Gb according to MSSQL table properries
>>>>>
>>>>> Vladimir
>>>>>
>>>>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>
>>>>> <dm...@apache.org>:
>>>>>
>>>>> Hello Vladimir,
>>>>>
>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>>
>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>>>>
>>>>> Sep 2020 I've published the paper about Loading Large Datasets into
>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>> version). The approach described works in production, but shows
>>>>> inacceptable perfomance for very large tables.
>>>>>
>>>>> The story continues, and yesterday I've finished the proof of concept
>>>>> for very fast loading of very big table. The partitioned MSSQL table about
>>>>> 295 million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec.
>>>>> Each node had executed its own SQL queries in parallel and then distributed
>>>>> the loaded values across the other cluster nodes.
>>>>>
>>>>> Probably that result will be of interest for the community.
>>>>>
>>>>> Regards,
>>>>> Vladimir Chernyi
>>>>>
>>>>> [1]
>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Отправлено из мобильного приложения Яндекс.Почты
>>>>>
>>>>> --
>>>>> Taras Ledkov
>>>>> Mail-To: tledkov@gridgain.com
>>>>>
>>>>>
>>>>
>>>>

Re: very fast loading of very big table

Posted by Vladimir Tchernyi <vt...@gmail.com>.
Pavel,

maybe it's time to put your five cents in. Can you share your code as a
GitHub project? Maybe with a script to reproduce the 6 GB of data.

As for MSSQL data retrieval being the bottleneck - I don't think so: I got
15 min load time for 1 node and 3.5 min for 4 nodes. That looks like a
linear dependency (the table and the RDBMS server were the same).
--
Vladimir

Fri, 19 Feb 2021 at 19:47, Pavel Tupitsyn <pt...@apache.org>:

> > First of all, I tried to select the whole table as once
>
> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
> Ignite.
>
> Can you run a test where some dummy data of the same size as real data is
> generated and inserted into Ignite,
> so that we test Ignite perf only, excluding MSSQL from the equation?
> For example, streaming 300 million entries (total size 6 GB) takes around
> 1 minute on my machine, with a simple single-threaded DataStreamer.
>
> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vt...@gmail.com>
> wrote:
>
>> Hi folks,
>> thanks for your interest in my work.
>>
>> I didn't try COPY FROM since I've tried to work with Ignite SQL a couple
>> of years ago and didn't succeed. Probably because examples available aren't
>> complete/downloadable/compilable (the paper [1] contains GitHub repo, that
>> is my five cents in changing the status quo). My interest is in KV API.
>>
>> I did try a data streamer, and that was my first try. I did not notice a
>> significant time reduction in using code from my paper [1] versus data
>> streamer/receiver. There was some memory economy with the streamer, though.
>> I must say my experiment was made on a heavily loaded production mssql
>> server. Filtered query with 300K rows resultset takes about 15 sec. The
>> story follows.
>>
>> First of all, I tried to select the whole table as once, I got the
>> network timeout and the client node was dropped off the cluster (is node
>> still alive?).
>> So I'd partitioned the table and executed a number of queries one-by-one
>> on the client node, each query for the specific table partition. That
>> process took about 90 min. Inacceptable time.
>>
>> Then I tried to execute my queries in parallel on the client node, each
>> query executing dataStreamer.addData() for a single dataStreamer. The
>> timing was not less than 15 min. All the attempts were the same, probably
>> that was the network throughput limit on the client node (same interface
>> used for the resultset and for cluster intercom). Say it again - that was
>> the production environment.
>>
>> Final schema:
>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job
>> for one table partition;
>> * each job executes SQL query, constructs a map with binary object key
>> and value. Then the job executes targetCache.invokeAll() specifying the
>> constructed map and the static EntryProcessor class. The EntryProcessor
>> contains the logic for cache binary entry update;
>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>
>> The schema described proved to be network error-free in my production
>> network and gives acceptable timing.
>>
>> Vladimir
>>
>> [1]
>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>
>> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
>> stephen.darlington@gridgain.com>:
>>
>>> I think it’s more that that putAll is mostly atomic, so the more records
>>> you save in one chunk, the more locking, etc. happens. Distributing as
>>> compute jobs means all the putAlls will be local which is beneficial, and
>>> the size of each put is going to be smaller (also beneficial).
>>>
>>> But that’s a lot of work that the data streamer already does for you and
>>> the data streamer also batches updates so would still be faster.
>>>
>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <ma...@gmail.com>
>>> wrote:
>>>
>>> What would be the difference between doing cache.putAll(all rows) and
>>> separating them by affinity key+executing putAll inside a compute job.
>>> If I'm not mistaken, doing putAll should end up splitting those rows by
>>> affinity key in one of the servers, right?
>>> Is there a comparison of that?
>>>
>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com>
>>> wrote:
>>>
>>>> Hi Vladimir,
>>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>>> This command uses 'IgniteDataStreamer' to write data into cluster and
>>>> parse CSV on the server node.
>>>>
>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>>>
>>>> Hi Denis,
>>>>
>>>> Data space is 3.7Gb according to MSSQL table properries
>>>>
>>>> Vladimir
>>>>
>>>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>
>>>> <dm...@apache.org>:
>>>>
>>>> Hello Vladimir,
>>>>
>>>> Good to hear from you! How much is that in gigabytes?
>>>>
>>>> -
>>>> Denis
>>>>
>>>>
>>>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>>>
>>>> Sep 2020 I've published the paper about Loading Large Datasets into
>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>> version). The approach described works in production, but shows
>>>> inacceptable perfomance for very large tables.
>>>>
>>>> The story continues, and yesterday I've finished the proof of concept
>>>> for very fast loading of very big table. The partitioned MSSQL table about
>>>> 295 million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec.
>>>> Each node had executed its own SQL queries in parallel and then distributed
>>>> the loaded values across the other cluster nodes.
>>>>
>>>> Probably that result will be of interest for the community.
>>>>
>>>> Regards,
>>>> Vladimir Chernyi
>>>>
>>>> [1]
>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>> [2] https://m.habr.com/ru/post/526708/
>>>>
>>>>
>>>>
>>>> --
>>>> Отправлено из мобильного приложения Яндекс.Почты
>>>>
>>>> --
>>>> Taras Ledkov
>>>> Mail-To: tledkov@gridgain.com
>>>>
>>>>
>>>
>>>

Re: very fast loading of very big table

Posted by Pavel Tupitsyn <pt...@apache.org>.
> First of all, I tried to select the whole table as once

Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
Ignite.

Can you run a test where some dummy data of the same size as real data is
generated and inserted into Ignite,
so that we test Ignite perf only, excluding MSSQL from the equation?
For example, streaming 300 million entries (total size 6 GB) takes around 1
minute on my machine, with a simple single-threaded DataStreamer.
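
For reference, a minimal sketch of such a dummy-data test could look like the
Java below; the cache name, the 20-byte payload, and the client-mode start are
assumptions for the sketch, not anything taken from this thread:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class DummyStreamerBenchmark {
    public static void main(String[] args) {
        // Illustrative only: stream 300M small dummy entries from a client node.
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);
        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.getOrCreateCache("dummyCache");

            long start = System.nanoTime();
            try (IgniteDataStreamer<Long, byte[]> streamer = ignite.dataStreamer("dummyCache")) {
                byte[] payload = new byte[20]; // ~20 bytes x 300M entries is roughly 6 GB of values
                for (long i = 0; i < 300_000_000L; i++)
                    streamer.addData(i, payload);
            } // close() flushes the remaining per-node buffers
            System.out.printf("Loaded in %d ms%n", (System.nanoTime() - start) / 1_000_000);
        }
    }
}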

On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vt...@gmail.com>
wrote:

> Hi folks,
> thanks for your interest in my work.
>
> I didn't try COPY FROM since I've tried to work with Ignite SQL a couple
> of years ago and didn't succeed. Probably because examples available aren't
> complete/downloadable/compilable (the paper [1] contains GitHub repo, that
> is my five cents in changing the status quo). My interest is in KV API.
>
> I did try a data streamer, and that was my first try. I did not notice a
> significant time reduction in using code from my paper [1] versus data
> streamer/receiver. There was some memory economy with the streamer, though.
> I must say my experiment was made on a heavily loaded production mssql
> server. Filtered query with 300K rows resultset takes about 15 sec. The
> story follows.
>
> First of all, I tried to select the whole table as once, I got the network
> timeout and the client node was dropped off the cluster (is node still
> alive?).
> So I'd partitioned the table and executed a number of queries one-by-one
> on the client node, each query for the specific table partition. That
> process took about 90 min. Inacceptable time.
>
> Then I tried to execute my queries in parallel on the client node, each
> query executing dataStreamer.addData() for a single dataStreamer. The
> timing was not less than 15 min. All the attempts were the same, probably
> that was the network throughput limit on the client node (same interface
> used for the resultset and for cluster intercom). Say it again - that was
> the production environment.
>
> Final schema:
> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job
> for one table partition;
> * each job executes SQL query, constructs a map with binary object key and
> value. Then the job executes targetCache.invokeAll() specifying the
> constructed map and the static EntryProcessor class. The EntryProcessor
> contains the logic for cache binary entry update;
> * ComputeTask.reduce() summarizes the row count reported by each job.
>
> The schema described proved to be network error-free in my production
> network and gives acceptable timing.
>
> Vladimir
>
> [1]
> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>
> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
> stephen.darlington@gridgain.com>:
>
>> I think it’s more that that putAll is mostly atomic, so the more records
>> you save in one chunk, the more locking, etc. happens. Distributing as
>> compute jobs means all the putAlls will be local which is beneficial, and
>> the size of each put is going to be smaller (also beneficial).
>>
>> But that’s a lot of work that the data streamer already does for you and
>> the data streamer also batches updates so would still be faster.
>>
>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <ma...@gmail.com>
>> wrote:
>>
>> What would be the difference between doing cache.putAll(all rows) and
>> separating them by affinity key+executing putAll inside a compute job.
>> If I'm not mistaken, doing putAll should end up splitting those rows by
>> affinity key in one of the servers, right?
>> Is there a comparison of that?
>>
>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com>
>> wrote:
>>
>>> Hi Vladimir,
>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>> This command uses 'IgniteDataStreamer' to write data into cluster and
>>> parse CSV on the server node.
>>>
>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>>
>>> Hi Denis,
>>>
>>> Data space is 3.7Gb according to MSSQL table properries
>>>
>>> Vladimir
>>>
>>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>:
>>>
>>> Hello Vladimir,
>>>
>>> Good to hear from you! How much is that in gigabytes?
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>>
>>> Sep 2020 I've published the paper about Loading Large Datasets into
>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>> version). The approach described works in production, but shows
>>> inacceptable perfomance for very large tables.
>>>
>>> The story continues, and yesterday I've finished the proof of concept
>>> for very fast loading of very big table. The partitioned MSSQL table about
>>> 295 million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec.
>>> Each node had executed its own SQL queries in parallel and then distributed
>>> the loaded values across the other cluster nodes.
>>>
>>> Probably that result will be of interest for the community.
>>>
>>> Regards,
>>> Vladimir Chernyi
>>>
>>> [1]
>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>> [2] https://m.habr.com/ru/post/526708/
>>>
>>>
>>>
>>> --
>>> Отправлено из мобильного приложения Яндекс.Почты
>>>
>>> --
>>> Taras Ledkov
>>> Mail-To: tledkov@gridgain.com
>>>
>>>
>>
>>

Re: very fast loading of very big table

Posted by Vladimir Tchernyi <vt...@gmail.com>.
Hi folks,
thanks for your interest in my work.

I didn't try COPY FROM: I tried to work with Ignite SQL a couple of years
ago and didn't succeed, probably because the available examples aren't
complete/downloadable/compilable (the paper [1] contains a GitHub repo; that
is my two cents toward changing the status quo). My interest is in the KV API.

I did try a data streamer, and that was my first attempt. I did not notice a
significant time reduction with the code from my paper [1] versus a data
streamer/receiver. There were some memory savings with the streamer, though.
I must say my experiment was run on a heavily loaded production MSSQL
server: a filtered query with a 300K-row result set takes about 15 sec. The
story follows.

First of all, I tried to select the whole table at once; I got a network
timeout and the client node was dropped from the cluster ("is node still
alive?").
So I partitioned the table and executed a number of queries one by one on
the client node, each query for a specific table partition. That process
took about 90 min. Unacceptable.

Then I tried to execute my queries in parallel on the client node, each
query calling dataStreamer.addData() on a single dataStreamer. The timing
was never less than 15 min. All the attempts came out about the same;
probably that was the network throughput limit on the client node (the same
interface is used for the result set and for cluster intercom). Again, that
was the production environment.
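
(For context, a rough sketch of that parallel-queries-into-one-streamer
attempt is below. The table name, the MSSQL partition function, and the
key/value types are assumptions, not the production code.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;

public class ParallelAddData {
    static void load(Ignite ignite, String jdbcUrl, int partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        CountDownLatch done = new CountDownLatch(partitions);
        try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("personCache")) {
            for (int p = 0; p < partitions; p++) {
                final int part = p;
                pool.submit(() -> {
                    // Each task reads one MSSQL table partition and feeds the shared streamer;
                    // IgniteDataStreamer.addData() is safe to call from multiple threads.
                    try (Connection c = DriverManager.getConnection(jdbcUrl);
                         PreparedStatement st = c.prepareStatement(
                             "SELECT id, name FROM person WHERE $PARTITION.personPartFn(id) = ?")) {
                        st.setInt(1, part + 1); // MSSQL partition numbers start at 1
                        try (ResultSet rs = st.executeQuery()) {
                            while (rs.next())
                                streamer.addData(rs.getLong(1), rs.getString(2));
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    } finally {
                        done.countDown();
                    }
                });
            }
            done.await(); // wait for all partitions before the streamer is closed and flushed
        } finally {
            pool.shutdown();
        }
    }
}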

Final schema:
* ComputeTask.map() schedules ComputeJobs across the cluster nodes, one job
per table partition;
* each job executes its SQL query and constructs a map of binary object
keys and values. Then the job calls targetCache.invokeAll(), specifying the
constructed map and the static EntryProcessor class. The EntryProcessor
contains the logic for updating a cache binary entry;
* ComputeTask.reduce() sums up the row counts reported by the jobs.

The schema described proved to be network error-free in my production
network and gives acceptable timing.
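
For illustration, a compressed sketch of that schema might look as follows.
The class, cache, and table names, the partition filter, and the key/value
layout are assumptions; the working code lives in the GitHub repo referenced
from [1].

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.cache.processor.EntryProcessor;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskSplitAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

public class LoadTask extends ComputeTaskSplitAdapter<Integer, Long> {
    // map() step (via the simpler split()): one job per MSSQL table partition.
    @Override protected Collection<? extends ComputeJob> split(int gridSize, Integer partCnt) {
        List<ComputeJob> jobs = new ArrayList<>(partCnt);
        for (int p = 0; p < partCnt; p++)
            jobs.add(new LoadJob(p));
        return jobs;
    }

    // reduce() step: sum the row counts reported by the jobs.
    @Override public Long reduce(List<ComputeJobResult> results) {
        return results.stream().mapToLong(r -> r.<Long>getData()).sum();
    }

    private static class LoadJob extends ComputeJobAdapter {
        @IgniteInstanceResource
        private transient Ignite ignite;
        private final int part;

        LoadJob(int part) { this.part = part; }

        @Override public Object execute() {
            // Run the SQL query for table partition 'part' and build binary
            // key/value pairs (JDBC plumbing and binary builders omitted).
            Map<BinaryObject, BinaryObject> batch = new HashMap<>();

            IgniteCache<BinaryObject, BinaryObject> cache =
                ignite.cache("targetCache").<BinaryObject, BinaryObject>withKeepBinary();
            // Static EntryProcessor; the batch is passed as an argument so each
            // invocation can look up the new value for its own key.
            cache.invokeAll(batch.keySet(), new UpdateProcessor(), batch);

            return (long) batch.size();
        }
    }

    private static class UpdateProcessor
        implements EntryProcessor<BinaryObject, BinaryObject, Void> {
        @Override public Void process(MutableEntry<BinaryObject, BinaryObject> entry, Object... args) {
            @SuppressWarnings("unchecked")
            Map<BinaryObject, BinaryObject> batch = (Map<BinaryObject, BinaryObject>) args[0];
            entry.setValue(batch.get(entry.getKey())); // placeholder update logic
            return null;
        }
    }
}

Such a task would be launched with something like
ignite.compute().execute(new LoadTask(), partitionCount), so each job runs on
a server node rather than on the client.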

Vladimir

[1]
https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api

пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
stephen.darlington@gridgain.com>:

> I think it’s more that that putAll is mostly atomic, so the more records
> you save in one chunk, the more locking, etc. happens. Distributing as
> compute jobs means all the putAlls will be local which is beneficial, and
> the size of each put is going to be smaller (also beneficial).
>
> But that’s a lot of work that the data streamer already does for you and
> the data streamer also batches updates so would still be faster.
>
> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <ma...@gmail.com>
> wrote:
>
> What would be the difference between doing cache.putAll(all rows) and
> separating them by affinity key+executing putAll inside a compute job.
> If I'm not mistaken, doing putAll should end up splitting those rows by
> affinity key in one of the servers, right?
> Is there a comparison of that?
>
> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com> wrote:
>
>> Hi Vladimir,
>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
>> This command uses 'IgniteDataStreamer' to write data into cluster and
>> parse CSV on the server node.
>>
>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>
>> Hi Denis,
>>
>> Data space is 3.7Gb according to MSSQL table properries
>>
>> Vladimir
>>
>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>:
>>
>> Hello Vladimir,
>>
>> Good to hear from you! How much is that in gigabytes?
>>
>> -
>> Denis
>>
>>
>> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>>
>> Sep 2020 I've published the paper about Loading Large Datasets into
>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>> version). The approach described works in production, but shows
>> inacceptable perfomance for very large tables.
>>
>> The story continues, and yesterday I've finished the proof of concept for
>> very fast loading of very big table. The partitioned MSSQL table about 295
>> million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec. Each
>> node had executed its own SQL queries in parallel and then distributed the
>> loaded values across the other cluster nodes.
>>
>> Probably that result will be of interest for the community.
>>
>> Regards,
>> Vladimir Chernyi
>>
>> [1]
>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>> [2] https://m.habr.com/ru/post/526708/
>>
>>
>>
>> --
>> Отправлено из мобильного приложения Яндекс.Почты
>>
>> --
>> Taras Ledkov
>> Mail-To: tledkov@gridgain.com
>>
>>
>
>

Re: very fast loading of very big table

Posted by Stephen Darlington <st...@gridgain.com>.
I think it’s more that putAll is mostly atomic, so the more records you save in one chunk, the more locking, etc., happens. Distributing as compute jobs means all the putAlls will be local, which is beneficial, and the size of each put is going to be smaller (also beneficial).

But that’s a lot of work that the data streamer already does for you, and the data streamer also batches updates, so it would still be faster.
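
As an illustration, the "split by affinity and putAll inside a compute job"
variant being compared here could be sketched roughly like this; the cache
name and the key/value types are assumptions:

import java.util.HashMap;
import java.util.Map;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.lang.IgniteRunnable;
import org.apache.ignite.resources.IgniteInstanceResource;

public class AffinityPutAll {
    static void load(Ignite ignite, Map<Long, String> rows) {
        // Group the rows by the primary node that owns each key.
        Map<ClusterNode, Map<Long, String>> byNode = new HashMap<>();
        for (Map.Entry<Long, String> e : rows.entrySet()) {
            ClusterNode primary = ignite.<Long>affinity("personCache").mapKeyToNode(e.getKey());
            byNode.computeIfAbsent(primary, n -> new HashMap<>()).put(e.getKey(), e.getValue());
        }
        // Run a small putAll on each primary node, so every put stays node-local.
        byNode.forEach((node, chunk) ->
            ignite.compute(ignite.cluster().forNode(node)).run(new LocalPut(chunk)));
    }

    private static class LocalPut implements IgniteRunnable {
        @IgniteInstanceResource
        private transient Ignite ignite;
        private final Map<Long, String> chunk;

        LocalPut(Map<Long, String> chunk) { this.chunk = chunk; }

        @Override public void run() {
            IgniteCache<Long, String> cache = ignite.cache("personCache");
            cache.putAll(chunk); // smaller, node-local batches => less locking per put
        }
    }
}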

> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <ma...@gmail.com> wrote:
> 
> What would be the difference between doing cache.putAll(all rows) and separating them by affinity key+executing putAll inside a compute job.
> If I'm not mistaken, doing putAll should end up splitting those rows by affinity key in one of the servers, right? 
> Is there a comparison of that?
> 
> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tledkov@gridgain.com> wrote:
> Hi Vladimir,
> 
> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
> This command uses 'IgniteDataStreamer' to write data into cluster and parse CSV on the server node.
> 
> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
> 
>> Hi Denis,
>> 
>> Data space is 3.7Gb according to MSSQL table properries
>> 
>> Vladimir
>> 
>> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>:
>> Hello Vladimir, 
>> 
>> Good to hear from you! How much is that in gigabytes?
>> 
>> -
>> Denis
>> 
>> 
>> On Thu, Feb 18, 2021 at 10:06 PM <vtchernyi@gmail.com> wrote:
>> Sep 2020 I've published the paper about Loading Large Datasets into Apache Ignite by Using a Key-Value API (English [1] and Russian [2] version). The approach described works in production, but shows inacceptable perfomance for very large tables.
>> 
>> The story continues, and yesterday I've finished the proof of concept for very fast loading of very big table. The partitioned MSSQL table about 295 million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec. Each node had executed its own SQL queries in parallel and then distributed the loaded values across the other cluster nodes.
>> 
>> Probably that result will be of interest for the community.
>> 
>> Regards,
>> Vladimir Chernyi
>> 
>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>> [2] https://m.habr.com/ru/post/526708/
>> 
>> 
>> -- 
>> Отправлено из мобильного приложения Яндекс.Почты
> -- 
> Taras Ledkov
> Mail-To: tledkov@gridgain.com


Re: very fast loading of very big table

Posted by Maximiliano Gazquez <ma...@gmail.com>.
What would be the difference between doing cache.putAll(all rows) and
separating them by affinity key and executing putAll inside a compute job?
If I'm not mistaken, doing putAll should end up splitting those rows by
affinity key on one of the servers, right?
Is there a comparison of that?

On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tl...@gridgain.com> wrote:

> Hi Vladimir,
> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
> This command uses 'IgniteDataStreamer' to write data into cluster and
> parse CSV on the server node.
>
> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>
> Hi Denis,
>
> Data space is 3.7Gb according to MSSQL table properries
>
> Vladimir
>
> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>:
>
> Hello Vladimir,
>
> Good to hear from you! How much is that in gigabytes?
>
> -
> Denis
>
>
> On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:
>
> Sep 2020 I've published the paper about Loading Large Datasets into Apache
> Ignite by Using a Key-Value API (English [1] and Russian [2] version). The
> approach described works in production, but shows inacceptable perfomance
> for very large tables.
>
> The story continues, and yesterday I've finished the proof of concept for
> very fast loading of very big table. The partitioned MSSQL table about 295
> million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec. Each
> node had executed its own SQL queries in parallel and then distributed the
> loaded values across the other cluster nodes.
>
> Probably that result will be of interest for the community.
>
> Regards,
> Vladimir Chernyi
>
> [1]
> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
> [2] https://m.habr.com/ru/post/526708/
>
>
>
> --
> Отправлено из мобильного приложения Яндекс.Почты
>
> --
> Taras Ledkov
> Mail-To: tledkov@gridgain.com
>
>

Re: very fast loading of very big table

Posted by Taras Ledkov <tl...@gridgain.com>.
Hi Vladimir,

Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
This command uses 'IgniteDataStreamer' to write data into cluster and 
parse CSV on the server node.

PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
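
For anyone who wants to try that route, a minimal sketch over the thin JDBC
driver might look like this (the host, CSV path, table name, and column list
are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyFromCsv {
    public static void main(String[] args) throws Exception {
        // Thin JDBC driver; 10800 is the default client connector port.
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            // COPY streams the CSV rows into the table; the server side uses
            // IgniteDataStreamer to load them, as described above.
            stmt.executeUpdate(
                "COPY FROM '/data/person.csv' INTO Person (id, name) FORMAT CSV");
        }
    }
}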

> Hi Denis,
>
> Data space is 3.7Gb according to MSSQL table properries
>
> Vladimir
>
> 9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>:
>
>     Hello Vladimir,
>
>     Good to hear from you! How much is that in gigabytes?
>
>     -
>     Denis
>
>
>     On Thu, Feb 18, 2021 at 10:06 PM <vtchernyi@gmail.com> wrote:
>
>         Sep 2020 I've published the paper about Loading Large Datasets
>         into Apache Ignite by Using a Key-Value API (English [1] and
>         Russian [2] version). The approach described works in
>         production, but shows inacceptable perfomance for very large
>         tables.
>
>         The story continues, and yesterday I've finished the proof of
>         concept for very fast loading of very big table. The
>         partitioned MSSQL table about 295 million rows was loaded by
>         the 4-node Ignite cluster in 3 min 35 sec. Each node had
>         executed its own SQL queries in parallel and then distributed
>         the loaded values across the other cluster nodes.
>
>         Probably that result will be of interest for the community.
>
>         Regards,
>         Vladimir Chernyi
>
>         [1]
>         https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>         [2] https://m.habr.com/ru/post/526708/
>
>
>
> -- 
> Отправлено из мобильного приложения Яндекс.Почты

-- 
Taras Ledkov
Mail-To: tledkov@gridgain.com


Re: very fast loading of very big table

Posted by vt...@gmail.com.
Hi Denis,

Data space is 3.7Gb according to MSSQL table properties

Vladimir

9:47, 19 февраля 2021 г., Denis Magda <dm...@apache.org>:

> Hello Vladimir,
>
> Good to hear from you! How much is that in gigabytes?
>
> -
> Denis
>
>
> On Thu, Feb 18, 2021 at 10:06 PM <vtchernyi@gmail.com> wrote:
>
>> Sep 2020 I've published the paper about Loading Large Datasets into Apache
>> Ignite by Using a Key-Value API (English [1] and Russian [2] version). The
>> approach described works in production, but shows inacceptable perfomance
>> for very large tables.
>>
>> The story continues, and yesterday I've finished the proof of concept for
>> very fast loading of very big table. The partitioned MSSQL table about 295
>> million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec. Each
>> node had executed its own SQL queries in parallel and then distributed the
>> loaded values across the other cluster nodes.
>>
>> Probably that result will be of interest for the community.
>>
>> Regards,
>> Vladimir Chernyi
>>
>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>> [2] https://m.habr.com/ru/post/526708/

--
Отправлено из мобильного приложения Яндекс.Почты


Re: very fast loading of very big table

Posted by Denis Magda <dm...@apache.org>.
Hello Vladimir,

Good to hear from you! How much is that in gigabytes?

-
Denis


On Thu, Feb 18, 2021 at 10:06 PM <vt...@gmail.com> wrote:

> Sep 2020 I've published the paper about Loading Large Datasets into Apache
> Ignite by Using a Key-Value API (English [1] and Russian [2] version). The
> approach described works in production, but shows inacceptable perfomance
> for very large tables.
>
> The story continues, and yesterday I've finished the proof of concept for
> very fast loading of very big table. The partitioned MSSQL table about 295
> million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec. Each
> node had executed its own SQL queries in parallel and then distributed the
> loaded values across the other cluster nodes.
>
> Probably that result will be of interest for the community.
>
> Regards,
> Vladimir Chernyi
>
> [1]
> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
> [2] https://m.habr.com/ru/post/526708/
>
>