Posted to user@ignite.apache.org by joseheitor <jo...@heitorprojects.com> on 2018/11/24 13:49:02 UTC

Optimum storage strategy (Advice?)

Given the following model structure:

{
  "trans": {
    "cust": {
      "firstname": "Bone",
      "lastname": "Klebes",
      "email": "[hidden email]",
      "gender": "Male"
    },
    "ipaddress": "104.89.149.184",
    "date": "2017-12-01",
    "amount": 1217,
    "currency": "NOK"
  }
}

What is the recommended storage strategy for optimum querying? 
(example: date='2018-12-01' and lastname='Klebes' and amount > 1000)

I have tried the following strategies on a partitioned, two-node cluster
having 1 000 000 documents, with the following results:

1. - DataGrid:
      One document per record
      Using IgniteBiPredicate in ScanQuery (sketched after this list)
      Time taken: ~1.5 sec
      Inserts are quite fast (1 000 000 documents = 1 000 000 records = 35 min)

2. - SQL:
      Table structure - [id(PK)], [doc_id], [key], [val] (all columns indexed)
      One record for each field (8 records per document)
      Using a nested JDBC JOIN query
      Time taken: ~20 ms (PostgreSQL comparison = ~10 ms)
      Inserts are very slow (1 000 000 documents = 8 000 000 records = 10 hours)
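
For reference, the ScanQuery approach in (1) looks roughly like this (a
minimal sketch; the cache name and the substring-matching predicate are
only illustrative - a real predicate would parse the JSON):

import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ScanQuery;
import org.apache.ignite.lang.IgniteBiPredicate;

public class ScanExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, String> cache = ignite.getOrCreateCache("transactions");
            // The predicate is shipped to every node and evaluated against
            // every entry, which is why the scan takes ~1.5 sec over
            // 1 000 000 documents.
            IgniteBiPredicate<Long, String> filter = (key, json) ->
                json.contains("\"lastname\": \"Klebes\"")
                    && json.contains("\"date\": \"2018-12-01\"");
            for (Cache.Entry<Long, String> e : cache.query(new ScanQuery<>(filter)).getAll())
                System.out.println(e.getValue());
        }
    }
}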

Is there another strategy that I should consider? Or any techniques that I
can use to optimise the queries?

Thanks.




Re: Optimum storage strategy (Advice?)

Posted by joseheitor <jo...@heitorprojects.com>.
Thanks for the reply and questions, Evgenii. Here are some additional
details (long-winded... sorry), to give you some context on the usage.

The reasons for not using a flat table structure are as follows:

1. - A real-world retail transaction has a much larger and more complex
structure than my simple example test-case. A single transaction can
have many simple and complex nested elements with varying structures
that hold one or more records for terminal, store, item-sales,
item-returns, discounts, rewards, tenders, taxes, customer, salesperson,
etc., etc...

2. - The transaction structure can be (and generally is) extended over time,
without impacting the data store or the code.

3. - If we can attain reasonable performance with this 'meta' key-value
storage structure, then we can tailor the selection of individual
elements that need to be queried to each customer's requirements, based
on their need and appetite for big-data analytics. This could be
achieved purely through configuration (a list of the keys used in
queries), without requiring custom development. For example:

A) Customer with basic financial reporting needs: we simply store a few
records as follows...

pk=1  id=2195  k='trans.body'  v='{[full JSON transaction]}'
pk=2  id=2195  k='trans.date'  v='2018-12-01'

B) Customer with larger appetite for complex reporting and analytics: we
would then store several records for a number of matching keys required for
querying as follows...

pk=1  id=2195  k='trans.body'  v='{[full JSON transaction]}'
pk=2  id=2195  k='trans.date'  v='2018-12-01'
pk=3  id=2195  k='trans.item.sale[].sku'  v='6005120257329'
pk=4  id=2195  k='trans.item.sale[].sku'  v='2460058410533'
pk=5  id=2195  k='trans.cust.lastname'  v='Smith'
pk=6  id=2195  k='trans.tender.type.card.amount'  v='2595'
pk=7  id=2195  k='trans.amount'  v='2595'
pk=8  id=2195  k='trans.tax.type0.amount'  v='189'
pk=9  id=2195  k='trans.tax.type1.amount'  v='0'
pk=10  id=2195  k='trans.store.id'  v='109574'
... (etc)

The table structure (with all indexes) would be as follows:

CREATE TABLE public.test_data (
  pk UUID, id UUID, k VARCHAR, v VARCHAR, PRIMARY KEY (pk, id))
  WITH "TEMPLATE=PARTITIONED, BACKUPS=1, ATOMICITY=TRANSACTIONAL,
  WRITE_SYNCHRONIZATION_MODE=FULL_SYNC, AFFINITY_KEY=id";

CREATE INDEX test_data_id_k_v ON public.test_data (id, k, v);
CREATE INDEX test_data_k_id_v ON public.test_data (k, id, v);
CREATE INDEX test_data_k_v_id ON public.test_data (k, v, id);
 
CREATE INDEX test_data_id ON public.test_data (id);
CREATE INDEX test_data_k ON public.test_data (k);
CREATE INDEX test_data_v ON public.test_data (v);
CREATE INDEX test_data_pk ON public.test_data (pk);
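
For reference, the kind of join query we run against this layout looks
roughly like the following (a sketch only; values are illustrative, and
an equivalent nested/subquery form is also possible):

SELECT body.v
  FROM public.test_data body
  JOIN public.test_data d ON d.id = body.id
       AND d.k = 'trans.date' AND d.v = '2018-12-01'
  JOIN public.test_data n ON n.id = body.id
       AND n.k = 'trans.cust.lastname' AND n.v = 'Klebes'
  JOIN public.test_data a ON a.id = body.id
       AND a.k = 'trans.amount' AND CAST(a.v AS BIGINT) > 1000
 WHERE body.k = 'trans.body';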

Inserting data through plain JDBC INSERT operations is worryingly slow.
Although the docs for the JDBC Client Driver specify that setting
'streaming=true' in the connection string will enable streaming, I have
found that although no errors occur, the table remains empty. (I don't
know whether I am doing something wrong, or whether it is a bug.) Do you
know of any examples of doing this?
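
For reference, my load code looks roughly like this (a minimal sketch;
the config path and sample values are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class StreamingLoad {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.ignite.IgniteJdbcDriver");
        // streaming=true routes the INSERTs through a data streamer
        try (Connection conn = DriverManager.getConnection(
                "jdbc:ignite:cfg://streaming=true@file:///etc/ignite/client.xml");
             PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO public.test_data (pk, id, k, v) VALUES (?, ?, ?, ?)")) {
            UUID docId = UUID.randomUUID();
            ps.setObject(1, UUID.randomUUID());
            ps.setObject(2, docId);
            ps.setString(3, "trans.date");
            ps.setString(4, "2018-12-01");
            ps.executeUpdate();
        } // streamed rows are buffered and flushed periodically and on
          // close, so the table can look empty if queried too soon
    }
}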

Many thanks,
Jose




Re: Optimum storage strategy (Advice?)

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Hi,

You mentioned that all fields are indexed; does this mean that you have
only single-field indexes? Have you considered creating group indexes
for your queries? Can you share your query plans with us?

If you have proper group indexes for all queries, it may be worth
avoiding joins altogether and storing everything in one table with a
flat structure.
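
For example, something like this (a sketch only; the column names are
just an illustration based on your sample document, and the group index
matches your sample query):

CREATE TABLE public.transactions_flat (
  id UUID PRIMARY KEY, trans_date DATE, lastname VARCHAR,
  amount BIGINT, body VARCHAR)
  WITH "TEMPLATE=PARTITIONED, BACKUPS=1";

CREATE INDEX flat_date_lastname_amount
  ON public.transactions_flat (trans_date, lastname, amount);

SELECT body FROM public.transactions_flat
 WHERE trans_date = '2018-12-01' AND lastname = 'Klebes'
   AND amount > 1000;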

As for loading, that is far too long in my opinion. How do you insert
the data? As Ilya mentioned, it is worth enabling streaming mode
(https://apacheignite-sql.readme.io/docs/set) and, if you use
persistence, disabling the WAL for the duration of the data load.
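
Something like this (a sketch; the cache name assumes the default
SQL_<SCHEMA>_<TABLE> naming convention for SQL tables):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class WalToggle {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            ignite.cluster().disableWal("SQL_PUBLIC_TEST_DATA");
            // ... run the bulk load here ...
            ignite.cluster().enableWal("SQL_PUBLIC_TEST_DATA");
        }
    }
}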

Best Regards,
Evgenii

Re: Optimum storage strategy (Advice?)

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I think you should be using Streaming Mode or the Data Streamer to
insert data for SQL, which you already seem to be doing.
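
A minimal sketch of the Data Streamer approach (cache name and value
format are illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerLoad {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("transactions");
            // addData() batches entries and ships them to the owning
            // nodes in bulk; close() flushes the remaining buffer
            try (IgniteDataStreamer<Long, String> streamer =
                     ignite.dataStreamer("transactions")) {
                for (long i = 0; i < 1_000_000; i++)
                    streamer.addData(i, "{\"trans\": {}}");
            }
        }
    }
}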

Regards,
-- 
Ilya Kasnacheev

