Posted to user@predictionio.apache.org by Abhimanyu Nagrath <ab...@gmail.com> on 2017/11/20 05:38:27 UTC

Total number of events in PredictionIO is showing less than the actual events

Hi, I am new to PredictionIO v0.12.0 (Elasticsearch 5.2.1, HBase 1.2.6,
Spark 2.6.0), running on hardware with 244 GB RAM and 32 cores. I have
uploaded roughly 1 million events (each containing 30k features). While
uploading I could see the HBase disk usage increasing, and after all the
events were uploaded the HBase disk size was 567 GB. To verify, I ran the
following commands:

 - pio-shell --with-spark --conf spark.network.timeout=10000000 \
     --driver-memory 30G --executor-memory 21G --num-executors 7 \
     --executor-cores 3 --conf spark.driver.maxResultSize=4g \
     --conf spark.executor.heartbeatInterval=10000000
 - import org.apache.predictionio.data.store.PEventStore
 - val eventsRDD = PEventStore.find(appName="test")(sc)
 - val c = eventsRDD.count()

It reports the event count as 18944.

After that, using the script through which I uploaded the events, I randomly
queried by event ID, and each event I queried was returned.

I don't know how to make sure that all the events I uploaded are present in
the app. Any help is appreciated.
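One way to make that spot check systematic, sketched below: keep the list of event IDs the upload script generated, collect the IDs the event store reports back (e.g. dumped from the eventsRDD above), and diff the two sets. The helper and the sample IDs here are hypothetical, just to illustrate the reconciliation.

```python
# Hypothetical reconciliation helper: compare the IDs logged by the upload
# script against the IDs the event store reports back, and list what is missing.
def missing_event_ids(uploaded_ids, stored_ids):
    """Return the uploaded IDs that are absent from the event store, sorted."""
    return sorted(set(uploaded_ids) - set(stored_ids))

if __name__ == "__main__":
    uploaded = ["ev-001", "ev-002", "ev-003", "ev-004"]  # from the upload script's log
    stored = ["ev-001", "ev-003"]                        # e.g. dumped from eventsRDD
    print(missing_event_ids(uploaded, stored))  # ['ev-002', 'ev-004']
```

Randomly querying a few IDs only proves those particular events exist; a full set difference tells you exactly which events (if any) were dropped.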


Regards,
Abhimanyu

Re: Total number of events in PredictionIO is showing less than the actual events

Posted by Abhimanyu Nagrath <ab...@gmail.com>.
Hi Pat,

I don't think HBase TTL is the issue, because:

   1. I added the data 1 day ago.
   2. I have a similar server running with 1.5 million events, each having 6k
   features, whose data is 10 days old, and it is working fine.

Regards,
Abhimanyu


Re: Total number of events in PredictionIO is showing less than the actual events

Posted by Pat Ferrel <pa...@occamsmachete.com>.
My vague recollection is that HBase may mark things for removal but wait for certain operations before they are compacted. If this is the case I’m sure there is a way to get the correct count so this may be a question for the HBase list.
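If that recollection is right, one way to test it is to force a major compaction and then re-count. A sketch, assuming the event table is named 'pio_event:events' (verify with `list` in the HBase shell first); the small parser below just pulls the number out of the shell's `count` output so the before/after counts can be compared programmatically:

```python
# Sketch for checking the deleted-but-not-yet-compacted hypothesis.
# The shell session this would parse (table name is an assumption):
#
#   hbase shell <<'EOF'
#   major_compact 'pio_event:events'
#   count 'pio_event:events', INTERVAL => 100000
#   EOF
import re

def parse_hbase_count(output):
    """Extract the row count from `count` output like '1500000 row(s) in 312.4 seconds'."""
    m = re.search(r"(\d+)\s+row", output)
    return int(m.group(1)) if m else None

print(parse_hbase_count("1500000 row(s) in 312.45 seconds"))  # 1500000
```

If the row count drops sharply after the major compaction, the discrepancy was tombstoned data awaiting compaction rather than missing events.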


Re: Total number of events in PredictionIO is showing less than the actual events

Posted by Abhimanyu Nagrath <ab...@gmail.com>.
Done the same as you have mentioned, but the problem still persists.




Regards,
Abhimanyu


Re: Total number of events in PredictionIO is showing less than the actual events

Posted by Abhimanyu Nagrath <ab...@gmail.com>.
But when I run the command "count 'pio_event:events'" in the HBase shell, it
shows me all the rows: 1.5 million.


Re: Total number of events in PredictionIO is showing less than the actual events

Posted by Александр Лактионов <lo...@gmail.com>.
Hi Abhimanyu,

try setting a TTL for the rows in your HBase table.
It can be set in the hbase shell:
	alter 'pio_event:events_?', NAME => 'e', TTL => <seconds to live>
and then run the following in the shell:
	major_compact 'pio_event:events_?'

You can configure automatic major compaction: it will delete all rows that are older than the TTL.
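Note that the TTL value HBase expects is in seconds. A trivial (hypothetical) helper to fill in <seconds to live> without unit mistakes:

```python
# HBase TTL is specified in seconds; convert from days to avoid unit mistakes.
def ttl_seconds(days):
    return days * 24 * 60 * 60

# e.g. keep events for 30 days:
print(ttl_seconds(30))  # 2592000
```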



Re: Total number of events in PredictionIO is showing less than the actual events

Posted by Abhimanyu Nagrath <ab...@gmail.com>.
Hi,

I am stuck at this point. How can I identify the problem?


Regards,
Abhimanyu
