Posted to user@predictionio.apache.org by "Huang, Weiguang" <we...@intel.com> on 2017/11/28 01:35:17 UTC

Data lost from HBase to DataSource

Hi guys,

I have encoded some JPEG images as JSON and imported them into HBase, which shows 6500 records. When I read that data in my DataSource with PIO, however, only some 1500 records were fed into PIO.
I use PEventStore.find(appName, entityType, eventNames), and all the records have the same entityType and eventNames.

Any idea what could go wrong? The encoded string from a JPEG is very long, hundreds of thousands of characters; could this be a reason for the data loss?
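For a sense of scale (my illustration, not from the original mail): Base64 encodes every 3 raw bytes as 4 output characters, so an encoded string of a few hundred thousand characters corresponds to a JPEG of roughly 200 KB. A minimal Scala sketch of the arithmetic:

```scala
// Base64 size arithmetic: every 3 raw bytes become 4 encoded characters.
def base64EncodedLength(rawBytes: Int): Int = 4 * ((rawBytes + 2) / 3)

// Inverse (ignoring padding): encoded characters back to raw bytes.
def rawBytesFromEncoded(encodedChars: Int): Int = encodedChars / 4 * 3

// A 196,608-byte (192 KiB) JPEG becomes 262,144 Base64 characters.
println(base64EncodedLength(196608)) // 262144
// A 262,156-character encodedImage field is about 196 KB of raw image data.
println(rawBytesFromEncoded(262156)) // 196617
```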

Thank you for looking into my question.

Best,
Weiguang

RE: Data lost from HBase to DataSource

Posted by "Huang, Weiguang" <we...@intel.com>.
It looks like it was the length of the record that caused the data loss. We changed the image records to their file paths as the initial input to PIO, and the problem is gone.
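For anyone hitting the same limit, a sketch of the reworked event: store only a path in the properties instead of the inline Base64 payload (the "imagePath" field name and the path itself are my assumptions, not from the original mail):

```json
{"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties":
{"label": "n01484850", "imagePath": "hdfs://[host]:9000/images/10004.jpg"}}
```

The engine then reads the image bytes from the filesystem at training time, keeping each event small.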

Thank you all.

Best,
Weiguang 

-----Original Message-----
From: Huang, Weiguang [mailto:weiguang.huang@intel.com] 
Sent: Tuesday, December 5, 2017 1:41 PM
To: user@predictionio.apache.org
Subject: RE: Data lost from HBase to DataSource

Thanks Takako. I will have a try.

Best,
Weiguang

-----Original Message-----
From: takako shimamoto [mailto:chibochibo@gmail.com]
Sent: Tuesday, December 5, 2017 10:01 AM
To: user@predictionio.apache.org
Subject: Re: Data lost from HBase to DataSource

Which version of HBase are you using?
I guess the cause is that the libraries of the storage/hbase subproject are too old. If you are using HBase 1.2.6, running the assembly task against hbase-common, hbase-client and hbase-server
1.2.6 should work.
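A sketch of what that might look like in the storage/hbase subproject's build definition before re-running the assembly task (the exact module list and settings in PredictionIO's build may differ; the version pin follows the suggestion above):

```scala
// build.sbt fragment (sketch): pin the HBase client libraries to the
// version running on the cluster, then rebuild the storage assembly.
val hbaseVersion = "1.2.6"

libraryDependencies ++= Seq(
  "org.apache.hbase" % "hbase-common" % hbaseVersion,
  "org.apache.hbase" % "hbase-client" % hbaseVersion,
  "org.apache.hbase" % "hbase-server" % hbaseVersion
)
```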


2017-11-30 17:25 GMT+09:00 Huang, Weiguang <we...@intel.com>:
> Hi Pat,
>
>
>
> We have compared the format of 2 records, as attached, from the JSON
> file used for import. The first one was imported and successfully read in
> $pio train, as we printed its entityId in the logger; the other must
> not have been read into PIO successfully, as its entityId is absent
> from the logger. But the two records have the same JSON format, as
> every record was generated by the same program.
>
> And here is a quick illustration of a record in JSON, with "encodedImage"
> shortened from its actual 262,156 characters:
>
> {"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties":
> {"label": "n01484850", "encodedImage": "AAABAAA…..Oynz4="}}
>
> Only "entityId", "properties": {"label", "encodedImage"} could be 
> different among every record.
>
>
>
> We also noticed another weird thing. After the one-time $pio import
> of 6500 records, we ran $pio export immediately and got 399 + 399 = 798
> records in 2 exported files.
>
> As we ran $pio train for a couple of rounds, the number of records in
> PIO increased to 399 + 399 + 399 = 1197 across 3 exported files,
>
> and then to 399 + 399 + 399 + 399 = 1596 after more $pio train runs.
>
>
>
> Please see below the system log for $pio import. Everything
> seems all right.
>
> $pio import --appid 8 --input
> ../imageNetTemplate/data/imagenet_5_class_resized.json
>
>
>
> /opt/work/spark-2.1.1 is probably an Apache Spark development tree. 
> Please make sure you are using at least 1.3.0.
>
> SLF4J: Class path contains multiple SLF4J bindings.
>
> SLF4J: Found binding in
> [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-
> hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder
> .class]
>
> SLF4J: Found binding in
> [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.
> 11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
>
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>
> [INFO] [Runner$] Submission command: 
> /opt/work/spark-2.1.1/bin/spark-submit
> --class org.apache.predictionio.tools.imprt.FileToEvents --jars
> file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-
> assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incu
> bating/lib/spark/pio-data-localfs-assembly-0.11.0-incubating.jar,file:
> /opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-jdbc-assem
> bly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubatin
> g/lib/spark/pio-data-elasticsearch1-assembly-0.11.0-incubating.jar,fil
> e:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hbase-as
> sembly-0.11.0-incubating.jar
> --files
> file:/opt/work/PredictionIO-0.11.0-incubating/conf/log4j.properties,fi
> le:/opt/work/hbase-1.3.1/conf/hbase-site.xml
> --driver-class-path
> /opt/work/PredictionIO-0.11.0-incubating/conf:/opt/work/hbase-1.3.1/co
> nf --driver-java-options -Dpio.log.dir=/root
> file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-
> incubating.jar
> --appid 8 --input
> file:/opt/work/arda-data/pio-templates/dataImportTest/../imageNetTempl
> ate/data/imagenet_5_class_resized.json
> --env
> PIO_STORAGE_SOURCES_HBASE_TYPE=hbase,PIO_ENV_LOADED=1,PIO_STORAGE_SOUR
> CES_HBASE_HOSTS=Gondolin-Node-050,PIO_STORAGE_REPOSITORIES_METADATA_NA
> ME=pio_meta,PIO_VERSION=0.11.0,PIO_FS_BASEDIR=/root/.pio_store,PIO_STO
> RAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost,PIO_STORAGE_SOURCES_HBASE_H
> OME=/opt/work/hbase-1.3.1,PIO_HOME=/opt/work/PredictionIO-0.11.0-incub
> ating,PIO_FS_ENGINESDIR=/root/.pio_store/engines,PIO_STORAGE_SOURCES_L
> OCALFS_PATH=/root/.pio_store/models,PIO_STORAGE_SOURCES_HBASE_PORTS=16
> 000,PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch,PIO_STORAGE_R
> EPOSITORIES_METADATA_SOURCE=ELASTICSEARCH,PIO_STORAGE_REPOSITORIES_MOD
> ELDATA_SOURCE=LOCALFS,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_even
> t,PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=predictionio,PIO_STORA
> GE_SOURCES_ELASTICSEARCH_HOME=/opt/work/elasticsearch-1.7.6,PIO_FS_TMP
> DIR=/root/.pio_store/tmp,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_m
> odel,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE,PIO_CONF_DIR=/opt
> /work/PredictionIO-0.11.0-incubating/conf,PIO_STORAGE_SOURCES_ELASTICS
> EARCH_PORTS=9300,PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
>
> [INFO] [log] Logging initialized @4913ms
>
> [INFO] [Server] jetty-9.2.z-SNAPSHOT
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,AVAILABLE,@Spar
> k}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,AVAILABLE,@
> Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,AVAILABLE,@Sp
> ark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,AVAILABLE,@S
> park}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,AVAILAB
> LE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,AVAILABLE,@Sp
> ark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,AVAILABL
> E,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@11787b64{/storage,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,AVAILABLE,@S
> park}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,AVAILABLE,@Sp
> ark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,AVAILABL
> E,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@319642db{/environment,null,AVAILABLE,@Sp
> ark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,AVAILABL
> E,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,AVAILABLE,@Spar
> k}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,AVAILABLE,
> @Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,AVAI
> LABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null
> ,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@357bc488{/static,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@4ea17147{/,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,AVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,AVAILABLE,@
> Spark}
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,AVAILAB
> LE,@Spark}
>
> [INFO] [ServerConnector] Started
> Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
>
> [INFO] [Server] Started @5086ms
>
> [INFO] [ContextHandler] Started
> o.s.j.s.ServletContextHandler@4f114b{/metrics/json,null,AVAILABLE,@Spa
> rk}
>
> [INFO] [FileToEvents$] Events are imported.
>
> [INFO] [FileToEvents$] Done.
>
> [INFO] [ServerConnector] Stopped
> Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,UNAVAIL
> ABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,UNAVAILABLE
> ,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@4ea17147{/,null,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@357bc488{/static,null,UNAVAILABLE,@Spark
> }
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null
> ,UNAVAILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,UNAV
> AILABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,UNAVAILABL
> E,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,UNAVAILABLE,@Sp
> ark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,UNAVAILA
> BLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@319642db{/environment,null,UNAVAILABLE,@
> Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,UNAVAILA
> BLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,UNAVAILABLE,@
> Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,UNAVAILABLE,
> @Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@11787b64{/storage,null,UNAVAILABLE,@Spar
> k}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,UNAVAILA
> BLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,UNAVAILABLE,@
> Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,UNAVAIL
> ABLE,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,UNAVAILABLE,
> @Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,UNAVAILABLE,@
> Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,UNAVAILABLE,@Spark
> }
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,UNAVAILABLE
> ,@Spark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,UNAVAILABLE,@Spar
> k}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,UNAVAILABLE,@Sp
> ark}
>
> [INFO] [ContextHandler] Stopped
> o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,UNAVAILABLE,@Spark}
>
>
>
> Thanks for your advice.
>
>
>
> Weiguang
>
>
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Thursday, November 30, 2017 2:06 AM
> To: user@predictionio.apache.org
> Cc: Shi, Dongjie <do...@intel.com>
>
>
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> 1596 is how many events were accepted by the EventServer. Look at the
> exported format and compare it with the events you imported. There must be
> a formatting error or an error when importing (did you check the response
> for each event import?)
>
>
>
> Looking below I see you are importing JPEG??? This is almost always a
> bad idea. Image data is usually kept in a filesystem like HDFS, with a
> reference kept in the DB; there are too many serialization questions to
> do otherwise, in my experience. If your Engine requires this, you are
> asking for the kind of trouble you are seeing.
>
>
>
>
>
> On Nov 28, 2017, at 7:16 PM, Huang, Weiguang 
> <we...@intel.com>
> wrote:
>
>
>
> Hi Pat,
>
>
>
> Here is the result when we tried out your suggestion.
>
>
>
> We checked the data in HBase, and the record count is exactly
> the same as what we imported, that is 6500.
>
> 2017-11-29 10:42:19 INFO  DAGScheduler:54 - Job 0 finished: count at 
> ImageDataFromHBaseChecker.scala:27, took 12.016679 s
>
> Number of Records found : 6500
>
>
>
> We exported data from PIO and checked, but got only 1596 – see the
> bottom of the screen capture below.
>
> $ ls -al
>
> total 412212
>
> drwxr-xr-x  2 root root      4096 Nov 29 02:48 .
>
> drwxr-xr-x 23 root root      4096 Nov 29 02:48 ..
>
> -rw-r--r--  1 root root         8 Nov 29 02:48 ._SUCCESS.crc
>
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00000.crc
>
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00001.crc
>
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00002.crc
>
> -rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00003.crc
>
> -rw-r--r--  1 root root         0 Nov 29 02:48 _SUCCESS
>
> -rw-r--r--  1 root root 104699844 Nov 29 02:48 part-00000
>
> -rw-r--r--  1 root root 104699877 Nov 29 02:48 part-00001
>
> -rw-r--r--  1 root root 104699843 Nov 29 02:48 part-00002
>
> -rw-r--r--  1 root root 104699863 Nov 29 02:48 part-00003
>
> $ wc -l part-00000
>
> 399 part-00000
>
> $ wc -l part-00001
>
> 399 part-00001
>
> $ wc -l part-00002
>
> 399 part-00002
>
> $ wc -l part-00003
>
> 399 part-00003
>
> That is 399 * 4 = 1596
>
>
>
> Is this data loss caused by a schema change, bad data contents, or
> some other reason? We appreciate your thoughts.
>
>
>
> Thanks,
>
> Weiguang
>
>
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Wednesday, November 29, 2017 10:16 AM
> To: user@predictionio.apache.org
> Cc: user@predictionio.incubator.apache.org
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> Try my suggestion with export and see if the number of events looks correct.
> I am suggesting that you may not be counting what you think you are
> when using HBase directly.
>
>
>
>
>
> On Nov 28, 2017, at 5:53 PM, Huang, Weiguang 
> <we...@intel.com>
> wrote:
>
>
>
> Hi Pat,
>
>
>
> Thanks for your advice. However, we are not using HBase directly. We
> use pio to import data into HBase with the command below:
>
> pio import --appid 7 --input
> hdfs://[host]:9000/pio/applicationName/recordFile.json
>
> Could things go wrong here or somewhere else?
>
>
>
> Thanks,
>
> Weiguang
>
> From: Pat Ferrel [mailto:pat@occamsmachete.com]
> Sent: Tuesday, November 28, 2017 11:54 PM
> To: user@predictionio.apache.org
> Cc: user@predictionio.incubator.apache.org
> Subject: Re: Data lost from HBase to DataSource
>
>
>
> It is dangerous to use HBase directly because the schema may change at
> any time. Export the data as JSON and examine it there. To see how
> many events are in the stream, you can just export and then use bash to
> count lines (wc -l). Each line is a JSON event. Or import the data as
> a dataframe in Spark and use Spark SQL.
>
>
>
> There is no published contract about how events are stored in HBase.
>
>
>
>
>
> On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <sa...@gmail.com> wrote:
>
>
>
> We are also facing the exact same issue. We have confirmed 1.5 million 
> records in HBase. However, I see only 19k records being fed for 
> training (eventsRDD.count()).
>
>
> With Regards,
>
>
>
>      Sachin
>
> ⚜KTBFFH⚜
>
>
>
> On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang 
> <we...@intel.com>
> wrote:
>
> Hi guys,
>
>
>
> I have encoded some JPEG images as JSON and imported them into HBase,
> which shows 6500 records. When I read that data in my DataSource with
> PIO, however, only some 1500 records were fed into PIO.
>
> I use PEventStore.find(appName, entityType, eventNames), and all the
> records have the same entityType and eventNames.
>
>
>
> Any idea what could go wrong? The encoded string from a JPEG is very
> long, hundreds of thousands of characters; could this be a reason for
> the data loss?
>
>
>
> Thank you for looking into my question.
>
>
>
> Best,
>
> Weiguang
>
>

>
>
> Best,
>
> Weiguang
>
>


RE: Data lost from HBase to DataSource

Posted by "Huang, Weiguang" <we...@intel.com>.
Hi Pat,

We have compared the format of two records (attached) from the JSON file used for import. The first one was imported and read successfully in $pio train, as we printed its entityId in the logger; the other was apparently not read into pio, since its entityId is absent from the logger. But the two records have the same JSON format, as every record was generated by the same program.
Here is a quick illustration of a record in JSON, with "encodedImage" shortened from its actual 262,156 characters:
{"event": "imageNet", "entityId": 10004, "entityType": "JPEG", "properties": {"label": "n01484850", "encodedImage": "AAABAAA…..Oynz4="}}
Only "entityId" and the "properties" fields ("label" and "encodedImage") differ from record to record.
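Since every record was generated by the same program, one quick way to rule out per-record formatting problems before $pio import is to validate the file line by line. A minimal Python sketch (the helper name is ours and not part of pio; point it at the JSON file passed to $pio import):

```python
import json

REQUIRED = ("event", "entityId", "entityType", "properties")

def check_events_file(path):
    """Return (line number, reason) pairs for lines that fail to parse
    as JSON or lack a required event field."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                ev = json.loads(line)
            except json.JSONDecodeError as exc:
                bad.append((lineno, "parse error: %s" % exc))
                continue
            missing = [k for k in REQUIRED if k not in ev]
            if missing:
                bad.append((lineno, "missing fields: %s" % missing))
    return bad
```

Any line reported here would be a candidate for an event that gets rejected on import.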

We also noticed another weird thing. After the one-time $pio import of 6500 records, we ran $pio export immediately and got 399 + 399 = 798 records in 2 exported files.
After running $pio train a couple of times, the number of records in pio increased to 399 + 399 + 399 = 1197 in 3 exported files,
and then to 399 + 399 + 399 + 399 = 1596 after further $pio train runs.

Please see below the system log for $pio import. Everything there seems all right.
$pio import --appid 8 --input ../imageNetTemplate/data/imagenet_5_class_resized.json

/opt/work/spark-2.1.1 is probably an Apache Spark development tree. Please make sure you are using at least 1.3.0.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[INFO] [Runner$] Submission command: /opt/work/spark-2.1.1/bin/spark-submit --class org.apache.predictionio.tools.imprt.FileToEvents --jars file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-localfs-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-jdbc-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-elasticsearch1-assembly-0.11.0-incubating.jar,file:/opt/work/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hbase-assembly-0.11.0-incubating.jar --files file:/opt/work/PredictionIO-0.11.0-incubating/conf/log4j.properties,file:/opt/work/hbase-1.3.1/conf/hbase-site.xml --driver-class-path /opt/work/PredictionIO-0.11.0-incubating/conf:/opt/work/hbase-1.3.1/conf --driver-java-options -Dpio.log.dir=/root file:/opt/work/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar --appid 8 --input file:/opt/work/arda-data/pio-templates/dataImportTest/../imageNetTemplate/data/imagenet_5_class_resized.json --env 
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase,PIO_ENV_LOADED=1,PIO_STORAGE_SOURCES_HBASE_HOSTS=Gondolin-Node-050,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_VERSION=0.11.0,PIO_FS_BASEDIR=/root/.pio_store,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost,PIO_STORAGE_SOURCES_HBASE_HOME=/opt/work/hbase-1.3.1,PIO_HOME=/opt/work/PredictionIO-0.11.0-incubating,PIO_FS_ENGINESDIR=/root/.pio_store/engines,PIO_STORAGE_SOURCES_LOCALFS_PATH=/root/.pio_store/models,PIO_STORAGE_SOURCES_HBASE_PORTS=16000,PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=predictionio,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/opt/work/elasticsearch-1.7.6,PIO_FS_TMPDIR=/root/.pio_store/tmp,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE,PIO_CONF_DIR=/opt/work/PredictionIO-0.11.0-incubating/conf,PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
[INFO] [log] Logging initialized @4913ms
[INFO] [Server] jetty-9.2.z-SNAPSHOT
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@11787b64{/storage,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@319642db{/environment,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@357bc488{/static,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@4ea17147{/,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,AVAILABLE,@Spark}
[INFO] [ServerConnector] Started Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
[INFO] [Server] Started @5086ms
[INFO] [ContextHandler] Started o.s.j.s.ServletContextHandler@4f114b{/metrics/json,null,AVAILABLE,@Spark}
[INFO] [FileToEvents$] Events are imported.
[INFO] [FileToEvents$] Done.
[INFO] [ServerConnector] Stopped Spark@16d07cf3{HTTP/1.1}{0.0.0.0:4040}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@309dcdf3{/stages/stage/kill,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@5ba90d8a{/jobs/job/kill,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@2eda4eeb{/api,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@4ea17147{/,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@357bc488{/static,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@7d0100ea{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@68b11545{/executors/threadDump,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@6b321262{/executors/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@35bfa1bb{/executors,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@59498d94{/environment/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@319642db{/environment,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@6367a688{/storage/rdd/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@77b3752b{/storage/rdd,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@5707f613{/storage/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@11787b64{/storage,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@62aeddc8{/stages/pool/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@713a35c5{/stages/pool,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@100ad67e{/stages/stage/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@22da2fe6{/stages/stage,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@7ecda95b{/stages/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@6ad1701a{/stages,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@519c6fcc{/jobs/job/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@f5a7226{/jobs/job,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@432af457{/jobs/json,null,UNAVAILABLE,@Spark}
[INFO] [ContextHandler] Stopped o.s.j.s.ServletContextHandler@6d6ac396{/jobs,null,UNAVAILABLE,@Spark}

Thanks for your advice.

Weiguang

From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, November 30, 2017 2:06 AM
To: user@predictionio.apache.org
Cc: Shi, Dongjie <do...@intel.com>
Subject: Re: Data lost from HBase to DataSource

1596 is how many events were accepted by the EventServer. Look at the exported format and compare it with the events you imported. There must be a formatting error, or an error when importing (did you check the response for each event import?)

Looking below I see you are importing JPEG??? This is almost always a bad idea. Image data is usually kept in a filesystem like HDFS, with a reference kept in the DB; there are too many serialization questions to do otherwise, in my experience. If your Engine requires this, you are asking for the kind of trouble you are seeing.
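To illustrate the reference-based approach Pat describes: keep the JPEG in HDFS and put only its path in the event properties. A hypothetical sketch (the "imagePath" property name and the HDFS URL are made up for illustration; pio does not prescribe either):

```python
import json

def image_event(entity_id, label, image_path):
    """Build a pio-style event that references the image by path
    instead of embedding the base64-encoded bytes."""
    return {
        "event": "imageNet",
        "entityId": entity_id,
        "entityType": "JPEG",
        "properties": {
            "label": label,
            # Reference only; the actual JPEG stays in HDFS.
            "imagePath": image_path,
        },
    }

# One line of the import file is now a few hundred bytes
# instead of a quarter of a megabyte.
line = json.dumps(image_event(10004, "n01484850",
                              "hdfs://namenode:9000/images/n01484850/10004.jpg"))
```

The DataSource would then read the images from HDFS at training time, using the stored paths.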


On Nov 28, 2017, at 7:16 PM, Huang, Weiguang <we...@intel.com> wrote:

Hi Pat,

Here is the result when we tried out your suggestion.

We checked the data in HBase, and the count of records is exactly the same as what we imported, that is, 6500.
2017-11-29 10:42:19 INFO  DAGScheduler:54 - Job 0 finished: count at ImageDataFromHBaseChecker.scala:27, took 12.016679 s
Number of Records found : 6500

We exported the data from Pio and checked, but got only 1596 records; see the bottom of the screen record below.
$ ls -al
total 412212
drwxr-xr-x  2 root root      4096 Nov 29 02:48 .
drwxr-xr-x 23 root root      4096 Nov 29 02:48 ..
-rw-r--r--  1 root root         8 Nov 29 02:48 ._SUCCESS.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00000.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00001.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00002.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00003.crc
-rw-r--r--  1 root root         0 Nov 29 02:48 _SUCCESS
-rw-r--r--  1 root root 104699844 Nov 29 02:48 part-00000
-rw-r--r--  1 root root 104699877 Nov 29 02:48 part-00001
-rw-r--r--  1 root root 104699843 Nov 29 02:48 part-00002
-rw-r--r--  1 root root 104699863 Nov 29 02:48 part-00003
$ wc -l part-00000
399 part-00000
$ wc -l part-00001
399 part-00001
$ wc -l part-00002
399 part-00002
$ wc -l part-00003
399 part-00003
That is 399 * 4 = 1596
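As a sanity check, the part-file sizes above are consistent with the exported lines being full-size records rather than truncated ones:

```python
# Sizes and line counts taken from the `ls -al` and `wc -l` output above.
sizes = [104699844, 104699877, 104699843, 104699863]
lines = [399, 399, 399, 399]

total_events = sum(lines)
avg_bytes_per_event = sum(sizes) / total_events

# Roughly 262 KB per exported line, matching a record whose
# "encodedImage" alone is about 262,156 characters.
assert total_events == 1596
assert 260_000 < avg_bytes_per_event < 265_000
```

So the events that did survive do not appear truncated; the others are missing entirely.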

Is this data loss caused by a schema change, by bad data contents, or by something else? We would appreciate your thoughts.

Thanks,
Weiguang

From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Wednesday, November 29, 2017 10:16 AM
To: user@predictionio.apache.org
Cc: user@predictionio.incubator.apache.org
Subject: Re: Data lost from HBase to DataSource

Try my suggestion with export and see if the number of events looks correct. I am suggesting that you may not be counting what you think you are when using HBase directly.


On Nov 28, 2017, at 5:53 PM, Huang, Weiguang <we...@intel.com> wrote:

Hi Pat,

Thanks for your advice. However, we are not using HBase directly. We use pio to import data into HBase with the command below:
pio import --appid 7 --input hdfs://[host]:9000/pio/ applicationName /recordFile.json
Could things go wrong here or somewhere else?

Thanks,
Weiguang
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Tuesday, November 28, 2017 11:54 PM
To: user@predictionio.apache.org
Cc: user@predictionio.incubator.apache.org
Subject: Re: Data lost from HBase to DataSource

It is dangerous to use HBase directly because the schema may change at any time. Export the data as JSON and examine it there. To see how many events are in the stream you can just export and then use bash to count lines (wc -l); each line is a JSON event. Or import the data as a dataframe in Spark and use Spark SQL.
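For anyone without Spark at hand, a rough Python equivalent of the export-and-count check (the function name is ours; it assumes, as above, that each non-empty line of a pio export part-file is one JSON event):

```python
import json

def count_events(paths):
    """Count exported events across part-files, tallied by event name."""
    total = 0
    by_event = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                name = json.loads(line).get("event", "?")
                total += 1
                by_event[name] = by_event.get(name, 0) + 1
    return total, by_event
```

Comparing this total with the count reported directly from HBase shows how many events actually survived import.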

There is no published contract about how events are stored in HBase.


On Nov 27, 2017, at 9:24 PM, Sachin Kamkar <sa...@gmail.com> wrote:

We are also facing the exact same issue. We have confirmed 1.5 million records in HBase. However, I see only 19k records being fed for training (eventsRDD.count()).

With Regards,

     Sachin
⚜KTBFFH⚜

On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <we...@intel.com>> wrote:
Hi guys,

I have encoded some JPEG images in json and imported to HBase, which shows 6500 records. When I read those data in DataSource with Pio, however only some 1500 records were fed in PIO.
I use PEventStore.find(appName, entityType, eventNames), and all the records have  the same entityType, eventNames.

Any idea what could go wrong? The encoded string from JPEG is very wrong, hundreds of thousands of characters, could this be a reason for the data lost?

Thank you for looking into my question.

Best,
Weiguang


Re: Data lost from HBase to DataSource

Posted by Pat Ferrel <pa...@occamsmachete.com>.
1596 is how many events were accepted by the EventServer. Look at the exported format and compare it with the events you imported; there must be a formatting error or an error during import (did you check the responses for each event import?)

Looking below I see you are importing JPEG??? This is almost always a bad idea. Image data is usually kept in a filesystem like HDFS with a reference kept in the DB; there are too many serialization questions to do otherwise, in my experience. If your Engine requires this you are asking for the kind of trouble you are seeing.
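The reference-not-bytes approach can be sketched as a batch-import line. This follows the PredictionIO batch-import event shape (`event`, `entityType`, `entityId`, `properties`), but the `"image"` entity type and `"imagePath"` property name are assumptions for illustration, not anything your engine necessarily expects:

```python
import json

def image_event(entity_id: str, hdfs_path: str) -> str:
    """One batch-import line: a reference to the image, not the encoded bytes."""
    event = {
        "event": "$set",
        "entityType": "image",       # hypothetical entity type
        "entityId": entity_id,
        "properties": {"imagePath": hdfs_path},  # reference, not raw JPEG data
    }
    return json.dumps(event)

line = image_event("img-001", "hdfs://host:9000/images/img-001.jpg")
```

Each such line stays a few hundred bytes instead of hundreds of thousands of characters of encoded JPEG.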




RE: Data lost from HBase to DataSource

Posted by "Huang, Weiguang" <we...@intel.com>.
Hi Pat,

Here is the result when we tried out your suggestion.

We checked the data in HBase, and the record count is exactly what we imported, that is, 6500.
2017-11-29 10:42:19 INFO  DAGScheduler:54 - Job 0 finished: count at ImageDataFromHBaseChecker.scala:27, took 12.016679 s
Number of Records found : 6500

We exported the data from PIO and checked, but got only 1596; see the bottom of the terminal output below.
$ ls -al
total 412212
drwxr-xr-x  2 root root      4096 Nov 29 02:48 .
drwxr-xr-x 23 root root      4096 Nov 29 02:48 ..
-rw-r--r--  1 root root         8 Nov 29 02:48 ._SUCCESS.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00000.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00001.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00002.crc
-rw-r--r--  1 root root    817976 Nov 29 02:48 .part-00003.crc
-rw-r--r--  1 root root         0 Nov 29 02:48 _SUCCESS
-rw-r--r--  1 root root 104699844 Nov 29 02:48 part-00000
-rw-r--r--  1 root root 104699877 Nov 29 02:48 part-00001
-rw-r--r--  1 root root 104699843 Nov 29 02:48 part-00002
-rw-r--r--  1 root root 104699863 Nov 29 02:48 part-00003
$ wc -l part-00000
399 part-00000
$ wc -l part-00001
399 part-00001
$ wc -l part-00002
399 part-00002
$ wc -l part-00003
399 part-00003
That is 399 * 4 = 1596

Is this data loss caused by a schema change, by bad data contents, or by something else? We appreciate your thoughts.

Thanks,
Weiguang



RE: Data lost from HBase to DataSource

Posted by "Huang, Weiguang" <we...@intel.com>.
Pat,

OK. Thanks. We will try it.

Best,
Weiguang



Re: Data lost from HBase to DataSource

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Try my suggestion with export and see if the number of events looks correct. I am suggesting that you may not be counting what you think you are when querying HBase directly.




RE: Data lost from HBase to DataSource

Posted by "Huang, Weiguang" <we...@intel.com>.
Hi Pat,

Thanks for your advice. However, we are not using HBase directly. We use pio to import data into HBase with the command below:
pio import --appid 7 --input hdfs://[host]:9000/pio/applicationName/recordFile.json
Could things go wrong here or somewhere else?
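One way to rule out input problems before blaming the import itself: check that every line of the input file is valid JSON and flag unusually long lines. A sketch; the 500,000-character threshold is an arbitrary assumption, and the file path is yours:

```python
import json

def check_import_file(path: str, max_len: int = 500_000):
    """Return (bad_json_lines, overlong_lines) as lists of 1-based line
    numbers, since pio import expects one valid JSON event per line."""
    bad, overlong = [], []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            if len(line) > max_len:
                overlong.append(lineno)
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
    return bad, overlong
```

If the overlong list is non-empty, very large records (such as base64-encoded images) are a plausible culprit for silently dropped events.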

Thanks,
Weiguang



Re: Data lost from HBase to DataSource

Posted by Pat Ferrel <pa...@occamsmachete.com>.
It is dangerous to use HBase directly because the schema may change at any time. Export the data as JSON and examine it there. To see how many events are in the stream you can just export them and use bash to count lines (wc -l). Each line is a JSON event. Or import the data as a dataframe in Spark and use Spark SQL.

There is no published contract about how events are stored in HBase.
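Where a Spark shell isn't handy, the same sanity check works with plain Python on the exported file, for example counting events per event name. A sketch, assuming one JSON event per line with an "event" field, as pio export produces:

```python
import json
from collections import Counter

def events_by_name(path: str) -> Counter:
    """Tally exported events by their "event" field, one JSON event per line."""
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)["event"]] += 1
    return counts
```

Comparing these per-event totals against what you imported narrows down which events went missing.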



Re: Data lost from HBase to DataSource

Posted by Sachin Kamkar <sa...@gmail.com>.
We are also facing the exact same issue. We have confirmed 1.5 million
records in HBase. However, I see only 19k records being fed for training
(eventsRDD.count()).

With Regards,

     Sachin
⚜KTBFFH⚜

On Tue, Nov 28, 2017 at 7:05 AM, Huang, Weiguang <we...@intel.com> wrote:

> Hi guys,
>
> I have encoded some JPEG images in JSON and imported them to HBase, which shows 6500 records. When I read those data in DataSource with PIO, however, only some 1500 records were fed into PIO.
> I use PEventStore.find(appName, entityType, eventNames), and all the records have the same entityType and eventNames.
>
> Any idea what could go wrong? The encoded string from a JPEG is very long (hundreds of thousands of characters); could this be a reason for the data loss?
>
> Thank you for looking into my question.
>
> Best,
> Weiguang