Posted to user@hive.apache.org by no jihun <je...@gmail.com> on 2016/05/14 11:23:24 UTC

clustered bucket and tablesample

Hello.

I want to ask about the correct way to use bucketing and TABLESAMPLE.

There is a table X which I created by

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

Then I inserted 500M rows into X with:

set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE X SELECT * FROM X_RAW;

Then I want to count or search some rows with a condition; roughly:

SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'

But I should use TABLESAMPLE, since X is clustered on (action_id, classifier).
So the better query should be:

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'

Is anything wrong above? I cannot find any performance gain between
these two queries.
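
For reference, a quick way to check whether the sampling clause actually
prunes is EXPLAIN; a sketch against the table above. If the plan still shows
a full TableScan with a hash(...) filter predicate, the sample is being
applied as a row filter rather than as bucket pruning:

EXPLAIN SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb';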

query 1 and RESULT (with no tablesample):

SELECT COUNT(*) FROM X
WHERE action_id='aaa' AND classifier='bbb'

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35 s
--------------------------------------------------------------------------------
It scans full data.

query 2 and RESULT:

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.82 s
--------------------------------------------------------------------------------
It ALSO scans full data.

query 2 RESULT THAT I EXPECTED.

The result I expected is something like this
(using 1 map, and relatively faster than without tablesample):
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.xx s
--------------------------------------------------------------------------------

Values of action_id and classifier are well distributed and there is no
skewed data.

So I want to ask: what is the correct query to prune and target a
specific bucket by multiple columns?
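
For reference, Hive assigns a row to bucket (hash of the clustering
columns & Integer.MAX_VALUE) % numBuckets, and TABLESAMPLE buckets are
numbered from 1. A sketch of computing the bucket a key pair should land
in, assuming the built-in hash() UDF matches the bucketing hash for string
columns (pmod() agrees with the masked modulo here because 256 is a power
of two):

SELECT pmod(hash('aaa', 'bbb'), 256) + 1 AS expected_bucket;

-- N is a placeholder for the number returned above
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET N OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb';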

Re: clustered bucket and tablesample

Posted by no jihun <je...@gmail.com>.
Thank you so much Talebzadeh.

I have filed an issue: https://issues.apache.org/jira/browse/HIVE-13766


And I changed the clustering column to an int, based on a hash of the
string columns. Something like:


import java.math.BigInteger;
import java.security.MessageDigest;

String actionClassifier = "my_action,cl_900";

MessageDigest md = MessageDigest.getInstance("MD5");
byte[] byteData = md.digest(actionClassifier.getBytes());
BigInteger bigInteger = new BigInteger(byteData);

return Math.abs(bigInteger.intValue()); // <---- cluster will use this value.
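
A sketch of the DDL this implies (the table name X2 and column name
bucket_key are illustrative, not from the original):

CREATE TABLE X2 (action_id string, classifier string, bucket_key int)
CLUSTERED BY (bucket_key) INTO 256 BUCKETS
STORED AS ORC;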


Thanks again.


On 16 May 2016 at 05:55, "Mich Talebzadeh" <mi...@gmail.com> wrote:

> Hi,
>
> OK I tried your table.
>
> 0: jdbc:hive2://rhes564:10010/default> describe formatted bucket_x;
> OK
>
> +-------------------------------+-----------------------------------------------------------+-----------------------------+--+
> |           col_name            |                         data_type                         |           comment           |
> +-------------------------------+-----------------------------------------------------------+-----------------------------+--+
> | # col_name                    | data_type                                                 | comment                     |
> |                               | NULL                                                      | NULL                        |
> | classifier                    | string                                                    |                             |
> |                               | NULL                                                      | NULL                        |
> | # Detailed Table Information  | NULL                                                      | NULL                        |
> | Database:                     | test                                                      | NULL                        |
> | Owner:                        | hduser                                                    | NULL                        |
> | CreateTime:                   | Sun May 15 19:10:52 BST 2016                              | NULL                        |
> | LastAccessTime:               | UNKNOWN                                                   | NULL                        |
> | Retention:                    | 0                                                         | NULL                        |
> | Location:                     | hdfs://rhes564:9000/user/hive/warehouse/test.db/bucket_x  | NULL                        |
> | Table Type:                   | MANAGED_TABLE                                             | NULL                        |
> | Table Parameters:             | NULL                                                      | NULL                        |
> |                               | COLUMN_STATS_ACCURATE                                     | {\"BASIC_STATS\":\"true\"}  |
> |                               | numFiles                                                  | 256                         |
> |                               | numRows                                                   | 1000                        |
> |                               | rawDataSize                                               | 93336                       |
> |                               | totalSize                                                 | 57025                       |
> |                               | transient_lastDdlTime                                     | 1463338829                  |
> |                               | NULL                                                      | NULL                        |
> | # Storage Information         | NULL                                                      | NULL                        |
> | SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                 | NULL                        |
> | InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat           | NULL                        |
> | OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat          | NULL                        |
> | Compressed:                   | No                                                        | NULL                        |
> | Num Buckets:                  | 256                                                       | NULL                        |
> | Bucket Columns:               | [classifier]                                              | NULL                        |
> | Sort Columns:                 | []                                                        | NULL                        |
> | Storage Desc Params:          | NULL                                                      | NULL                        |
> |                               | serialization.format                                      | 1                           |
> +-------------------------------+-----------------------------------------------------------+-----------------------------+--+
>
>
> select count(1) from bucket_x;
> INFO  : OK
> +-------+--+
> |  c0   |
> +-------+--+
> | 1000  |
>
> Now let us try and iterate through all those 256 buckets with the
> following. Create the script ${IN_FILE} and run it against Hive. In my
> case it is pretty fast as I use Hive on the Spark engine:
>
> function genrandomnumber {
> integer BUCKETNUMBER=1
> integer BUCKETS=256
> while ((BUCKETNUMBER <= BUCKETS))
> do
>    echo "SELECT ${BUCKETNUMBER} AS BucketNumber, COUNT(1) AS Occurance
> FROM bucket_x tablesample(BUCKET ${BUCKETNUMBER} OUT of ${BUCKETS} ON
> classifier='cl_900');" >> ${IN_FILE}
>    ((BUCKETNUMBER = BUCKETNUMBER + 1))
> done
> }
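>
> One way to run the generated file, assuming beeline as in the session
> above:
>
> beeline -u jdbc:hive2://rhes564:10010/default -f ${IN_FILE}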
>
> And the results don't make sense!
>
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 1             | 999        |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 2             | 1          |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 3             | 0          |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 4             | 0          |
>
> So apparently bucket 1 has 999 entries and bucket 2 has 1.
>
> Let us try it for classifier='cl_103'
>
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 1             | 999        |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 2             | 1          |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 3             | 0          |
> +---------------+------------+--+
> The same crap. To me hash partitioning on a string column is
> unpredictable. With an integer it is fine. I believe there is an
> underlying bug in here. The other alternative is to use an integer as a
> surrogate column for hash partitioning, like a sequence in Oracle or an
> identity in Sybase/MSSQL.
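>
> A sketch of the surrogate idea in HiveQL (the table name bucket_y and the
> row_number() approach are assumptions, just to illustrate):
>
> CREATE TABLE bucket_y (id int, classifier string)
> CLUSTERED BY (id) INTO 256 BUCKETS
> STORED AS ORC;
>
> INSERT OVERWRITE TABLE bucket_y
> SELECT row_number() OVER (ORDER BY classifier) AS id, classifier
> FROM bucket_x;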
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 May 2016 at 12:29, no jihun <je...@gmail.com> wrote:
>
>> OK Talebzadeh thanks.
>>
>> Have you ever tried tablesample with a string value hash?
>> Something like 'tablesample(bucket 1 out of 256 on
>> some_field='somedata....')'
>>
>>
>> I wrote a full scenario.
>>
>> # table creating
>> Time taken: 0.155 seconds, Fetched: 36 row(s)
>> hive> CREATE TABLE `bucket_x` (
>>     >   `classifier` string)
>>     > CLUSTERED BY ( classifier)
>>     > INTO 256 BUCKETS
>>     > STORED AS ORC;
>> OK
>>
>>
>> # check option before data insert
>> hive> set hive.enforce.bucketing;
>> hive.enforce.bucketing=true
>>
>>
>> # insert 1,000 rows
>> (the number of distinct values is also 1,000)
>>
>> hive > insert into bucket_x
>> values ('cl_0'),('cl_1'),('cl_2'),('cl_3'),('cl_4'),('cl_5'),('cl_6'),('cl_7'),('cl_8'),('cl_9'),('cl_10'),('cl_11'),('cl_12'),('cl_13'),('cl_14'),('cl_15'),('cl_16'),('cl_17'),('cl_18'),('cl_19'),('cl_20'),('cl_21'),('cl_22'),('cl_23'),('cl_24'),('cl_25'),('cl_26'),('cl_27'),('cl_28'),('cl_29'),('cl_30'),('cl_31'),('cl_32'),('cl_33'),('cl_34'),('cl_35'),('cl_36'),('cl_37'),('cl_38'),('cl_39'),('cl_40'),('cl_41'),('cl_42'),('cl_43'),('cl_44'),('cl_45'),('cl_46'),('cl_47'),('cl_48'),('cl_49'),('cl_50'),('cl_51'),('cl_52'),('cl_53'),('cl_54'),('cl_55'),('cl_56'),('cl_57'),('cl_58'),('cl_59'),('cl_60'),('cl_61'),('cl_62'),('cl_63'),('cl_64'),('cl_65'),('cl_66'),('cl_67'),('cl_68'),('cl_69'),('cl_70'),('cl_71'),('cl_72'),('cl_73'),('cl_74'),('cl_75'),('cl_76'),('cl_77'),('cl_78'),('cl_79'),('cl_80'),('cl_81'),('cl_82'),('cl_83'),('cl_84'),('cl_85'),('cl_86'),('cl_87'),('cl_88'),('cl_89'),('cl_90'),('cl_91'),('cl_92'),('cl_93'),('cl_94'),('cl_95'),('cl_96'),('cl_97'),('cl_98'),('cl_99'),('cl_100'),('cl_101'),('cl_102'),('cl_103'),('cl_104'),('cl_105'),('cl_106'),('cl_107'),('cl_108'),('cl_109'),('cl_110'),('cl_111'),('cl_112'),('cl_113'),('cl_114'),('cl_115'),('cl_116'),('cl_117'),('cl_118'),('cl_119'),('cl_120'),('cl_121'),('cl_122'),('cl_123'),('cl_124'),('cl_125'),('cl_126'),('cl_127'),('cl_128'),('cl_129'),('cl_130'),('cl_131'),('cl_132'),('cl_133'),('cl_134'),('cl_135'),('cl_136'),('cl_137'),('cl_138'),('cl_139'),('cl_140'),('cl_141'),('cl_142'),('cl_143'),('cl_144'),('cl_145'),('cl_146'),('cl_147'),('cl_148'),('cl_149'),('cl_150'),('cl_151'),('cl_152'),('cl_153'),('cl_154'),('cl_155'),('cl_156'),('cl_157'),('cl_158'),('cl_159'),('cl_160'),('cl_161'),('cl_162'),('cl_163'),('cl_164'),('cl_165'),('cl_166'),('cl_167'),('cl_168'),('cl_169'),('cl_170'),('cl_171'),('cl_172'),('cl_173'),('cl_174'),('cl_175'),('cl_176'),('cl_177'),('cl_178'),('cl_179'),('cl_180'),('cl_181'),('cl_182'),('cl_183'),('cl_184'),('cl_185'),('cl_186'),('cl_187'),('cl_188'),('cl_189'),('cl_190'),('cl_191'),('cl_192'),('cl_193'),('cl_194'),('cl_195'),('cl_196'),('cl_197'),('cl_198'),('cl_199'),('cl_200'),('cl_201'),('cl_202'),('cl_203'),('cl_204'),('cl_205'),('cl_206'),('cl_207'),('cl_208'),('cl_209'),('cl_210'),('cl_211'),('cl_212'),('cl_213'),('cl_214'),('cl_215'),('cl_216'),('cl_217'),('cl_218'),('cl_219'),('cl_220'),('cl_221'),('cl_222'),('cl_223'),('cl_224'),('cl_225'),('cl_226'),('cl_227'),('cl_228'),('cl_229'),('cl_230'),('cl_231'),('cl_232'),('cl_233'),('cl_234'),('cl_235'),('cl_236'),('cl_237'),('cl_238'),('cl_239'),('cl_240'),('cl_241'),('cl_242'),('cl_243'),('cl_244'),('cl_245'),('cl_246'),('cl_247'),('cl_248'),('cl_249'),('cl_250'),('cl_251'),('cl_252'),('cl_253'),('cl_254'),('cl_255'),('cl_256'),('cl_257'),('cl_258'),('cl_259'),('cl_260'),('cl_261'),('cl_262'),('cl_263'),('cl_264'),('cl_265'),('cl_266'),('cl_267'),('cl_268'),('cl_269'),('cl_270'),('cl_271'),('cl_272'),('cl_273'),('cl_274'),('cl_275'),('cl_276'),('cl_277'),('cl_278'),('cl_279'),('cl_280'),('cl_281'),('cl_282'),('cl_283'),('cl_284'),('cl_285'),('cl_286'),('cl_287'),('cl_288'),('cl_289'),('cl_290'),('cl_291'),('cl_292'),('cl_293'),('cl_294'),('cl_295'),('cl_296'),('cl_297'),('cl_298'),('cl_299'),('cl_300'),('cl_301'),('cl_302'),('cl_303'),('cl_304'),('cl_305'),('cl_306'),('cl_307'),('cl_308'),('cl_309'),('cl_310'),('cl_311'),('cl_312'),('cl_313'),('cl_314'),('cl_315'),('cl_316'),('cl_317'),('cl_318'),('cl_319'),('cl_320'),('cl_321'),('cl_322'),('cl_323'),('cl_324'),('cl_325'),('cl_326'),('cl_327'),('cl_328'),('cl_329'),('cl_330'),('cl_331'),('c
l_332'),('cl_333'),('cl_334'),('cl_335'),('cl_336'),('cl_337'),('cl_338'),('cl_339'),('cl_340'),('cl_341'),('cl_342'),('cl_343'),('cl_344'),('cl_345'),('cl_346'),('cl_347'),('cl_348'),('cl_349'),('cl_350'),('cl_351'),('cl_352'),('cl_353'),('cl_354'),('cl_355'),('cl_356'),('cl_357'),('cl_358'),('cl_359'),('cl_360'),('cl_361'),('cl_362'),('cl_363'),('cl_364'),('cl_365'),('cl_366'),('cl_367'),('cl_368'),('cl_369'),('cl_370'),('cl_371'),('cl_372'),('cl_373'),('cl_374'),('cl_375'),('cl_376'),('cl_377'),('cl_378'),('cl_379'),('cl_380'),('cl_381'),('cl_382'),('cl_383'),('cl_384'),('cl_385'),('cl_386'),('cl_387'),('cl_388'),('cl_389'),('cl_390'),('cl_391'),('cl_392'),('cl_393'),('cl_394'),('cl_395'),('cl_396'),('cl_397'),('cl_398'),('cl_399'),('cl_400'),('cl_401'),('cl_402'),('cl_403'),('cl_404'),('cl_405'),('cl_406'),('cl_407'),('cl_408'),('cl_409'),('cl_410'),('cl_411'),('cl_412'),('cl_413'),('cl_414'),('cl_415'),('cl_416'),('cl_417'),('cl_418'),('cl_419'),('cl_420'),('cl_421'),('cl_422'),('cl_423'),('cl_424'),('cl_425'),('cl_426'),('cl_427'),('cl_428'),('cl_429'),('cl_430'),('cl_431'),('cl_432'),('cl_433'),('cl_434'),('cl_435'),('cl_436'),('cl_437'),('cl_438'),('cl_439'),('cl_440'),('cl_441'),('cl_442'),('cl_443'),('cl_444'),('cl_445'),('cl_446'),('cl_447'),('cl_448'),('cl_449'),('cl_450'),('cl_451'),('cl_452'),('cl_453'),('cl_454'),('cl_455'),('cl_456'),('cl_457'),('cl_458'),('cl_459'),('cl_460'),('cl_461'),('cl_462'),('cl_463'),('cl_464'),('cl_465'),('cl_466'),('cl_467'),('cl_468'),('cl_469'),('cl_470'),('cl_471'),('cl_472'),('cl_473'),('cl_474'),('cl_475'),('cl_476'),('cl_477'),('cl_478'),('cl_479'),('cl_480'),('cl_481'),('cl_482'),('cl_483'),('cl_484'),('cl_485'),('cl_486'),('cl_487'),('cl_488'),('cl_489'),('cl_490'),('cl_491'),('cl_492'),('cl_493'),('cl_494'),('cl_495'),('cl_496'),('cl_497'),('cl_498'),('cl_499'),('cl_500'),('cl_501'),('cl_502'),('cl_503'),('cl_504'),('cl_505'),('cl_506'),('cl_507'),('cl_508'),('cl_509'),('cl_510'),('cl_511'),('cl_512'),('cl_513'),('cl_514'),('cl_515'),('cl_516'),('cl_517'),('cl_518'),('cl_519'),('cl_520'),('cl_521'),('cl_522'),('cl_523'),('cl_524'),('cl_525'),('cl_526'),('cl_527'),('cl_528'),('cl_529'),('cl_530'),('cl_531'),('cl_532'),('cl_533'),('cl_534'),('cl_535'),('cl_536'),('cl_537'),('cl_538'),('cl_539'),('cl_540'),('cl_541'),('cl_542'),('cl_543'),('cl_544'),('cl_545'),('cl_546'),('cl_547'),('cl_548'),('cl_549'),('cl_550'),('cl_551'),('cl_552'),('cl_553'),('cl_554'),('cl_555'),('cl_556'),('cl_557'),('cl_558'),('cl_559'),('cl_560'),('cl_561'),('cl_562'),('cl_563'),('cl_564'),('cl_565'),('cl_566'),('cl_567'),('cl_568'),('cl_569'),('cl_570'),('cl_571'),('cl_572'),('cl_573'),('cl_574'),('cl_575'),('cl_576'),('cl_577'),('cl_578'),('cl_579'),('cl_580'),('cl_581'),('cl_582'),('cl_583'),('cl_584'),('cl_585'),('cl_586'),('cl_587'),('cl_588'),('cl_589'),('cl_590'),('cl_591'),('cl_592'),('cl_593'),('cl_594'),('cl_595'),('cl_596'),('cl_597'),('cl_598'),('cl_599'),('cl_600'),('cl_601'),('cl_602'),('cl_603'),('cl_604'),('cl_605'),('cl_606'),('cl_607'),('cl_608'),('cl_609'),('cl_610'),('cl_611'),('cl_612'),('cl_613'),('cl_614'),('cl_615'),('cl_616'),('cl_617'),('cl_618'),('cl_619'),('cl_620'),('cl_621'),('cl_622'),('cl_623'),('cl_624'),('cl_625'),('cl_626'),('cl_627'),('cl_628'),('cl_629'),('cl_630'),('cl_631'),('cl_632'),('cl_633'),('cl_634'),('cl_635'),('cl_636'),('cl_637'),('cl_638'),('cl_639'),('cl_640'),('cl_641'),('cl_642'),('cl_643'),('cl_644'),('cl_645'),('cl_646'),('cl_647'),('cl_648'),('cl_649'),('cl_650'),('cl_651'),('cl_652'),('cl_653'),('cl_654'),('cl_
655'),('cl_656'),('cl_657'),('cl_658'),('cl_659'),('cl_660'),('cl_661'),('cl_662'),('cl_663'),('cl_664'),('cl_665'),('cl_666'),('cl_667'),('cl_668'),('cl_669'),('cl_670'),('cl_671'),('cl_672'),('cl_673'),('cl_674'),('cl_675'),('cl_676'),('cl_677'),('cl_678'),('cl_679'),('cl_680'),('cl_681'),('cl_682'),('cl_683'),('cl_684'),('cl_685'),('cl_686'),('cl_687'),('cl_688'),('cl_689'),('cl_690'),('cl_691'),('cl_692'),('cl_693'),('cl_694'),('cl_695'),('cl_696'),('cl_697'),('cl_698'),('cl_699'),('cl_700'),('cl_701'),('cl_702'),('cl_703'),('cl_704'),('cl_705'),('cl_706'),('cl_707'),('cl_708'),('cl_709'),('cl_710'),('cl_711'),('cl_712'),('cl_713'),('cl_714'),('cl_715'),('cl_716'),('cl_717'),('cl_718'),('cl_719'),('cl_720'),('cl_721'),('cl_722'),('cl_723'),('cl_724'),('cl_725'),('cl_726'),('cl_727'),('cl_728'),('cl_729'),('cl_730'),('cl_731'),('cl_732'),('cl_733'),('cl_734'),('cl_735'),('cl_736'),('cl_737'),('cl_738'),('cl_739'),('cl_740'),('cl_741'),('cl_742'),('cl_743'),('cl_744'),('cl_745'),('cl_746'),('cl_747'),('cl_748'),('cl_749'),('cl_750'),('cl_751'),('cl_752'),('cl_753'),('cl_754'),('cl_755'),('cl_756'),('cl_757'),('cl_758'),('cl_759'),('cl_760'),('cl_761'),('cl_762'),('cl_763'),('cl_764'),('cl_765'),('cl_766'),('cl_767'),('cl_768'),('cl_769'),('cl_770'),('cl_771'),('cl_772'),('cl_773'),('cl_774'),('cl_775'),('cl_776'),('cl_777'),('cl_778'),('cl_779'),('cl_780'),('cl_781'),('cl_782'),('cl_783'),('cl_784'),('cl_785'),('cl_786'),('cl_787'),('cl_788'),('cl_789'),('cl_790'),('cl_791'),('cl_792'),('cl_793'),('cl_794'),('cl_795'),('cl_796'),('cl_797'),('cl_798'),('cl_799'),('cl_800'),('cl_801'),('cl_802'),('cl_803'),('cl_804'),('cl_805'),('cl_806'),('cl_807'),('cl_808'),('cl_809'),('cl_810'),('cl_811'),('cl_812'),('cl_813'),('cl_814'),('cl_815'),('cl_816'),('cl_817'),('cl_818'),('cl_819'),('cl_820'),('cl_821'),('cl_822'),('cl_823'),('cl_824'),('cl_825'),('cl_826'),('cl_827'),('cl_828'),('cl_829'),('cl_830'),('cl_831'),('cl_832'),('cl_833'),('cl_834'),('cl_835'),('cl_836'),('cl_837'),('cl_838'),('cl_839'),('cl_840'),('cl_841'),('cl_842'),('cl_843'),('cl_844'),('cl_845'),('cl_846'),('cl_847'),('cl_848'),('cl_849'),('cl_850'),('cl_851'),('cl_852'),('cl_853'),('cl_854'),('cl_855'),('cl_856'),('cl_857'),('cl_858'),('cl_859'),('cl_860'),('cl_861'),('cl_862'),('cl_863'),('cl_864'),('cl_865'),('cl_866'),('cl_867'),('cl_868'),('cl_869'),('cl_870'),('cl_871'),('cl_872'),('cl_873'),('cl_874'),('cl_875'),('cl_876'),('cl_877'),('cl_878'),('cl_879'),('cl_880'),('cl_881'),('cl_882'),('cl_883'),('cl_884'),('cl_885'),('cl_886'),('cl_887'),('cl_888'),('cl_889'),('cl_890'),('cl_891'),('cl_892'),('cl_893'),('cl_894'),('cl_895'),('cl_896'),('cl_897'),('cl_898'),('cl_899'),('cl_900'),('cl_901'),('cl_902'),('cl_903'),('cl_904'),('cl_905'),('cl_906'),('cl_907'),('cl_908'),('cl_909'),('cl_910'),('cl_911'),('cl_912'),('cl_913'),('cl_914'),('cl_915'),('cl_916'),('cl_917'),('cl_918'),('cl_919'),('cl_920'),('cl_921'),('cl_922'),('cl_923'),('cl_924'),('cl_925'),('cl_926'),('cl_927'),('cl_928'),('cl_929'),('cl_930'),('cl_931'),('cl_932'),('cl_933'),('cl_934'),('cl_935'),('cl_936'),('cl_937'),('cl_938'),('cl_939'),('cl_940'),('cl_941'),('cl_942'),('cl_943'),('cl_944'),('cl_945'),('cl_946'),('cl_947'),('cl_948'),('cl_949'),('cl_950'),('cl_951'),('cl_952'),('cl_953'),('cl_954'),('cl_955'),('cl_956'),('cl_957'),('cl_958'),('cl_959'),('cl_960'),('cl_961'),('cl_962'),('cl_963'),('cl_964'),('cl_965'),('cl_966'),('cl_967'),('cl_968'),('cl_969'),('cl_970'),('cl_971'),('cl_972'),('cl_973'),('cl_974'),('cl_975'),('cl_976'),('cl_977'),('cl_97
8'),('cl_979'),('cl_980'),('cl_981'),('cl_982'),('cl_983'),('cl_984'),('cl_985'),('cl_986'),('cl_987'),('cl_988'),('cl_989'),('cl_990'),('cl_991'),('cl_992'),('cl_993'),('cl_994'),('cl_995'),('cl_996'),('cl_997'),('cl_998'),('cl_999')
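>>
>> As an aside, a more compact way to generate the same 1,000 values is the
>> space()/split()/posexplode() idiom, as a sketch:
>>
>> insert into bucket_x
>> select concat('cl_', pe.i)               -- i runs from 0 to 999
>> from (select split(space(999), ' ') as arr) t
>> lateral view posexplode(t.arr) pe as i, v;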
>>
>>
>> # check for total count
>> hive> select count(*) from bucket_x;
>> OK
>> 1000 // <-- returned 1000. correct.
>>
>> # count for a specific classifier (without tablesample)
>> hive> select count(*) from bucket_x where classifier='cl_900';
>> Query ID = irteam_20160515201754_d381aff8-16ef-48be-b829-f1a01a530521
>> Total jobs = 1
>> Launching Job 1 out of 1
>>
>>
>> Status: Running (Executing on YARN cluster with App id
>> application_1462971998082_0025)
>>
>>
>> --------------------------------------------------------------------------------
>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>> --------------------------------------------------------------------------------
>> Map 1 ..........   SUCCEEDED      2          2        0        0       1       0
>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>> --------------------------------------------------------------------------------
>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.63 s
>> --------------------------------------------------------------------------------
>> OK
>> 1 // <-- returned 1. correct.
>> Time taken: 8.064 seconds, Fetched: 1 row(s)
>>
>>
>> # count for a specific classifier (with tablesample)
>>
>> hive> select count(*) from bucket_x tablesample(bucket 1 out of 256 on
>> classifier='cl_900') where classifier='cl_900';
>> Query ID = irteam_20160515201913_91166686-b98c-40a4-990b-690c41c69c61
>> Total jobs = 1
>> Launching Job 1 out of 1
>>
>>
>> Status: Running (Executing on YARN cluster with App id
>> application_1462971998082_0025)
>>
>>
>> --------------------------------------------------------------------------------
>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>> --------------------------------------------------------------------------------
>> Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>> --------------------------------------------------------------------------------
>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.81 s
>> --------------------------------------------------------------------------------
>> OK
>> 0 // <--- it returns zero. WRONG! It should return 1 row.
>> Time taken: 4.216 seconds, Fetched: 1 row(s)
>>
>> # so I checked the whole tablesampled data.
>> hive> select * from bucket_x tablesample(bucket 1 out of 256 on
>> classifier='cl_900');
>> OK
>> cl_974
>> cl_336
>> cl_457
>> ...
>> cl_852
>> cl_698
>> cl_731
>> Time taken: 0.053 seconds, Fetched: 999 row(s)
>> // <-- it returned 999 ROWS
>> // ??? 999 ???
>> // exactly everything except what I want to search.
>>
>>
>> I think I am doing tablesample totally wrong in the case of string values.
>> Any idea?
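>>
>> One check, as a sketch: compute which bucket 'cl_900' should hash to
>> (assuming the built-in hash() UDF matches the bucketing hash for strings)
>> and sample that bucket number instead of bucket 1:
>>
>> hive> select pmod(hash('cl_900'), 256) + 1;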
>>
>> 2016-05-15 2:56 GMT+09:00 Mich Talebzadeh <mi...@gmail.com>:
>>
>>> This is your code
>>>
>>> SELECT COUNT(*) FROM X
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>> WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> Well, I have a table dummy with 1 billion rows imported from Oracle in
>>> ORC format:
>>>
>>> hive> show create table dummy;
>>> OK
>>> CREATE TABLE `dummy`(
>>>   `id` int,
>>>   `clustered` int,
>>>   `scattered` int,
>>>   `randomised` int,
>>>   `random_string` varchar(50),
>>>   `small_vc` varchar(10),
>>>   `padding` varchar(10))
>>>
>>>
>>>
>>> CLUSTERED BY (id)
>>> INTO 256 BUCKETS
>>> ROW FORMAT SERDE
>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>> STORED AS INPUTFORMAT
>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>> OUTPUTFORMAT
>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>> LOCATION
>>>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
>>> TBLPROPERTIES (
>>>   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
>>>   'numFiles'='1',
>>>   'numRows'='100000000',
>>>   'orc.bloom.filter.columns'='ID',
>>>   'orc.bloom.filter.fpp'='0.05',
>>>   'orc.compress'='SNAPPY',
>>>   'orc.create.index'='true',
>>>   'orc.row.index.stride'='10000',
>>>   'orc.stripe.size'='16777216',
>>>   'rawDataSize'='0',
>>>   'totalSize'='5662644579',
>>>   'transient_lastDdlTime'='1463245925')
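>>>
>>> As an aside, since the table sets orc.bloom.filter.columns='ID', ORC
>>> predicate pushdown can also prune row groups for point lookups; a
>>> sketch, assuming the setting applies in your version:
>>>
>>> set hive.optimize.index.filter=true;
>>> SELECT COUNT(1) FROM dummy WHERE id = 20;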
>>>
>>>
>>> If I turn on EXPLAIN for the following two cases. First, a simple case:
>>>
>>> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>>>     > where id = 20;
>>> OK
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-0 depends on stages: Stage-1
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Spark
>>>       Edges:
>>>         Reducer 2 <- Map 1 (GROUP, 1)
>>>       DagName:
>>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
>>>       Vertices:
>>>         Map 1
>>>             Map Operator Tree:
>>>                 TableScan
>>>                   alias: dummy
>>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>>> Basic stats: COMPLETE Column stats: NONE
>>>                   Filter Operator
>>>                     predicate: (id = 20) (type: boolean)
>>>                     Statistics: Num rows: 50000000 Data size: 2831322368
>>> Basic stats: COMPLETE Column stats: NONE
>>>                     Select Operator
>>>                       Statistics: Num rows: 50000000 Data size:
>>> 2831322368 Basic stats: COMPLETE Column stats: NONE
>>>                       Group By Operator
>>>                         aggregations: count(1)
>>>                         mode: hash
>>>                         outputColumnNames: _col0
>>>                         Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                         Reduce Output Operator
>>>                           sort order:
>>>                           Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                           value expressions: _col0 (type: bigint)
>>>         Reducer 2
>>>             Reduce Operator Tree:
>>>               Group By Operator
>>>                 aggregations: count(VALUE._col0)
>>>                 mode: mergepartial
>>>                 outputColumnNames: _col0
>>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                 File Output Operator
>>>                   compressed: false
>>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>                       serde:
>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>       Processor Tree:
>>>         ListSink
>>> Time taken: 0.064 seconds, Fetched: 51 row(s)
>>>
>>>
>>> Now we try with tablesample, but assigning the predicate values inside
>>> the brackets as below, because you are looking in bucket 1 for those
>>> values and you want the optimizer to know that.
>>>
>>>
>>>
>>> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>>>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID = 10)
>>>     > ;
>>> OK
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-0 depends on stages: Stage-1
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Spark
>>>       Edges:
>>>         Reducer 2 <- Map 1 (GROUP, 1)
>>>       DagName:
>>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
>>>       Vertices:
>>>         Map 1
>>>             Map Operator Tree:
>>>                 TableScan
>>>                   alias: dummy
>>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>>> Basic stats: COMPLETE Column stats: NONE
>>>                   Filter Operator
>>>                     predicate: (((hash((id = 10)) & 2147483647) % 256) = 0) (type: boolean)
>>>                     Statistics: Num rows: 50000000 Data size: 2831322368
>>> Basic stats: COMPLETE Column stats: NONE
>>>                     Select Operator
>>>                       Statistics: Num rows: 50000000 Data size:
>>> 2831322368 Basic stats: COMPLETE Column stats: NONE
>>>                       Group By Operator
>>>                         aggregations: count(1)
>>>                         mode: hash
>>>                         outputColumnNames: _col0
>>>                         Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                         Reduce Output Operator
>>>                           sort order:
>>>                           Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                           value expressions: _col0 (type: bigint)
>>>         Reducer 2
>>>             Reduce Operator Tree:
>>>               Group By Operator
>>>                 aggregations: count(VALUE._col0)
>>>                 mode: mergepartial
>>>                 outputColumnNames: _col0
>>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                 File Output Operator
>>>                   compressed: false
>>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>                       serde:
>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>       Processor Tree:
>>>         ListSink
>>>
>>>
>>> Otherwise I don't see much happening
>>>
>>> hive>  EXPLAIN SELECT COUNT(1) FROM dummy
>>>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
>>>     > WHERE ID = 10;
>>> OK
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-0 depends on stages: Stage-1
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Spark
>>>       Edges:
>>>         Reducer 2 <- Map 1 (GROUP, 1)
>>>       DagName:
>>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
>>>       Vertices:
>>>         Map 1
>>>             Map Operator Tree:
>>>                 TableScan
>>>                   alias: dummy
>>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>>> Basic stats: COMPLETE Column stats: NONE
>>>                   Filter Operator
>>>                     predicate: (false and (id = 10)) (type: boolean)
>>>                     Statistics: Num rows: 25000000 Data size:
>>> 1415661184 Basic stats: COMPLETE Column stats: NONE
>>>                     Select Operator
>>>                       Statistics: Num rows: 25000000 Data size:
>>> 1415661184 Basic stats: COMPLETE Column stats: NONE
>>>                       Group By Operator
>>>                         aggregations: count(1)
>>>                         mode: hash
>>>                         outputColumnNames: _col0
>>>                         Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                         Reduce Output Operator
>>>                           sort order:
>>>                           Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                           value expressions: _col0 (type: bigint)
>>>         Reducer 2
>>>             Reduce Operator Tree:
>>>               Group By Operator
>>>                 aggregations: count(VALUE._col0)
>>>                 mode: mergepartial
>>>                 outputColumnNames: _col0
>>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                 File Output Operator
>>>                   compressed: false
>>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>                       serde:
>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>       Processor Tree:
>>>         ListSink
>>>
>>> In general, in my experience, bucketing is the only area where ORC
>>> tables come in handy.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 May 2016 at 13:38, no jihun <je...@gmail.com> wrote:
>>>
>>>> Ah, as I mentioned,
>>>> the field type of both action_id and classifier is STRING, and I cannot
>>>> change the type.
>>>>
>>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>>> STORED AS ORC
>>>>
>>>> I use two fields for hashing then bucketing because each field alone
>>>> is not so well distributed.
>>>>
>>>> My concern is not about the strength of the hash source but about how
>>>> I can tablesample down to a bucket by the field values provided in the
>>>> WHERE clause.
>>>>
>>>> When I cluster by string fields, which one is right for tablesample?
>>>> 1. provide fields
>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>
>>>> 2. provide values
>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  'aaa', 'bbb')
>>>> On 14 May 2016 at 20:48, "Mich Talebzadeh" <mi...@gmail.com> wrote:
>>>>
>>>>> Can action_id be created as a numeric column:
>>>>>
>>>>> CREATE TABLE X ( action_id bigint,  ..)
>>>>>
>>>>> Bucketing or hash partitioning works best on numeric columns with high
>>>>> cardinality (say, a primary key).
>>>>>
>>>>> From my old notes:
>>>>>
>>>>> Bucketing in Hive refers to hash partitioning where a hashing function
>>>>> is applied. Like an RDBMS such as Oracle, Hive will apply a linear hashing
>>>>> algorithm to prevent data from clustering within specific partitions.
>>>>> Hashing is very effective if the column selected for bucketing has very
>>>>> high selectivity like an ID column where selectivity (select
>>>>> count(distinct(column))/count(column) ) = 1.  In this case, the
>>>>> created partitions/ files will be as evenly sized as possible. In a
>>>>> nutshell bucketing is a method to get data evenly distributed over many
>>>>> partitions/files.  One should define the number of buckets by a power of
>>>>> two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
>>>>> will help concurrency in Hive. It may even allow a partition wise join i.e.
>>>>> a join between two tables that are bucketed on the same column with the
>>>>> same number of buckets (anyone has tried this?)
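>>>>>
>>>>> A sketch of enabling such a join (table names a and b are placeholders;
>>>>> this assumes both tables are bucketed, and for the sort-merge variant
>>>>> also sorted, on the join key into the same number of buckets):
>>>>>
>>>>> set hive.optimize.bucketmapjoin=true;
>>>>> set hive.optimize.bucketmapjoin.sortedmerge=true;
>>>>> SELECT /*+ MAPJOIN(b) */ a.*, b.* FROM a JOIN b ON a.id = b.id;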
>>>>>
>>>>>
>>>>>
>>>>> One more thing. When one defines the number of buckets at table
>>>>> creation level in Hive, the number of partitions/files will be fixed. In
>>>>> contrast, with partitioning you do not have this limitation.
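>>>>>
>>>>> For contrast, a sketch of the partitioned alternative (the ds column is
>>>>> illustrative); new partitions are simply added as data arrives:
>>>>>
>>>>> CREATE TABLE X_P (action_id string, classifier string)
>>>>> PARTITIONED BY (ds string)
>>>>> STORED AS ORC;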
>>>>>
>>>>> can you do
>>>>>
>>>>> show create table X
>>>>>
>>>>> and send the output, please.
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 14 May 2016 at 12:23, no jihun <je...@gmail.com> wrote:
>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I want to ask about the correct way to use bucketing and TABLESAMPLE.
>>>>>>
>>>>>> There is a table X which I created by
>>>>>>
>>>>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>>>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>>>>> STORED AS ORC
>>>>>>
>>>>>> Then I inserted 500M rows into X with:
>>>>>>
>>>>>> set hive.enforce.bucketing=true;
>>>>>> INSERT OVERWRITE TABLE X SELECT * FROM X_RAW;
>>>>>>
>>>>>> Then I want to count or search some rows with a condition; roughly:
>>>>>>
>>>>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>>>>
>>>>>> But I should use TABLESAMPLE, since X is clustered on (action_id,
>>>>>> classifier). So the better query should be:
>>>>>>
>>>>>> SELECT COUNT(*) FROM X
>>>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>>> WHERE action_id='aaa' AND classifier='bbb'
>>>>>>
>>>>>> Is anything wrong above? I cannot find any performance gain
>>>>>> between these two queries.
>>>>>>
>>>>>> query 1 and RESULT (with no tablesample):
>>>>>>
>>>>>> SELECT COUNT(*) FROM X
>>>>>> WHERE action_id='aaa' AND classifier='bbb'
>>>>>>
>>>>>> --------------------------------------------------------------------------------
>>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>>>>> --------------------------------------------------------------------------------
>>>>>> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
>>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>>>>> --------------------------------------------------------------------------------
>>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35 s
>>>>>> --------------------------------------------------------------------------------
>>>>>> It scans full data.
>>>>>>
>>>>>> query 2 and RESULT:
>>>>>>
>>>>>> SELECT COUNT(*) FROM X
>>>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>>>>>> WHERE action_id='aaa' AND classifier='bbb'
>>>>>>
>>>>>> --------------------------------------------------------------------------------
>>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>>>>> --------------------------------------------------------------------------------
>>>>>> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
>>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>>>>> --------------------------------------------------------------------------------
>>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.82 s
>>>>>> --------------------------------------------------------------------------------
>>>>>> It ALSO scans full data.
>>>>>>
>>>>>> query 2 RESULT THAT I EXPECTED.
>>>>>>
>>>>>> The result I expected is something like this
>>>>>> (using 1 map, and relatively faster than without tablesample):
>>>>>> --------------------------------------------------------------------------------
>>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>>>>> --------------------------------------------------------------------------------
>>>>>> Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
>>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>>>>> --------------------------------------------------------------------------------
>>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.xx s
>>>>>> --------------------------------------------------------------------------------
>>>>>>
>>>>>> Values of action_id and classifier are well distributed and there is
>>>>>> no skewed data.
>>>>>>
>>>>>> So I want to ask: what is the correct query to prune and target a
>>>>>> specific bucket by multiple columns?
>>>>>>
>>>>>
>>>>>
>>>
>>
>>
>> --
>> ----------------------------------------------
>> Jihun No ( 노지훈 )
>> ----------------------------------------------
>> Twitter          : @nozisim
>> Facebook       : nozisim
>> Website         : http://jeesim2.godohosting.com
>>
>> ---------------------------------------------------------------------------------
>> Market Apps   : android market products.
>> <https://market.android.com/developer?pub=%EB%85%B8%EC%A7%80%ED%9B%88>
>>
>
>

'),('cl_979'),('cl_980'),('cl_981'),('cl_982'),('cl_983'),('cl_984'),('cl_985'),('cl_986'),('cl_987'),('cl_988'),('cl_989'),('cl_990'),('cl_991'),('cl_992'),('cl_993'),('cl_994'),('cl_995'),('cl_996'),('cl_997'),('cl_998'),('cl_999')
>
>
> # check for total count
> *hive> select count(*) from bucket_x;*
> OK
> *1000 // <-- returned 1000. correct.*
>
> # check for count specific classifier (without tablesample)
> *hive> select count(*) from bucket_x where classifier='cl_900';*
> Query ID = irteam_20160515201754_d381aff8-16ef-48be-b829-f1a01a530521
> Total jobs = 1
> Launching Job 1 out of 1
>
>
> Status: Running (Executing on YARN cluster with App id
> application_1462971998082_0025)
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>  KILLED
>
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      2          2        0        0       1
>     0
> Reducer 2 ......   SUCCEEDED      1          1        0        0       0
>     0
>
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.63 s
>
>
> --------------------------------------------------------------------------------
> OK
> *1 // <-- returned 1. correct.*
> Time taken: 8.064 seconds, Fetched: 1 row(s)
>
>
> # count specific classifier (with tablesample)
>
> *hive> select count(*) from bucket_x tablesample(bucket 1 out of 256 on
> classifier='cl_900') where classifier='cl_900';*
> Query ID = irteam_20160515201913_91166686-b98c-40a4-990b-690c41c69c61
> Total jobs = 1
> Launching Job 1 out of 1
>
>
> Status: Running (Executing on YARN cluster with App id
> application_1462971998082_0025)
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>  KILLED
>
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      1          1        0        0       0
>     0
> Reducer 2 ......   SUCCEEDED      1          1        0        0       0
>     0
>
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.81 s
>
>
> --------------------------------------------------------------------------------
> OK
> *0 // <--- it returns Zero. WRONG! it should return 1 row.*
> Time taken: 4.216 seconds, Fetched: 1 row(s)
>
> # so I checked whole tablesampled data.
> *hive> select * from bucket_x tablesample(bucket 1 out of 256 on
> classifier='cl_900');*
> OK
> cl_974
> cl_336
> cl_457
> ...
> cl_852
> cl_698
> cl_731
> Time taken: 0.053 seconds, Fetched: 999 row(s)
> *// <-- it returned 999 ROWS*
> *// ??? 999 ???*
> *// every row except exactly the one I searched for.*
>
>
> I think I am doing tablesample totally wrong in the case of string values.
> Any idea?
>
> 2016-05-15 2:56 GMT+09:00 Mich Talebzadeh <mi...@gmail.com>:
>
>> This is your code
>>
>> SELECT COUNT(*) FROM X
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>> WHERE action_id='aaa' AND classifier='bbb'
>>
>> Well I have a table dummy with 1 billion rows imported from Oracle as ORC
>> format
>>
>> hive> show create table dummy;
>> OK
>> CREATE TABLE `dummy`(
>>   `id` int,
>>   `clustered` int,
>>   `scattered` int,
>>   `randomised` int,
>>   `random_string` varchar(50),
>>   `small_vc` varchar(10),
>>   `padding` varchar(10))
>>
>>
>>
>> CLUSTERED BY (id)
>> INTO 256 BUCKETS
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>> LOCATION
>>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
>> TBLPROPERTIES (
>>   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
>>   'numFiles'='1',
>>   'numRows'='100000000',
>>   'orc.bloom.filter.columns'='ID',
>>   'orc.bloom.filter.fpp'='0.05',
>>   'orc.compress'='SNAPPY',
>>   'orc.create.index'='true',
>>   'orc.row.index.stride'='10000',
>>   'orc.stripe.size'='16777216',
>>   'rawDataSize'='0',
>>   'totalSize'='5662644579',
>>   'transient_lastDdlTime'='1463245925')
>>
>>
>> If I turn on the plan for the following two cases. First a simple case
>>
>> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>>     > where id = 20;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Spark
>>       Edges:
>>         Reducer 2 <- Map 1 (GROUP, 1)
>>       DagName:
>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
>>       Vertices:
>>         Map 1
>>             Map Operator Tree:
>>                 TableScan
>>                   alias: dummy
>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>> Basic stats: COMPLETE Column stats: NONE
>>                   Filter Operator
>>                     predicate: (id = 20) (type: boolean)
>>                     Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
>>                     Select Operator
>>                       Statistics: Num rows: 50000000 Data size:
>> 2831322368 Basic stats: COMPLETE Column stats: NONE
>>                       Group By Operator
>>                         aggregations: count(1)
>>                         mode: hash
>>                         outputColumnNames: _col0
>>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                         Reduce Output Operator
>>                           sort order:
>>                           Statistics: Num rows: 1 Data size: 8 Basic
>> stats: COMPLETE Column stats: NONE
>>                           value expressions: _col0 (type: bigint)
>>         Reducer 2
>>             Reduce Operator Tree:
>>               Group By Operator
>>                 aggregations: count(VALUE._col0)
>>                 mode: mergepartial
>>                 outputColumnNames: _col0
>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                 File Output Operator
>>                   compressed: false
>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                   table:
>>                       input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>                       output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>                       serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>   Stage: Stage-0
>>     Fetch Operator
>>       limit: -1
>>       Processor Tree:
>>         ListSink
>> Time taken: 0.064 seconds, Fetched: 51 row(s)
>>
>>
>> Now we try tablesample, but assigning the predicate values inside the
>> brackets, because you are looking in bucket 1 for those values and you want
>> the optimizer to know that.
>>
>>
>>
>> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID = 10)
>>     > ;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Spark
>>       Edges:
>>         Reducer 2 <- Map 1 (GROUP, 1)
>>       DagName:
>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
>>       Vertices:
>>         Map 1
>>             Map Operator Tree:
>>                 TableScan
>>                   alias: dummy
>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>> Basic stats: COMPLETE Column stats: NONE
>>                   Filter Operator
>>                     predicate: (((hash((id = 10)) & 2147483647) % 256) = 0) (type: boolean)
>>                     Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
>>                     Select Operator
>>                       Statistics: Num rows: 50000000 Data size:
>> 2831322368 Basic stats: COMPLETE Column stats: NONE
>>                       Group By Operator
>>                         aggregations: count(1)
>>                         mode: hash
>>                         outputColumnNames: _col0
>>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                         Reduce Output Operator
>>                           sort order:
>>                           Statistics: Num rows: 1 Data size: 8 Basic
>> stats: COMPLETE Column stats: NONE
>>                           value expressions: _col0 (type: bigint)
>>         Reducer 2
>>             Reduce Operator Tree:
>>               Group By Operator
>>                 aggregations: count(VALUE._col0)
>>                 mode: mergepartial
>>                 outputColumnNames: _col0
>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                 File Output Operator
>>                   compressed: false
>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                   table:
>>                       input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>                       output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>                       serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>   Stage: Stage-0
>>     Fetch Operator
>>       limit: -1
>>       Processor Tree:
>>         ListSink
>>
>>
>> Otherwise I don't see much happening
>>
>> hive>  EXPLAIN SELECT COUNT(1) FROM dummy
>>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
>>     > WHERE ID = 10;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Spark
>>       Edges:
>>         Reducer 2 <- Map 1 (GROUP, 1)
>>       DagName:
>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
>>       Vertices:
>>         Map 1
>>             Map Operator Tree:
>>                 TableScan
>>                   alias: dummy
>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>> Basic stats: COMPLETE Column stats: NONE
>>                   Filter Operator
>>                     predicate: (false and (id = 10)) (type: boolean)
>>                     Statistics: Num rows: 25000000 Data size: 1415661184
>> Basic stats: COMPLETE Column stats: NONE
>>                     Select Operator
>>                       Statistics: Num rows: 25000000 Data size:
>> 1415661184 Basic stats: COMPLETE Column stats: NONE
>>                       Group By Operator
>>                         aggregations: count(1)
>>                         mode: hash
>>                         outputColumnNames: _col0
>>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                         Reduce Output Operator
>>                           sort order:
>>                           Statistics: Num rows: 1 Data size: 8 Basic
>> stats: COMPLETE Column stats: NONE
>>                           value expressions: _col0 (type: bigint)
>>         Reducer 2
>>             Reduce Operator Tree:
>>               Group By Operator
>>                 aggregations: count(VALUE._col0)
>>                 mode: mergepartial
>>                 outputColumnNames: _col0
>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                 File Output Operator
>>                   compressed: false
>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                   table:
>>                       input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>                       output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>                       serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>   Stage: Stage-0
>>     Fetch Operator
>>       limit: -1
>>       Processor Tree:
>>         ListSink
>>
>> In general, in my experience, bucketing is the one area where ORC tables
>> really come in handy.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 May 2016 at 13:38, no jihun <je...@gmail.com> wrote:
>>
>>> Ah, as I mentioned, both action_id and classifier are STRING fields, and I
>>> cannot change the type.
>>>
>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>> STORED AS ORC
>>>
>>> I hash and bucket on two fields because neither field on its own is
>>> well distributed.
>>>
>>> My concern is not about the strength of the hash source but about how I can
>>> tablesample down to a specific bucket using the field values provided in the
>>> WHERE clause.
>>>
>>> When I cluster by string fields, which form is right for tablesample?
>>> 1. provide fields
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>
>>> 2. provide values
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  'aaa', 'bbb')
>>> 2016. 5. 14. 오후 8:48에 "Mich Talebzadeh" <mi...@gmail.com>님이
>>> 작성:
>>>
Can action_id be created as a numeric column?
>>>>
>>>> CREATE TABLE X ( action_id bigint,  ..)
>>>>
>>>> Bucketing or hash partitioning best works on numeric columns with high
>>>> cardinality (say a primary key).
>>>>
>>>> From my old notes:
>>>>
>>>> Bucketing in Hive refers to hash partitioning where a hashing function
>>>> is applied. Like an RDBMS such as Oracle, Hive will apply a linear hashing
>>>> algorithm to prevent data from clustering within specific partitions.
>>>> Hashing is very effective if the column selected for bucketing has very
>>>> high selectivity like an ID column where selectivity (select
>>>> count(distinct(column))/count(column) ) = 1.  In this case, the
>>>> created partitions/ files will be as evenly sized as possible. In a
>>>> nutshell bucketing is a method to get data evenly distributed over many
>>>> partitions/files.  One should define the number of buckets by a power of
>>>> two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
>>>> will help concurrency in Hive. It may even allow a partition wise join i.e.
>>>> a join between two tables that are bucketed on the same column with the
>>>> same number of buckets (has anyone tried this?)
>>>>
>>>>
>>>>
>>>> One more thing. When one defines the number of buckets at table
>>>> creation level in Hive, the number of partitions/files will be fixed. In
>>>> contrast, with partitioning you do not have this limitation.
>>>>
>>>> Can you run
>>>>
>>>> show create table X
>>>>
>>>> and send the output, please?
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 May 2016 at 12:23, no jihun <je...@gmail.com> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> I want to ask the correct bucketing and tablesample way.
>>>>>
>>>>> There is a table X which I created by
>>>>>
>>>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>>>> STORED AS ORC
>>>>>
>>>>> Then I inserted 500M of rows into X by
>>>>>
>>>>> set hive.enforce.bucketing=true;
>>>>> INSERT OVERWRITE INTO X SELECT * FROM X_RAW
>>>>>
>>>>> Then I want to count or search some rows with condition. roughly,
>>>>>
>>>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>>>
>>>>> But I'd better to USE tablesample as I clustered X (action_id,
>>>>> classifier). So, the better query will be
>>>>>
>>>>> SELECT COUNT(*) FROM X
>>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>> WHERE action_id='aaa' AND classifier='bbb'
>>>>>
>>>>> Is there any wrong above? But I can't not find any performance gain
>>>>> between these two query.
>>>>>
>>>>> query1 and RESULT( with no tablesample.)
>>>>>
>>>>> SELECT COUNT(*)) from X
>>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>>
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>>> FAILED  KILLED
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>> Map 1 ..........   SUCCEEDED    256        256        0        0
>>>>> 0       0
>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>>> 0       0
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>>> 15.35 s
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>> It scans full data.
>>>>>
>>>>> query 2 and RESULT
>>>>>
>>>>> SELECT COUNT(*)) from X
>>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>>
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>>> FAILED  KILLED
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>> Map 1 ..........   SUCCEEDED    256        256        0        0
>>>>> 0       0
>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>>> 0       0
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>>> 15.82     s
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>> It ALSO scans full data.
>>>>>
>>>>> query 2 RESULT WHAT I EXPECTED.
>>>>>
>>>>> Result what I expected is something like...
>>>>> (use 1 map and relatively faster than without tabmesample)
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>>> FAILED  KILLED
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>> Map 1 ..........   SUCCEEDED      1          1        0        0
>>>>> 0       0
>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>>> 0       0
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>>> 3.xx     s
>>>>> ------------------------------------------------------------
>>>>> --------------------
>>>>>
>>>>> Values of action_id and classifier are well distributed and there is
>>>>> no skewed data.
>>>>>
>>>>> So I want to ask you what will be a correct query that prune and
>>>>> target specific bucket by multiple column?
>>>>>
>>>>
>>>>
>>
>
>
> --
> ----------------------------------------------
> Jihun No ( 노지훈 )
> ----------------------------------------------
> Twitter          : @nozisim
> Facebook       : nozisim
> Website         : http://jeesim2.godohosting.com
>
> ---------------------------------------------------------------------------------
> Market Apps   : android market products.
> <https://market.android.com/developer?pub=%EB%85%B8%EC%A7%80%ED%9B%88>
>

Re: clustered bucket and tablesample

Posted by no jihun <je...@gmail.com>.
OK, thanks Talebzadeh.

Have you ever tried tablesample with a hash of a string value?
Something like tablesample(bucket 1 out of 256 on
some_field='somedata....')


I wrote a full scenario.

# table creating
hive> CREATE TABLE `bucket_x` (
    >   `classifier` string)
    > CLUSTERED BY ( classifier)
    > INTO 256 BUCKETS
    > STORED AS ORC;
OK


*# check option before data insert;*
hive> set hive.enforce.bucketing;
hive.enforce.bucketing=true


*# insert 1,000 rows*
(the number of distinct values is also 1,000)

hive > insert into bucket_x
values ('cl_0'),('cl_1'),('cl_2'),('cl_3'),('cl_4'),('cl_5'),('cl_6'),('cl_7'),('cl_8'),('cl_9'),('cl_10'),('cl_11'),('cl_12'),('cl_13'),('cl_14'),('cl_15'),('cl_16'),('cl_17'),('cl_18'),('cl_19'),('cl_20'),('cl_21'),('cl_22'),('cl_23'),('cl_24'),('cl_25'),('cl_26'),('cl_27'),('cl_28'),('cl_29'),('cl_30'),('cl_31'),('cl_32'),('cl_33'),('cl_34'),('cl_35'),('cl_36'),('cl_37'),('cl_38'),('cl_39'),('cl_40'),('cl_41'),('cl_42'),('cl_43'),('cl_44'),('cl_45'),('cl_46'),('cl_47'),('cl_48'),('cl_49'),('cl_50'),('cl_51'),('cl_52'),('cl_53'),('cl_54'),('cl_55'),('cl_56'),('cl_57'),('cl_58'),('cl_59'),('cl_60'),('cl_61'),('cl_62'),('cl_63'),('cl_64'),('cl_65'),('cl_66'),('cl_67'),('cl_68'),('cl_69'),('cl_70'),('cl_71'),('cl_72'),('cl_73'),('cl_74'),('cl_75'),('cl_76'),('cl_77'),('cl_78'),('cl_79'),('cl_80'),('cl_81'),('cl_82'),('cl_83'),('cl_84'),('cl_85'),('cl_86'),('cl_87'),('cl_88'),('cl_89'),('cl_90'),('cl_91'),('cl_92'),('cl_93'),('cl_94'),('cl_95'),('cl_96'),('cl_97'),('cl_98'),('cl_99'),('cl_100'),('cl_101'),('cl_102'),('cl_103'),('cl_104'),('cl_105'),('cl_106'),('cl_107'),('cl_108'),('cl_109'),('cl_110'),('cl_111'),('cl_112'),('cl_113'),('cl_114'),('cl_115'),('cl_116'),('cl_117'),('cl_118'),('cl_119'),('cl_120'),('cl_121'),('cl_122'),('cl_123'),('cl_124'),('cl_125'),('cl_126'),('cl_127'),('cl_128'),('cl_129'),('cl_130'),('cl_131'),('cl_132'),('cl_133'),('cl_134'),('cl_135'),('cl_136'),('cl_137'),('cl_138'),('cl_139'),('cl_140'),('cl_141'),('cl_142'),('cl_143'),('cl_144'),('cl_145'),('cl_146'),('cl_147'),('cl_148'),('cl_149'),('cl_150'),('cl_151'),('cl_152'),('cl_153'),('cl_154'),('cl_155'),('cl_156'),('cl_157'),('cl_158'),('cl_159'),('cl_160'),('cl_161'),('cl_162'),('cl_163'),('cl_164'),('cl_165'),('cl_166'),('cl_167'),('cl_168'),('cl_169'),('cl_170'),('cl_171'),('cl_172'),('cl_173'),('cl_174'),('cl_175'),('cl_176'),('cl_177'),('cl_178'),('cl_179'),('cl_180'),('cl_181'),('cl_182'),('cl_183'),('cl_184'),('cl_185'),('cl_186'),('cl_187'),('cl_188'),('cl_189'),('cl_190'),('cl_191'),('cl_192'),('cl_193'),('cl_194'),('cl_195'),('cl_196'),('cl_197'),('cl_198'),('cl_199'),('cl_200'),('cl_201'),('cl_202'),('cl_203'),('cl_204'),('cl_205'),('cl_206'),('cl_207'),('cl_208'),('cl_209'),('cl_210'),('cl_211'),('cl_212'),('cl_213'),('cl_214'),('cl_215'),('cl_216'),('cl_217'),('cl_218'),('cl_219'),('cl_220'),('cl_221'),('cl_222'),('cl_223'),('cl_224'),('cl_225'),('cl_226'),('cl_227'),('cl_228'),('cl_229'),('cl_230'),('cl_231'),('cl_232'),('cl_233'),('cl_234'),('cl_235'),('cl_236'),('cl_237'),('cl_238'),('cl_239'),('cl_240'),('cl_241'),('cl_242'),('cl_243'),('cl_244'),('cl_245'),('cl_246'),('cl_247'),('cl_248'),('cl_249'),('cl_250'),('cl_251'),('cl_252'),('cl_253'),('cl_254'),('cl_255'),('cl_256'),('cl_257'),('cl_258'),('cl_259'),('cl_260'),('cl_261'),('cl_262'),('cl_263'),('cl_264'),('cl_265'),('cl_266'),('cl_267'),('cl_268'),('cl_269'),('cl_270'),('cl_271'),('cl_272'),('cl_273'),('cl_274'),('cl_275'),('cl_276'),('cl_277'),('cl_278'),('cl_279'),('cl_280'),('cl_281'),('cl_282'),('cl_283'),('cl_284'),('cl_285'),('cl_286'),('cl_287'),('cl_288'),('cl_289'),('cl_290'),('cl_291'),('cl_292'),('cl_293'),('cl_294'),('cl_295'),('cl_296'),('cl_297'),('cl_298'),('cl_299'),('cl_300'),('cl_301'),('cl_302'),('cl_303'),('cl_304'),('cl_305'),('cl_306'),('cl_307'),('cl_308'),('cl_309'),('cl_310'),('cl_311'),('cl_312'),('cl_313'),('cl_314'),('cl_315'),('cl_316'),('cl_317'),('cl_318'),('cl_319'),('cl_320'),('cl_321'),('cl_322'),('cl_323'),('cl_324'),('cl_325'),('cl_326'),('cl_327'),('cl_328'),('cl_329'),('cl_330'),('cl_331'),('cl_3
32'),('cl_333'),('cl_334'),('cl_335'),('cl_336'),('cl_337'),('cl_338'),('cl_339'),('cl_340'),('cl_341'),('cl_342'),('cl_343'),('cl_344'),('cl_345'),('cl_346'),('cl_347'),('cl_348'),('cl_349'),('cl_350'),('cl_351'),('cl_352'),('cl_353'),('cl_354'),('cl_355'),('cl_356'),('cl_357'),('cl_358'),('cl_359'),('cl_360'),('cl_361'),('cl_362'),('cl_363'),('cl_364'),('cl_365'),('cl_366'),('cl_367'),('cl_368'),('cl_369'),('cl_370'),('cl_371'),('cl_372'),('cl_373'),('cl_374'),('cl_375'),('cl_376'),('cl_377'),('cl_378'),('cl_379'),('cl_380'),('cl_381'),('cl_382'),('cl_383'),('cl_384'),('cl_385'),('cl_386'),('cl_387'),('cl_388'),('cl_389'),('cl_390'),('cl_391'),('cl_392'),('cl_393'),('cl_394'),('cl_395'),('cl_396'),('cl_397'),('cl_398'),('cl_399'),('cl_400'),('cl_401'),('cl_402'),('cl_403'),('cl_404'),('cl_405'),('cl_406'),('cl_407'),('cl_408'),('cl_409'),('cl_410'),('cl_411'),('cl_412'),('cl_413'),('cl_414'),('cl_415'),('cl_416'),('cl_417'),('cl_418'),('cl_419'),('cl_420'),('cl_421'),('cl_422'),('cl_423'),('cl_424'),('cl_425'),('cl_426'),('cl_427'),('cl_428'),('cl_429'),('cl_430'),('cl_431'),('cl_432'),('cl_433'),('cl_434'),('cl_435'),('cl_436'),('cl_437'),('cl_438'),('cl_439'),('cl_440'),('cl_441'),('cl_442'),('cl_443'),('cl_444'),('cl_445'),('cl_446'),('cl_447'),('cl_448'),('cl_449'),('cl_450'),('cl_451'),('cl_452'),('cl_453'),('cl_454'),('cl_455'),('cl_456'),('cl_457'),('cl_458'),('cl_459'),('cl_460'),('cl_461'),('cl_462'),('cl_463'),('cl_464'),('cl_465'),('cl_466'),('cl_467'),('cl_468'),('cl_469'),('cl_470'),('cl_471'),('cl_472'),('cl_473'),('cl_474'),('cl_475'),('cl_476'),('cl_477'),('cl_478'),('cl_479'),('cl_480'),('cl_481'),('cl_482'),('cl_483'),('cl_484'),('cl_485'),('cl_486'),('cl_487'),('cl_488'),('cl_489'),('cl_490'),('cl_491'),('cl_492'),('cl_493'),('cl_494'),('cl_495'),('cl_496'),('cl_497'),('cl_498'),('cl_499'),('cl_500'),('cl_501'),('cl_502'),('cl_503'),('cl_504'),('cl_505'),('cl_506'),('cl_507'),('cl_508'),('cl_509'),('cl_510'),('cl_511'),('cl_512'),('cl_513'),('cl_514'),('cl_515'),('cl_516'),('cl_517'),('cl_518'),('cl_519'),('cl_520'),('cl_521'),('cl_522'),('cl_523'),('cl_524'),('cl_525'),('cl_526'),('cl_527'),('cl_528'),('cl_529'),('cl_530'),('cl_531'),('cl_532'),('cl_533'),('cl_534'),('cl_535'),('cl_536'),('cl_537'),('cl_538'),('cl_539'),('cl_540'),('cl_541'),('cl_542'),('cl_543'),('cl_544'),('cl_545'),('cl_546'),('cl_547'),('cl_548'),('cl_549'),('cl_550'),('cl_551'),('cl_552'),('cl_553'),('cl_554'),('cl_555'),('cl_556'),('cl_557'),('cl_558'),('cl_559'),('cl_560'),('cl_561'),('cl_562'),('cl_563'),('cl_564'),('cl_565'),('cl_566'),('cl_567'),('cl_568'),('cl_569'),('cl_570'),('cl_571'),('cl_572'),('cl_573'),('cl_574'),('cl_575'),('cl_576'),('cl_577'),('cl_578'),('cl_579'),('cl_580'),('cl_581'),('cl_582'),('cl_583'),('cl_584'),('cl_585'),('cl_586'),('cl_587'),('cl_588'),('cl_589'),('cl_590'),('cl_591'),('cl_592'),('cl_593'),('cl_594'),('cl_595'),('cl_596'),('cl_597'),('cl_598'),('cl_599'),('cl_600'),('cl_601'),('cl_602'),('cl_603'),('cl_604'),('cl_605'),('cl_606'),('cl_607'),('cl_608'),('cl_609'),('cl_610'),('cl_611'),('cl_612'),('cl_613'),('cl_614'),('cl_615'),('cl_616'),('cl_617'),('cl_618'),('cl_619'),('cl_620'),('cl_621'),('cl_622'),('cl_623'),('cl_624'),('cl_625'),('cl_626'),('cl_627'),('cl_628'),('cl_629'),('cl_630'),('cl_631'),('cl_632'),('cl_633'),('cl_634'),('cl_635'),('cl_636'),('cl_637'),('cl_638'),('cl_639'),('cl_640'),('cl_641'),('cl_642'),('cl_643'),('cl_644'),('cl_645'),('cl_646'),('cl_647'),('cl_648'),('cl_649'),('cl_650'),('cl_651'),('cl_652'),('cl_653'),('cl_654'),('cl_655
'),('cl_656'),('cl_657'),('cl_658'),('cl_659'),('cl_660'),('cl_661'),('cl_662'),('cl_663'),('cl_664'),('cl_665'),('cl_666'),('cl_667'),('cl_668'),('cl_669'),('cl_670'),('cl_671'),('cl_672'),('cl_673'),('cl_674'),('cl_675'),('cl_676'),('cl_677'),('cl_678'),('cl_679'),('cl_680'),('cl_681'),('cl_682'),('cl_683'),('cl_684'),('cl_685'),('cl_686'),('cl_687'),('cl_688'),('cl_689'),('cl_690'),('cl_691'),('cl_692'),('cl_693'),('cl_694'),('cl_695'),('cl_696'),('cl_697'),('cl_698'),('cl_699'),('cl_700'),('cl_701'),('cl_702'),('cl_703'),('cl_704'),('cl_705'),('cl_706'),('cl_707'),('cl_708'),('cl_709'),('cl_710'),('cl_711'),('cl_712'),('cl_713'),('cl_714'),('cl_715'),('cl_716'),('cl_717'),('cl_718'),('cl_719'),('cl_720'),('cl_721'),('cl_722'),('cl_723'),('cl_724'),('cl_725'),('cl_726'),('cl_727'),('cl_728'),('cl_729'),('cl_730'),('cl_731'),('cl_732'),('cl_733'),('cl_734'),('cl_735'),('cl_736'),('cl_737'),('cl_738'),('cl_739'),('cl_740'),('cl_741'),('cl_742'),('cl_743'),('cl_744'),('cl_745'),('cl_746'),('cl_747'),('cl_748'),('cl_749'),('cl_750'),('cl_751'),('cl_752'),('cl_753'),('cl_754'),('cl_755'),('cl_756'),('cl_757'),('cl_758'),('cl_759'),('cl_760'),('cl_761'),('cl_762'),('cl_763'),('cl_764'),('cl_765'),('cl_766'),('cl_767'),('cl_768'),('cl_769'),('cl_770'),('cl_771'),('cl_772'),('cl_773'),('cl_774'),('cl_775'),('cl_776'),('cl_777'),('cl_778'),('cl_779'),('cl_780'),('cl_781'),('cl_782'),('cl_783'),('cl_784'),('cl_785'),('cl_786'),('cl_787'),('cl_788'),('cl_789'),('cl_790'),('cl_791'),('cl_792'),('cl_793'),('cl_794'),('cl_795'),('cl_796'),('cl_797'),('cl_798'),('cl_799'),('cl_800'),('cl_801'),('cl_802'),('cl_803'),('cl_804'),('cl_805'),('cl_806'),('cl_807'),('cl_808'),('cl_809'),('cl_810'),('cl_811'),('cl_812'),('cl_813'),('cl_814'),('cl_815'),('cl_816'),('cl_817'),('cl_818'),('cl_819'),('cl_820'),('cl_821'),('cl_822'),('cl_823'),('cl_824'),('cl_825'),('cl_826'),('cl_827'),('cl_828'),('cl_829'),('cl_830'),('cl_831'),('cl_832'),('cl_833'),('cl_834'),('cl_835'),('cl_836'),('cl_837'),('cl_838'),('cl_839'),('cl_840'),('cl_841'),('cl_842'),('cl_843'),('cl_844'),('cl_845'),('cl_846'),('cl_847'),('cl_848'),('cl_849'),('cl_850'),('cl_851'),('cl_852'),('cl_853'),('cl_854'),('cl_855'),('cl_856'),('cl_857'),('cl_858'),('cl_859'),('cl_860'),('cl_861'),('cl_862'),('cl_863'),('cl_864'),('cl_865'),('cl_866'),('cl_867'),('cl_868'),('cl_869'),('cl_870'),('cl_871'),('cl_872'),('cl_873'),('cl_874'),('cl_875'),('cl_876'),('cl_877'),('cl_878'),('cl_879'),('cl_880'),('cl_881'),('cl_882'),('cl_883'),('cl_884'),('cl_885'),('cl_886'),('cl_887'),('cl_888'),('cl_889'),('cl_890'),('cl_891'),('cl_892'),('cl_893'),('cl_894'),('cl_895'),('cl_896'),('cl_897'),('cl_898'),('cl_899'),('cl_900'),('cl_901'),('cl_902'),('cl_903'),('cl_904'),('cl_905'),('cl_906'),('cl_907'),('cl_908'),('cl_909'),('cl_910'),('cl_911'),('cl_912'),('cl_913'),('cl_914'),('cl_915'),('cl_916'),('cl_917'),('cl_918'),('cl_919'),('cl_920'),('cl_921'),('cl_922'),('cl_923'),('cl_924'),('cl_925'),('cl_926'),('cl_927'),('cl_928'),('cl_929'),('cl_930'),('cl_931'),('cl_932'),('cl_933'),('cl_934'),('cl_935'),('cl_936'),('cl_937'),('cl_938'),('cl_939'),('cl_940'),('cl_941'),('cl_942'),('cl_943'),('cl_944'),('cl_945'),('cl_946'),('cl_947'),('cl_948'),('cl_949'),('cl_950'),('cl_951'),('cl_952'),('cl_953'),('cl_954'),('cl_955'),('cl_956'),('cl_957'),('cl_958'),('cl_959'),('cl_960'),('cl_961'),('cl_962'),('cl_963'),('cl_964'),('cl_965'),('cl_966'),('cl_967'),('cl_968'),('cl_969'),('cl_970'),('cl_971'),('cl_972'),('cl_973'),('cl_974'),('cl_975'),('cl_976'),('cl_977'),('cl_978')
,('cl_979'),('cl_980'),('cl_981'),('cl_982'),('cl_983'),('cl_984'),('cl_985'),('cl_986'),('cl_987'),('cl_988'),('cl_989'),('cl_990'),('cl_991'),('cl_992'),('cl_993'),('cl_994'),('cl_995'),('cl_996'),('cl_997'),('cl_998'),('cl_999')
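
# optional sanity check: list the table directory to confirm that 256 bucket
# files were written (the path below is illustrative, assuming the default
# warehouse location)
*hive> dfs -ls /user/hive/warehouse/bucket_x;*
// expect 256 files, one per bucket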


# check for total count
*hive> select count(*) from bucket_x;*
OK
*1000 // <-- returned 1000. correct.*

# check for count specific classifier (without tablesample)
*hive> select count(*) from bucket_x where classifier='cl_900';*
Query ID = irteam_20160515201754_d381aff8-16ef-48be-b829-f1a01a530521
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id
application_1462971998082_0025)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
 KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       1
    0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0
    0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.63 s

--------------------------------------------------------------------------------
OK
*1 // <-- returned 1. correct.*
Time taken: 8.064 seconds, Fetched: 1 row(s)


# count specific classifier (with tablesample)

*hive> select count(*) from bucket_x tablesample(bucket 1 out of 256 on
classifier='cl_900') where classifier='cl_900';*
Query ID = irteam_20160515201913_91166686-b98c-40a4-990b-690c41c69c61
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id
application_1462971998082_0025)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
 KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0
    0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0
    0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.81 s

--------------------------------------------------------------------------------
OK
*0 // <--- it returns Zero. WRONG! it should return 1 row.*
Time taken: 4.216 seconds, Fetched: 1 row(s)

# so I checked whole tablesampled data.
*hive> select * from bucket_x tablesample(bucket 1 out of 256 on
classifier='cl_900');*
OK
cl_974
cl_336
cl_457
...
cl_852
cl_698
cl_731
Time taken: 0.053 seconds, Fetched: 999 row(s)
*// <-- it returned 999 ROWS*
*// ??? 999 ???*
*// every row except exactly the one I searched for.*


I think I am doing tablesample totally wrong in the case of string values.
Any idea?
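
(A plausible explanation, reading the plans quoted below: TABLESAMPLE hashes
whatever expression is given in the ON clause. With

  tablesample(bucket 1 out of 256 on classifier='cl_900')

the hashed expression is the boolean (classifier = 'cl_900'), not the column.
Assuming Hive hashes false to 0 and true to 1, bucket 1 keeps rows where
hash(expr) % 256 = 0, i.e. exactly the 999 rows where the predicate is false,
while the single matching row lands in bucket 2. A quick way to test that
hypothesis:

hive> select count(*) from bucket_x
    > tablesample(bucket 2 out of 256 on classifier='cl_900');
-- should return 1 if the boolean-hash explanation is right

To sample by the column itself, the ON clause must name the column, as in
tablesample(bucket N out of 256 on classifier), but then N has to be whichever
bucket hash('cl_900') % 256 actually falls into, and Hive does not derive that
from the WHERE clause.)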

2016-05-15 2:56 GMT+09:00 Mich Talebzadeh <mi...@gmail.com>:

> This is your code
>
> SELECT COUNT(*) FROM X
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
> WHERE action_id='aaa' AND classifier='bbb'
>
> Well I have a table dummy with 1 billion rows imported from Oracle as ORC
> format
>
> hive> show create table dummy;
> OK
> CREATE TABLE `dummy`(
>   `id` int,
>   `clustered` int,
>   `scattered` int,
>   `randomised` int,
>   `random_string` varchar(50),
>   `small_vc` varchar(10),
>   `padding` varchar(10))
>
>
>
> CLUSTERED BY (id)
> INTO 256 BUCKETS
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION
>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
> TBLPROPERTIES (
>   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
>   'numFiles'='1',
>   'numRows'='100000000',
>   'orc.bloom.filter.columns'='ID',
>   'orc.bloom.filter.fpp'='0.05',
>   'orc.compress'='SNAPPY',
>   'orc.create.index'='true',
>   'orc.row.index.stride'='10000',
>   'orc.stripe.size'='16777216',
>   'rawDataSize'='0',
>   'totalSize'='5662644579',
>   'transient_lastDdlTime'='1463245925')
>
>
> If I turn on the plan for the following two cases. First a simple case
>
> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>     > where id = 20;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Reducer 2 <- Map 1 (GROUP, 1)
>       DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: dummy
>                   Statistics: Num rows: 100000000 Data size: 5662644736
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (id = 20) (type: boolean)
>                     Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
>                     Select Operator
>                       Statistics: Num rows: 50000000 Data size: 2831322368
> Basic stats: COMPLETE Column stats: NONE
>                       Group By Operator
>                         aggregations: count(1)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           sort order:
>                           Statistics: Num rows: 1 Data size: 8 Basic
> stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: bigint)
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                   table:
>                       input format:
> org.apache.hadoop.mapred.TextInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> Time taken: 0.064 seconds, Fetched: 51 row(s)
>
>
> Now we try tablesample, but assigning the predicate values inside the
> brackets, because you are looking in bucket 1 for those values and you want
> the optimizer to know that.
>
>
>
> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID = 10)
>     > ;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Reducer 2 <- Map 1 (GROUP, 1)
>       DagName:
> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: dummy
>                   Statistics: Num rows: 100000000 Data size: 5662644736
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (((hash((id = 10)) & 2147483647) % 256) = 0) (type: boolean)
>                     Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
>                     Select Operator
>                       Statistics: Num rows: 50000000 Data size: 2831322368
> Basic stats: COMPLETE Column stats: NONE
>                       Group By Operator
>                         aggregations: count(1)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           sort order:
>                           Statistics: Num rows: 1 Data size: 8 Basic
> stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: bigint)
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                   table:
>                       input format:
> org.apache.hadoop.mapred.TextInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
>
>
> Otherwise I don't see much happening
>
> hive>  EXPLAIN SELECT COUNT(1) FROM dummy
>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
>     > WHERE ID = 10;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Reducer 2 <- Map 1 (GROUP, 1)
>       DagName:
> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: dummy
>                   Statistics: Num rows: 100000000 Data size: 5662644736
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (false and (id = 10)) (type: boolean)
>                     Statistics: Num rows: 25000000 Data size: 1415661184
> Basic stats: COMPLETE Column stats: NONE
>                     Select Operator
>                       Statistics: Num rows: 25000000 Data size: 1415661184
> Basic stats: COMPLETE Column stats: NONE
>                       Group By Operator
>                         aggregations: count(1)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           sort order:
>                           Statistics: Num rows: 1 Data size: 8 Basic
> stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: bigint)
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                   table:
>                       input format:
> org.apache.hadoop.mapred.TextInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
>
> In general, in my experience, bucketing is the one area where ORC tables
> really come in handy.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 May 2016 at 13:38, no jihun <je...@gmail.com> wrote:
>
>> Ah, as I mentioned, both action_id and classifier are STRING fields, and I
>> cannot change the type.
>>
>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>> STORED AS ORC
>>
>> I hash and bucket on two fields because neither field on its own is well
>> distributed.
>>
>> My concern is not about the strength of the hash source but about how I can
>> tablesample down to a specific bucket using the field values provided in the
>> WHERE clause.
>>
>> When I cluster by string fields, which form is right for tablesample?
>> 1. provide fields
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>
>> 2. provide values
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  'aaa', 'bbb')
>> 2016. 5. 14. 오후 8:48에 "Mich Talebzadeh" <mi...@gmail.com>님이 작성:
>>
Can action_id be created as a numeric column?
>>>
>>> CREATE TABLE X ( action_id bigint,  ..)
>>>
>>> Bucketing or hash partitioning best works on numeric columns with high
>>> cardinality (say a primary key).
>>>
>>> From my old notes:
>>>
>>> Bucketing in Hive refers to hash partitioning where a hashing function
>>> is applied. Like an RDBMS such as Oracle, Hive will apply a linear hashing
>>> algorithm to prevent data from clustering within specific partitions.
>>> Hashing is very effective if the column selected for bucketing has very
>>> high selectivity like an ID column where selectivity (select
>>> count(distinct(column))/count(column) ) = 1.  In this case, the created
>>> partitions/ files will be as evenly sized as possible. In a nutshell
>>> bucketing is a method to get data evenly distributed over many
>>> partitions/files.  One should define the number of buckets by a power of
>>> two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
>>> will help concurrency in Hive. It may even allow a partition wise join i.e.
>>> a join between two tables that are bucketed on the same column with the
>>> same number of buckets (has anyone tried this?)
>>>
>>>
>>>
>>> One more thing. When one defines the number of buckets at table
>>> creation level in Hive, the number of partitions/files will be fixed. In
>>> contrast, with partitioning you do not have this limitation.
>>>
>>> Can you run
>>>
>>> show create table X
>>>
>>> and send the output, please?
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 May 2016 at 12:23, no jihun <je...@gmail.com> wrote:
>>>
>>>> Hello.
>>>>
>>>> I want to ask the correct bucketing and tablesample way.
>>>>
>>>> There is a table X which I created by
>>>>
>>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>>> STORED AS ORC
>>>>
>>>> Then I inserted 500M of rows into X by
>>>>
>>>> set hive.enforce.bucketing=true;
>>>> INSERT OVERWRITE INTO X SELECT * FROM X_RAW
>>>>
>>>> Then I want to count or search some rows with condition. roughly,
>>>>
>>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>>
>>>> But I'd better to USE tablesample as I clustered X (action_id,
>>>> classifier). So, the better query will be
>>>>
>>>> SELECT COUNT(*) FROM X
>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>> WHERE action_id='aaa' AND classifier='bbb'
>>>>
>>>> Is there any wrong above? But I can't not find any performance gain
>>>> between these two query.
>>>>
>>>> query1 and RESULT( with no tablesample.)
>>>>
>>>> SELECT COUNT(*)) from X
>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>
>>>> ------------------------------------------------------------
>>>> --------------------
>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>> FAILED  KILLED
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> Map 1 ..........   SUCCEEDED    256        256        0        0
>>>> 0       0
>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>> 0       0
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>> 15.35 s
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> It scans full data.
>>>>
>>>> query 2 and RESULT
>>>>
>>>> SELECT COUNT(*)) from X
>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>
>>>> ------------------------------------------------------------
>>>> --------------------
>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>> FAILED  KILLED
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> Map 1 ..........   SUCCEEDED    256        256        0        0
>>>> 0       0
>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>> 0       0
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>> 15.82     s
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> It ALSO scans full data.
>>>>
>>>> query 2 RESULT WHAT I EXPECTED.
>>>>
>>>> Result what I expected is something like...
>>>> (use 1 map and relatively faster than without tabmesample)
>>>> ------------------------------------------------------------
>>>> --------------------
>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>> FAILED  KILLED
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> Map 1 ..........   SUCCEEDED      1          1        0        0
>>>> 0       0
>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>> 0       0
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>> 3.xx     s
>>>> ------------------------------------------------------------
>>>> --------------------
>>>>
>>>> Values of action_id and classifier are well distributed and there is no
>>>> skewed data.
>>>>
>>>> So I want to ask you what will be a correct query that prune and target
>>>> specific bucket by multiple column?
>>>>
>>>
>>>
>


-- 
----------------------------------------------
Jihun No ( 노지훈 )
----------------------------------------------
Twitter          : @nozisim
Facebook       : nozisim
Website         : http://jeesim2.godohosting.com
---------------------------------------------------------------------------------
Market Apps   : android market products.
<https://market.android.com/developer?pub=%EB%85%B8%EC%A7%80%ED%9B%88>

Re: clustered bucket and tablesample

Posted by Mich Talebzadeh <mi...@gmail.com>.
This is your code

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'

Well I have a table dummy with 1 billion rows imported from Oracle as ORC
format

hive> show create table dummy;
OK
CREATE TABLE `dummy`(
  `id` int,
  `clustered` int,
  `scattered` int,
  `randomised` int,
  `random_string` varchar(50),
  `small_vc` varchar(10),
  `padding` varchar(10))



CLUSTERED BY (id)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'numFiles'='1',
  'numRows'='100000000',
  'orc.bloom.filter.columns'='ID',
  'orc.bloom.filter.fpp'='0.05',
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.row.index.stride'='10000',
  'orc.stripe.size'='16777216',
  'rawDataSize'='0',
  'totalSize'='5662644579',
  'transient_lastDdlTime'='1463245925')
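
(A side note on the ORC properties above: with 'orc.bloom.filter.columns'='ID'
and 'orc.create.index'='true', a point lookup on id can skip most stripes and
row groups even without bucket pruning, provided ORC predicate pushdown is
enabled. A minimal sketch, assuming a Hive version where this setting exists:

hive> set hive.optimize.index.filter=true;  -- push the WHERE predicate into the ORC reader
hive> select count(1) from dummy where id = 20;

The actual benefit depends on the Hive version and how the data is laid out.)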


If I turn on the plan for the following two cases. First a simple case

hive> EXPLAIN SELECT COUNT(1) FROM dummy
    > where id = 20;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (id = 20) (type: boolean)
                    Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 50000000 Data size: 2831322368
Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.064 seconds, Fetched: 51 row(s)
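
As an aside: dummy carries an ORC bloom filter on ID
('orc.bloom.filter.columns'='ID' above), so even this full scan can skip
stripes and row groups once predicate pushdown is on. A minimal sketch,
assuming a reasonably recent Hive where this setting exists:

set hive.optimize.index.filter=true;  -- push (id = 20) down into the ORC reader
SELECT COUNT(1) FROM dummy WHERE id = 20;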


Now we try tablesample, but assign the predicate values inside the
brackets as below, because you are looking in bucket 1 for those values
and you want the optimizer to know that.



hive> EXPLAIN SELECT COUNT(1) FROM dummy
    > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID = 10)
    > ;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator

                    predicate: (((hash((id = 10)) & 2147483647) % 256) = 0) (type: boolean)
                    Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 50000000 Data size: 2831322368
Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink


Otherwise, I don't see much happening:

hive>  EXPLAIN SELECT COUNT(1) FROM dummy
    > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
    > WHERE ID = 10;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (false and (id = 10)) (type: boolean)
                    Statistics: Num rows: 25000000 Data size: 1415661184
Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 25000000 Data size: 1415661184
Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
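
Note what the optimizer did in both sampled plans: the sampling
expression only becomes a per-row filter over its hash. In the second
plan it is the boolean (id = 10), not the column value, that gets
hashed, so neither query prunes any bucket files. To prune at the file
level, the TABLESAMPLE has to be ON the CLUSTERED BY column itself,
with the right bucket number. A minimal sketch, assuming the table was
loaded with hive.enforce.bucketing=true (one file per bucket) and that
Hive's bucket hash of an int is the value itself, so id = 10 falls in
bucket (10 % 256) + 1 = 11 (TABLESAMPLE bucket numbers are 1-based):

hive> SELECT COUNT(1) FROM dummy
    > TABLESAMPLE (BUCKET 11 OUT OF 256 ON id)  -- ON matches CLUSTERED BY (id)
    > WHERE id = 10;

Because the ON columns match the CLUSTERED BY columns and the bucket
counts match, Hive should read just that one bucket file instead of
filtering all 256.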

In general, in my experience, bucketing is the one area where ORC
tables really come in handy.
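
A further hedged note: recent Hive releases (2.0 and later on Tez, per
HIVE-11525) can prune buckets straight from equality predicates on all
of the bucketing columns, with no TABLESAMPLE at all. Assuming a
version that has it:

set hive.tez.bucket.pruning=true;  -- let the optimizer map WHERE equalities to buckets
SELECT COUNT(*) FROM X
WHERE action_id = 'aaa' AND classifier = 'bbb';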

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Re: clustered bucket and tablesample

Posted by no jihun <je...@gmail.com>.
Ah, as I mentioned, both action_id and classifier are STRING fields,
and I cannot change the type.

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

I use two fields for hashing and bucketing because neither field on
its own is well distributed.

My concern is not about the strength of the hash source, but about how
I can TABLESAMPLE down to a bucket using the field values provided in
the WHERE clause.

When I cluster by string fields, which form is right for TABLESAMPLE?
1. provide the fields:
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)

2. provide the values:
TABLESAMPLE(BUCKET 1 OUT OF 256 ON 'aaa', 'bbb')
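
For what it's worth, I can at least see which bucket file a given key
pair landed in by asking Hive itself, assuming the bucket files keep
the usual 000000_0 style names; this uses the virtual column
INPUT__FILE__NAME:

SELECT INPUT__FILE__NAME, COUNT(*)  -- which bucket file holds these keys?
FROM X
WHERE action_id = 'aaa' AND classifier = 'bbb'
GROUP BY INPUT__FILE__NAME;

If all matching rows sit in one file such as .../x/000012_0, the pair
hashes to bucket 13 (file names are 0-based, TABLESAMPLE is 1-based),
and TABLESAMPLE(BUCKET 13 OUT OF 256 ON action_id, classifier) should
target just that bucket.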

Re: clustered bucket and tablesample

Posted by Mich Talebzadeh <mi...@gmail.com>.
Can action_id be created as a numeric column?

CREATE TABLE X ( action_id bigint,  ..)

Bucketing or hash partitioning works best on numeric columns with high
cardinality (say, a primary key).

From my old notes:

Bucketing in Hive refers to hash partitioning, where a hashing function
is applied. Like an RDBMS such as Oracle, Hive applies a linear hashing
algorithm to prevent data from clustering within specific partitions.
Hashing is very effective if the column selected for bucketing has very
high selectivity, like an ID column where selectivity (select
count(distinct(column))/count(column)) = 1. In that case, the created
partitions/files will be as evenly sized as possible. In a nutshell,
bucketing is a method to get data evenly distributed over many
partitions/files. One should define the number of buckets as a power of
two (2^n: 2, 4, 8, 16, etc.) to achieve the best results. Bucketing
also helps concurrency in Hive. It may even allow a partition-wise
join, i.e. a join between two tables that are bucketed on the same
column with the same number of buckets, as sketched below (has anyone
tried this?).
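
A sketch of that join, assuming two tables both CLUSTERED BY (id) into
the same number of buckets (the table and column names here are only
illustrative):

set hive.optimize.bucketmapjoin = true;  -- join bucket-by-bucket on the map side
SELECT /*+ MAPJOIN(b) */ a.id, b.amount
FROM orders_bucketed a
JOIN payments_bucketed b ON a.id = b.id;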



One more thing: when one defines the number of buckets at table
creation time in Hive, the number of partitions/files is fixed. In
contrast, with partitioning you do not have this limitation.

Can you run

show create table X

and send the output, please?



Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


