Posted to user@hive.apache.org by "Petter von Dolwitz (Hem)" <pe...@gmail.com> on 2014/01/09 10:06:51 UTC

Re: Pointing multiple external tables to the same location

Hi Navis (and others),

It seems like my solution with views does not work after all. That is, it
works fine as long as I do not use filter pushdown. My setup is something
like the following:

CREATE EXTERNAL TABLE MasterTable (
  column1 STRING,
  column2 STRING,
  column3 STRING,
  column4 STRING)
  PARTITIONED BY (partition INT)
  ROW FORMAT SERDE 'MySerde'
  STORED AS INPUTFORMAT 'MyInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 'my_location';

CREATE VIEW IF NOT EXISTS View1
  (column1, column2, column3, column4, partition)
PARTITIONED ON (partition)
AS SELECT column1, column2, column3, column4, partition
FROM MasterTable
WHERE column1='value1' AND column2='value2';

CREATE VIEW IF NOT EXISTS View2
  (column1, column2, column3, column4, partition)
PARTITIONED ON (partition)
AS SELECT column1, column2, column3, column4, partition
FROM MasterTable
WHERE column1='value3' AND column2='value4';


The following query works fine without filter pushdown:
SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);

Now, if I enable filter pushdown (setting hive.optimize.index.filter=true)
and apply the filter in my record reader, I do not get the correct result. I
do not get any records back at all. It seems like only the second filter
(column1='value3' AND column2='value4') is pushed to my record reader. The
underlying file is only traversed once. I would have expected either that an
OR expression was pushed down ((column1='value3' AND column2='value4') OR
(column1='value1' AND column2='value2')) or that the underlying file was
scanned twice, once with each separate expression.

Do you have any thoughts on this?

Thanks,
Petter





2013/12/22 Petter von Dolwitz (Hem) <pe...@gmail.com>

> Hi Navis,
>
> thank you for sorting this out! I have tried getting around this by using
> views towards a single master table instead, in combination with UDFs.
> Seems to work so far.
>
> /Petter
>
>
> 2013/12/18 Navis류승우 <na...@nexr.com>
>
>> Hive uses a path-to-table (or partition) mapping internally (you can see
>> that in MapredWork, etc.), which might cause the first table to be
>> overwritten by the other.
>>
>> I haven't tried symlinks on HDFS, which could be a solution.
>>
>>
>>
>> 2013/12/12 Petter von Dolwitz (Hem) <pe...@gmail.com>
>>
>> Hi,
>>>
>>> I have declared several external tables pointing to the same location.
>>> The thing that tells these tables apart (apart from their names) is that
>>> they have unique properties. These properties help me choose the correct
>>> rows from the underlying file. I use a single storage handler (accompanied
>>> by a single InputFormat and a single SerDe). The first columns in all
>>> tables are the same, but the last (a struct) is unique and is constructed
>>> by the SerDe (with the help of the SerDe properties). A simplified version
>>> of the tables looks like this:
>>>
>>> CREATE EXTERNAL TABLE Table1 (
>>>   column1 STRING,
>>>   column2 STRING)
>>>   STORED BY 'MyStorageHandler'
>>>   WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1')
>>>   LOCATION 'mylocation'
>>>   TBLPROPERTIES('recordreader.filter'='table1_filter');
>>>
>>> CREATE EXTERNAL TABLE Table2 (
>>>   column1 STRING,
>>>   column2 STRING)
>>>   STORED BY 'MyStorageHandler'
>>>   WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2')
>>>   LOCATION 'mylocation'
>>>   TBLPROPERTIES('recordreader.filter'='table2_filter');
>>>
>>>
>>> Everything works well for simple SELECT queries against the two tables.
>>> The following query gives very strange results, though:
>>>
>>> SELECT * FROM (
>>>   SELECT column1,'Table1' FROM Table1 WHERE column2 = 'myValue'
>>>   union all
>>>   SELECT column1,'Table2' FROM Table2 WHERE column2 = 'myValue'
>>>   ) my_union
>>> ORDER BY my_union.column1
>>>
>>>
>>> It seems like one job task is created per file stored in the table
>>> location. This task gets the table properties from the second table, and
>>> in the SerDe step later on the records seem to get mixed up.
>>>
>>> I would have expected that Hive would need to iterate over the source
>>> files twice, using two different tasks (with the correct table properties
>>> passed), in order to get this to work.
>>>
>>> Anyone here who can shed some light on this scenario?
>>>
>>> Thanks,
>>> Petter

Re: Pointing multiple external tables to the same location

Posted by "Petter von Dolwitz (Hem)" <pe...@gmail.com>.
Hi Navis,

One scan task, given the correct filter, would be most efficient, I guess.
I think the filter pushdown mechanism is not working correctly in this
case. It cannot be too exotic a use case to have two views on a table that,
when queried, yield two different SELECT statements against this table.
Still, the filter pushdown mechanism presents only one of them. In my mind,
the pushed-down filter should either be correct (a common denominator of
the two SELECT statements) or empty. Or do you think I have violated some
design rule by designing the database this way?
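
In the meantime, the only safe behaviour I can see for my reader is to
treat the pushed-down filter as a hint. Since each view keeps its own
Filter Operator in the plan (as in the EXPLAIN you posted), returning too
many rows from the reader is harmless, while dropping rows (as happens with
the wrong filter here) loses data. A rough, untested sketch (the
sharedLocation flag is my own hypothetical configuration, not a Hive API):

import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;

public class SafePushdown {
  // Prune early only when this reader can trust the pushed filter.
  // 'sharedLocation' is a hypothetical flag I would set myself when
  // several tables/views point at the same path.
  static ExprNodeDesc effectiveFilter(ExprNodeDesc pushed,
                                      boolean sharedLocation) {
    if (pushed == null || sharedLocation) {
      return null; // scan everything; Hive's Filter Operators still apply
    }
    return pushed; // safe to prune early in the RecordReader
  }
}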

Links are not a viable option in my situation.

Thanks,
Petter


Re: Pointing multiple external tables to the same location

Posted by Navis류승우 <na...@nexr.com>.
I think it's back to the original problem.

What you wanted is a separate scan (task) for each view, but Hive does not
work like that. If two tables or views (or a mix of them) have the same
location, they are regarded as the same table with the same table
description (which will be overwritten by the lastly visited table or view).
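
Roughly, the plan keys its descriptors by path, so a second table on the
same path replaces the first one's descriptor. A simplified sketch of the
structures (cf. the real fields in
org.apache.hadoop.hive.ql.plan.MapredWork):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import org.apache.hadoop.hive.ql.plan.PartitionDesc;

class PlanSketch {
  // several aliases (tables or views) may share one input path...
  LinkedHashMap<String, ArrayList<String>> pathToAliases =
      new LinkedHashMap<String, ArrayList<String>>();
  // ...but each path maps to exactly one descriptor, so the properties
  // of the lastly registered table at that path win.
  LinkedHashMap<String, PartitionDesc> pathToPartitionInfo =
      new LinkedHashMap<String, PartitionDesc>();
}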

As I suggested in my first reply, a Hadoop link might be helpful for
working around this.



Re: Pointing multiple external tables to the same location

Posted by "Petter von Dolwitz (Hem)" <pe...@gmail.com>.
I'm using Hive 0.10 (the version bundled with CDH4.4).

The EXPLAIN output at my end looks similar to yours. I guess my real concern
is about the way I have implemented the filters.

This is how I have done it (roughly as in the sketch below):
- In the constructor of my RecordReader I read the property
hive.io.filter.expr.serialized and use the IndexPredicateAnalyzer to find
out which parts of the filter I can apply in my RecordReader.
- I process only the rows that match the filter.
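
A minimal sketch of that constructor logic, assuming the Hive 0.10 APIs
(the class name and the allowed columns are placeholders taken from the
example above):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.index.IndexPredicateAnalyzer;
import org.apache.hadoop.hive.ql.index.IndexSearchCondition;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.ql.plan.TableScanDesc;
import org.apache.hadoop.mapred.JobConf;

public class MyRecordReaderFilterSetup {

  // Extracts the search conditions this reader can evaluate itself.
  static List<IndexSearchCondition> analyzePushedFilter(JobConf jobConf) {
    List<IndexSearchCondition> conditions =
        new ArrayList<IndexSearchCondition>();

    // TableScanDesc.FILTER_EXPR_CONF_STR == "hive.io.filter.expr.serialized"
    String serialized = jobConf.get(TableScanDesc.FILTER_EXPR_CONF_STR);
    if (serialized == null) {
      return conditions; // nothing was pushed down
    }
    ExprNodeDesc filterExpr =
        Utilities.deserializeExpression(serialized, jobConf);

    IndexPredicateAnalyzer analyzer = new IndexPredicateAnalyzer();
    analyzer.addComparisonOp(
        "org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual");
    analyzer.allowColumnName("column1");
    analyzer.allowColumnName("column2");

    // Splits the filter into conditions this reader can handle; the
    // residual is evaluated again by Hive's Filter Operator anyway.
    analyzer.analyzePredicate(filterExpr, conditions);
    return conditions;
  }
}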

Since the filter represented in hive.io.filter.expr.serialized only
contains one of the filters (column1 < 100 in the example above), the rows
matching the other filter (column1 < 30) are lost. This specific example is
overlapping, so I'm not sure the result shows the problem (column1 < 30 is
covered by column1 < 100). In the example at my end the filters do not
overlap.

Is the RecordReader the correct place to implement this filter? Should it
work, or should the filter integration be done at another level? For the
example above, what did you expect hive.io.filter.text to contain?

I might add that the tables are partitioned, if that makes any difference. I
originally had filter negotiation in place in a StorageHandler, but the
StorageHandler did not support partitions, so I switched to implementing the
filter directly in the RecordReader. In the RecordReader I cannot negotiate
the filter with Hive, but I can apply the parts of the filter that I can
handle in order to prune data early.
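
For reference, the negotiation I had in the StorageHandler looked roughly
like this (a sketch against the Hive 0.10 HiveStoragePredicateHandler
interface; in reality the method is implemented on the storage handler
itself, and the analyzer setup is the same as above):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.index.IndexPredicateAnalyzer;
import org.apache.hadoop.hive.ql.index.IndexSearchCondition;
import org.apache.hadoop.hive.ql.metadata.HiveStoragePredicateHandler;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.mapred.JobConf;

public class MyPredicateNegotiation {

  // Tells Hive which part of the predicate the storage layer consumes
  // (pushedPredicate) and which part Hive must still evaluate itself
  // (residualPredicate).
  public static HiveStoragePredicateHandler.DecomposedPredicate
      decomposePredicate(JobConf jobConf, Deserializer deserializer,
                         ExprNodeDesc predicate) {
    IndexPredicateAnalyzer analyzer = new IndexPredicateAnalyzer();
    analyzer.addComparisonOp(
        "org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual");
    analyzer.allowColumnName("column1");
    analyzer.allowColumnName("column2");

    List<IndexSearchCondition> conditions =
        new ArrayList<IndexSearchCondition>();
    ExprNodeDesc residual = analyzer.analyzePredicate(predicate, conditions);

    HiveStoragePredicateHandler.DecomposedPredicate decomposed =
        new HiveStoragePredicateHandler.DecomposedPredicate();
    decomposed.pushedPredicate =
        analyzer.translateSearchConditions(conditions);
    decomposed.residualPredicate = residual;
    return decomposed;
  }
}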

Thank you for your support,
Petter

Re: Pointing multiple external tables to the same location

Posted by Navis류승우 <na...@nexr.com>.
In my try, it worked (and it should).

CREATE EXTERNAL TABLE MasterTable (
  column1 STRING, column2 STRING)
  LOCATION 'hdfs://localhost:9000/home/navis/my_location';

CREATE VIEW IF NOT EXISTS View1 (column1, column2) AS SELECT column1,
column2 FROM MasterTable WHERE column1<30;

CREATE VIEW IF NOT EXISTS View2 (column1, column2) AS SELECT column1,
column2 FROM MasterTable WHERE column1<100;

SELECT View1.* FROM View1 JOIN View2 ON (View1.column1 = View2.column1);

Below is the result of EXPLAIN: a single whole scan of the master table,
handled by two TableScan (TS) operators, each followed by a Filter (FIL)
operator with the expected predicates.

        view1:view1:mastertable
          TableScan
            alias: mastertable
            Filter Operator
              predicate:
                  expr: (column1 < 30)
                  type: boolean

        view2:view2:mastertable
          TableScan
            alias: mastertable
            Filter Operator
              predicate:
                  expr: (column1 < 100)
                  type: boolean

      Truncated Path -> Alias:
        hdfs://localhost:9000/home/navis/my_location
[view1:view1:mastertable, view2:view2:mastertable]

Can I ask which version of Hive you are using?

