Posted to dev@hive.apache.org by An...@parc.com on 2015/04/01 04:42:37 UTC

Request for feedback on work intent for non-equijoin support

Dear Hive development community members,

I am interested in learning more about the current support for non-equijoins in Hive and/or other Hadoop SQL engines, and in getting feedback about community interest in more extensive support for such a feature. I intend to work on this challenge, assuming people find it compelling, and I intend to contribute results to the community. Where possible, it would be great to receive feedback and engage in collaborations along the way (for a bit more context, see the postscript of this message).

My initial goal is to support query conditions such as the following:

A.x < B.y
A.x in_range [B.y, B.z]
distance(A.x, B.y) < D

where A and B are distinct tables/files. It is my understanding that current support for non-equijoins like those above is quite limited, and that where some forms are supported (as in Cloudera's Impala), the support relies on a potentially expensive cross-product join. Depending on the data types involved, I believe that joins with these conditions can be made tractable (at least on average) with join algorithms that exploit properties of the data types, possibly with some pre-scanning of the data.
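
To illustrate the idea on a single machine, here is a minimal Python sketch for the first condition (A.x < B.y). It contrasts the cross-product baseline with a variant that exploits the total order of the values by sorting B once. The function names are hypothetical, and this is not how Hive or Impala implement joins; it only shows the kind of data type property an algorithm could exploit.

```python
from bisect import bisect_right


def cross_product_join(a_vals, b_vals):
    """Baseline: test every (x, y) pair, as a cross product plus filter would."""
    return [(x, y) for x in a_vals for y in b_vals if x < y]


def sorted_inequality_join(a_vals, b_vals):
    """Sort B once, then binary-search for the first y greater than each x."""
    b_sorted = sorted(b_vals)
    result = []
    for x in a_vals:
        start = bisect_right(b_sorted, x)  # index of the first y with y > x
        result.extend((x, y) for y in b_sorted[start:])
    return result
```

Both functions return the same pairs, and the output can still be quadratic in the worst case. The sorted variant only avoids comparing pairs that cannot match.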

I am asking for feedback on the community's interest in and need for this work, as well as any pointers to similar work. In particular, I would appreciate any answers people could give to the following questions:

- Is my understanding of the state of the art in Hive and similar tools accurate? Are there groups currently working on similar or related issues, or tools that already accomplish some or all of what I have proposed?
- Is there significant value to the community in the support of such a feature? In other words, are the manual workarounds necessary because of the absence of non-equijoins such as these enough of a pain to justify the work I propose?
- Being aware that the potential pre-scanning adds to the cost of the join, and that the data could still blow up in the worst case, am I missing any other important considerations and tradeoffs for this problem?
- What would be a good avenue to contribute this feature to the community (e.g. as a standalone tool on top of Hadoop, or as a Hive extension or plugin)?
- What is the best way to get started in working with the community?

Thanks for your attention and any info you can provide!

Andres Quiroz

P.S. If you are interested in some context, and why/how I am proposing to do this work, please read on.

I am part of a small project team at PARC working on the general problems of data integration and automated ETL. We have proposed a tool called HiperFuse that is designed to accept declarative, high-level queries in order to produce joined (fused) data sets from multiple heterogeneous raw data sources. In our preliminary work, which you can find here (pointer to the paper), we designed the architecture of the tool and obtained some results separately on the problems of automated data cleansing, data type inference, and query planning. One of the planned prototype implementations of HiperFuse relies on Hadoop MR, and because the declarative language we proposed was closely related to SQL, we thought that we could exploit the existing work in Hive and/or other open-source tools for handling the SQL part and layer our work on top of that. For example, the query given in the paper could easily be expressed in SQL-like form with a non-equijoin condition:

SELECT web_access_log.ip, census.income
FROM web_access_log, ip2zip, census
WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
AND ip2zip.zip = census.zip

As you can see, the first obstacle we hit in bringing the elements together to solve this query end to end was how to realize the non-equality join in the query and make it perform well. The intent now is to tackle this problem in a general sense and provide a solution for a wide range of queries.
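
As a rough illustration of how the in_range condition in the query above could avoid a cross product, the following Python sketch sorts the ip2zip ranges once and uses binary search for each log entry. It makes several assumptions that are not stated in the query: IP addresses are already converted to integers, the [ip_low, ip_high] ranges do not overlap, and each zip code appears once in census. All function and parameter names are hypothetical.

```python
from bisect import bisect_right


def build_range_index(ranges):
    """ranges: iterable of (ip_low, ip_high, zip_code), assumed non-overlapping."""
    ordered = sorted(ranges)
    lows = [low for low, _, _ in ordered]
    return lows, ordered


def lookup(index, key):
    """Return the zip code whose [ip_low, ip_high] contains key, or None."""
    lows, ordered = index
    pos = bisect_right(lows, key) - 1  # last range starting at or before key
    if pos >= 0:
        low, high, zip_code = ordered[pos]
        if low <= key <= high:
            return zip_code
    return None


def range_join(log_ips, ip2zip, income_by_zip):
    """Emit (ip, income) for each log IP that falls inside a known range."""
    index = build_range_index(ip2zip)
    out = []
    for ip in log_ips:
        zip_code = lookup(index, ip)
        if zip_code is not None and zip_code in income_by_zip:
            out.append((ip, income_by_zip[zip_code]))
    return out
```

With n log entries and m ranges, this costs roughly O((n + m) log m) instead of the O(n * m) comparisons of a cross product, but only under the non-overlap assumption. Overlapping ranges would need a different structure, such as an interval tree.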

The work I propose to do would be based on three main components within HiperFuse:

- Enhancements to the extensible data type framework in HiperFuse that would categorize data types based on the properties needed to support the join algorithms, in order to write join-ready domain-specific data type libraries.
- The join algorithms themselves, based on Hive or directly on Hadoop MR.
- A query planner, which would determine the right algorithm to apply and automatically schedule any necessary pre-scanning of the data.
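
The interplay between the first and third components can be pictured with a small Python sketch. Everything in it is hypothetical (the trait names, the condition labels, and the algorithm labels) and it does not describe HiperFuse's actual design. It only shows how a planner might use declared data type properties to pick a join strategy and fall back to a cross product otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeTraits:
    """Hypothetical description of what a data type supports."""
    name: str
    totally_ordered: bool = False  # supports <, <= and range containment
    metric: bool = False           # supports a distance function


def choose_algorithm(condition, traits):
    """Pick a join strategy label from the condition kind and the type traits."""
    if condition == "equals":
        return "hash_join"
    if condition in ("less_than", "in_range") and traits.totally_ordered:
        return "sort_based_join"
    if condition == "within_distance" and traits.metric:
        return "partitioned_distance_join"
    return "cross_product"  # fallback when no useful property is declared
```

For example, a type declared as totally ordered would route an in_range condition to the sort-based strategy, while a type with no declared properties would fall back to the cross product.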


RE: Request for feedback on work intent for non-equijoin support

Posted by "Xu, Cheng A" <ch...@intel.com>.
You can start your work from JoinOperator. Before that, you should follow the steps in https://cwiki.apache.org/confluence/display/Hive/GettingStarted 

-----Original Message-----
From: Andres.Quiroz@parc.com [mailto:Andres.Quiroz@parc.com] 
Sent: Wednesday, April 08, 2015 8:49 PM
To: dev@hive.apache.org
Subject: RE: Request for feedback on work intent for non-equijoin support

So, I'd like to get started on this. The description in the design doc and the theta join paper from Northeastern seem like a good place to start, to have a baseline that I can later use for the more specific join algorithms I want to try. 

I created a JIRA account, and my username is Andres.Quiroz

Brock, since I'm completely new to this code, could you (or anyone else) please point me to the relevant modules to start learning and ramping up? Also, please let me know if I can contact you directly for discussing this specific topic, or if I should always send a message to the mailing list.

Thank you,

Andrés

-----Original Message-----
From: Andres.Quiroz@parc.com [mailto:Andres.Quiroz@parc.com]
Sent: Thursday, April 02, 2015 9:07 AM
To: dev@hive.apache.org
Subject: RE: Request for feedback on work intent for non-equijoin support

This is a great pointer, Szehon and Brock, thank you. I will catch up with the material on theta joins and circle back.

Andrés

-----Original Message-----
From: Brock Noland [mailto:brock@apache.org]
Sent: Thursday, April 02, 2015 1:31 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for non-equijoin support

Nice, it'd be great if someone finally implemented this :)

On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sz...@cloudera.com> wrote:
> From Hive side, there has been some thought on the subject here:
> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has 
> some ideas but nobody has gotten around to giving it a try.  It might 
> be of interest.
>
> Thanks
> Szehon
>
>
> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz 
> <le...@gmail.com>
> wrote:
>
>> D'oh!  Thanks Chao.
>>
>> -- Lefty
>>
>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <ch...@cloudera.com> wrote:
>>
>> > Hey Lefty,
>> >
>> > You need to use the ftp protocol, not http.
>> > After clicking the link, you'll need to remove "http://" from the address bar.
>> >
>> > Best,
>> > Chao
>> >
>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz 
>> > <le...@gmail.com>
>> > wrote:
>> >
>> > > Andrés, I followed that link and got the dread 404 Not Found:
>> > >
>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > was not found on this server."
>> > >
>> > > -- Lefty
>> > >
>> > > On Wed, Apr 1, 2015 at 7:23 PM, <An...@parc.com> wrote:
>> > >
>> > > > Dear Lefty,
>> > > >
>> > > > Thank you very much for pointing that out and for your initial pointers.
>> > > > Here is the missing link:
>> > > >
>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > >
>> > > > Regards,
>> > > >
>> > > > Andrés
>> > > >
>> > > > -----Original Message-----
>> > > > From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>> > > > To: dev@hive.apache.org
>> > > > Subject: Re: Request for feedback on work intent for non-equijoin support
>> > > >
>> > > > Hello Andres, the link to your paper is missing:
>> > > >
>> > > > In our preliminary work, which you can find here (pointer to the paper) ...
>> > > >
>> > > >
>> > > > You can find general information about contributing to Hive in the wiki:
>> > > > Resources for Contributors
>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>> > > > How to Contribute
>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>> > > >
>> > > > -- Lefty
>> > > >

Re: Request for feedback on work intent for non-equijoin support

Posted by An...@parc.com.
Ok, that would be great! Except for Monday and Friday, I could meet any
day next week in the afternoon (Pacific time), since it is the end of the
day for me. 

Thanks a lot,

Andrés

On 5/15/15, 4:13 PM, "Thejas Nair" <th...@gmail.com> wrote:

>Hi Andres,
>Glad to hear about the progress!
>
>Vikram is a Hive join implementation expert. He can guide you through
>this.
>We can set up a WebEx or Google Hangout and discuss this. Does sometime
>next week work for you? (Please let us know some hours that work for
>you, in the Pacific time zone.)
>
>Anybody else who is interested in the theta join work is also welcome
>to join the discussion. Please let me know.
>
>Thanks,
>Thejas
>
>
>On Fri, May 15, 2015 at 12:48 PM,  <An...@parc.com> wrote:
>> Hello,
>>
>> At this point, I have implemented a standalone version of the
>> 1-Bucket-Theta join algorithm described in the Northeastern paper on
>> Hadoop MR, and would like to start porting it to Hive.
>>
>> I have been looking at the code and believe that the main goal would be
>> to implement a new JoinOperator. However, it's still not very clear to me
>> how this class interacts with the rest of the platform (i.e., how it fits
>> into the overall query processing workflow).
>>
>> Could someone please provide or point me to a crash course on
>> implementing a join operator? If nothing else, a list of steps and other
>> classes that I may have to touch or add would be a very helpful starting
>> point.
>>
>> Also, I suppose Tez is preferred for the implementation, right?
>>
>> Thanks for your help,
>>
>> Andrés
>>
>> On 4/8/15, 2:32 PM, "Thejas Nair" <th...@gmail.com> wrote:
>>
>>>Yes, the theta join paper from Northeastern is a good place to start.
>>>There is also a presentation from the folks on YouTube, which is also
>>>very useful.
>>>I had a look at this issue earlier as well, and I had written up a
>>>rough proposal.  I had not organized the document well enough for
>>>sharing publicly, but in case you find it useful, I have attached it
>>>to the wiki:
>>>https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
>>>It also includes a list of some of the changes that are needed (it is
>>>probably not comprehensive enough).


Re: Request for feedback on work intent for non-equijoin support

Posted by Thejas Nair <th...@gmail.com>.
Hi Andres,
Glad to hear about the progress!

Vikram is a Hive join implementation expert. He can guide you through this.
We can set up a WebEx or Google Hangout and discuss this. Does sometime
next week work for you? (Please let us know some hours that work for
you, in the Pacific time zone.)

Anybody else who is interested in the theta join work is also welcome
to join the discussion. Please let me know.

Thanks,
Thejas


On Fri, May 15, 2015 at 12:48 PM,  <An...@parc.com> wrote:
> Hello,
>
> At this point, I have implemented a standalone version of the
> 1-Bucket-Theta join algorithm described in the Northeastern paper on
> Hadoop MR, and would like to start porting it to Hive.
>
> I have been looking at the code and believe that the main goal would be to
> implement a new JoinOperator. However, it's still not very clear to me how
> this class interacts with the rest of the platform (i.e., how it fits into
> the overall query processing workflow).
>
> Could someone please provide or point me to a crash course on implementing
> a join operator? If nothing else, a list of steps and other classes that I
> may have to touch or add would be a very helpful starting point.
>
> Also, I suppose Tez is preferred for the implementation, right?
>
> Thanks for your help,
>
> Andrés
>
> On 4/8/15, 2:32 PM, "Thejas Nair" <th...@gmail.com> wrote:
>
>>Yes, the theta join paper from Northeastern is a good place to start.
>>There is also a presentation from the folks on YouTube, which is also
>>very useful.
>>I had a look at this issue earlier as well, and I had written up a
>>rough proposal.  I had not organized the document well enough for
>>sharing publicly, but in case you find it useful, I have attached it
>>to the wiki:
>>https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
>>It also includes a list of some of the changes that are needed (it is
>>probably not comprehensive enough).
>>
>>
>>On Wed, Apr 8, 2015 at 5:49 AM,  <An...@parc.com> wrote:
>>> So, I'd like to get started on this. The description in the design doc
>>>and the theta join paper from Northeastern seem like a good place to
>>>start, to have a baseline that I can later use for the more specific
>>>join algorithms I want to try.
>>>
>>> I created a JIRA account, and my username is Andres.Quiroz
>>>
>>> Brock, since I'm completely new to this code, could you (or anyone
>>>else) please point me to the relevant modules to start learning and
>>>ramping up? Also, please let me know if I can contact you directly for
>>>discussing this specific topic, or if I should always send a message to
>>>the mailing list.
>>>
>>> Thank you,
>>>
>>> Andrés
>>>
>>> -----Original Message-----
>>> From: Andres.Quiroz@parc.com [mailto:Andres.Quiroz@parc.com]
>>> Sent: Thursday, April 02, 2015 9:07 AM
>>> To: dev@hive.apache.org
>>> Subject: RE: Request for feedback on work intent for non-equijoin
>>>support
>>>
>>> This is a great pointer, Szehon and Brock, thank you. I will catch up
>>>with the material on theta joins and circle back.
>>>
>>> Andrés
>>>
>>> -----Original Message-----
>>> From: Brock Noland [mailto:brock@apache.org]
>>> Sent: Thursday, April 02, 2015 1:31 AM
>>> To: dev@hive.apache.org
>>> Subject: Re: Request for feedback on work intent for non-equijoin
>>>support
>>>
>>> Nice, it'd be great if someone finally implemented this :)
>>>
>>> On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sz...@cloudera.com> wrote:
>>>> From Hive side, there has been some thought on the subject here:
>>>> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has
>>>> some ideas but nobody has gotten around to giving it a try.  It might
>>>> be of interest.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>>
>>>> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz
>>>> <le...@gmail.com>
>>>> wrote:
>>>>
>>>>> D'oh!  Thanks Chao.
>>>>>
>>>>> -- Lefty
>>>>>
>>>>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <ch...@cloudera.com> wrote:
>>>>>
>>>>> > Hey Lefty,
>>>>> >
>>>>> > You need to use the ftp protocol, not http.
>>>>> > After clicking the link, you'll need to remove "http://" from the
>>>>> address
>>>>> > bar.
>>>>> >
>>>>> > Best,
>>>>> > Chao
>>>>> >
>>>>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz
>>>>> > <le...@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> > > Andrés, I followed that link and got the dread 404 Not Found:
>>>>> > >
>>>>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>>> > > was not found on this server."
>>>>> > >
>>>>> > > -- Lefty
>>>>> > >
>>>>> > > On Wed, Apr 1, 2015 at 7:23 PM, <An...@parc.com> wrote:
>>>>> > >
>>>>> > > > Dear Lefty,
>>>>> > > >
>>>>> > > > Thank you very much for pointing that out and for your initial
>>>>> > pointers.
>>>>> > > > Here is the missing link:
>>>>> > > >
>>>>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>>> > > >
>>>>> > > > Regards,
>>>>> > > >
>>>>> > > > Andrés
>>>>> > > >
>>>>> > > > -----Original Message-----
>>>>> > > > From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
>>>>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>>>>> > > > To: dev@hive.apache.org
>>>>> > > > Subject: Re: Request for feedback on work intent for
>>>>> > > > non-equijoin
>>>>> > support
>>>>> > > >
>>>>> > > > Hello Andres, the link to your paper is missing:
>>>>> > > >
>>>>> > > > In our preliminary work, which you can find here (pointer to
>>>>> > > > the
>>>>> paper)
>>>>> > > ...
>>>>> > > >
>>>>> > > >
>>>>> > > > You can find general information about contributing to Hive in
>>>>> > > > the wiki: Resources for Contributors
>>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>>>>> > > > How to Contribute
>>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>>>>> > > >
>>>>> > > > -- Lefty
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Best,
>>>>> > Chao
>>>>> >
>>>>>
>

Re: Request for feedback on work intent for non-equijoin support

Posted by An...@parc.com.
Hello,

At this point, I have implemented a standalone version of the
1-Bucket-Theta join algorithm described in the Northeastern paper, running on
Hadoop MR, and would like to start porting it to Hive.
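
For readers unfamiliar with the algorithm, the core idea can be sketched in a
few lines. The following is a minimal in-memory Python illustration, not the
standalone Hadoop MR implementation mentioned above. It assumes a simplified
uniform grid of regions over the join matrix; the function name and the
parameters grid_rows, grid_cols and seed are hypothetical, chosen only for
this sketch, and the paper picks region shapes more carefully to balance the
input each reducer receives.

```python
import random
from collections import defaultdict


def one_bucket_theta_join(s_rows, t_rows, theta, grid_rows=2, grid_cols=2, seed=0):
    """Simplified 1-Bucket-Theta style join over two in-memory lists.

    The |S| x |T| join matrix is covered by a grid_rows x grid_cols grid of
    regions; each region stands in for one reducer.
    """
    rng = random.Random(seed)
    # region id -> (S tuples routed to it, T tuples routed to it)
    regions = defaultdict(lambda: ([], []))

    # "Map" phase: each S tuple picks a random row band and is replicated to
    # every region in that band; each T tuple does the same with a column band.
    for s in s_rows:
        band = rng.randrange(grid_rows)
        for col in range(grid_cols):
            regions[(band, col)][0].append(s)
    for t in t_rows:
        band = rng.randrange(grid_cols)
        for row in range(grid_rows):
            regions[(row, band)][1].append(t)

    # "Reduce" phase: every region evaluates the arbitrary predicate locally.
    # A given (s, t) pair meets only in region (s's row band, t's column band),
    # so each matching pair is emitted exactly once.
    result = []
    for s_part, t_part in regions.values():
        result.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return result
```

Calling one_bucket_theta_join(a, b, lambda x, y: x < y) on two lists returns
the same set of pairs as a nested-loop join with the condition A.x < B.y,
regardless of the seed. Because the routing is random rather than key-based,
the same skeleton works for any join predicate, at the price of replicating
each input tuple to a full row or column of regions.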

I have been looking at the code and believe that the main goal would be to
implement a new JoinOperator. However, it's still not very clear to me how
this class interacts with the rest of the platform (i.e., how it fits into
the overall query processing workflow).

Could someone please provide or point me to a crash course on implementing
a join operator? If nothing else, a list of steps and other classes that I
may have to touch or add would be a very helpful starting point.

Also, I suppose Tez is preferred for the implementation, right?

Thanks for your help,

Andrés


Re: Request for feedback on work intent for non-equijoin support

Posted by Thejas Nair <th...@gmail.com>.
I don't have cycles for working on it in the next month or two. Maybe
after that.


On Wed, Apr 8, 2015 at 2:16 PM,  <An...@parc.com> wrote:
> This is certainly very helpful, thank you. Do you have any cycles to devote to this issue at the moment, or in the near future?
>
> -----Original Message-----
> From: Thejas Nair [mailto:thejas.nair@gmail.com]
> Sent: Wednesday, April 08, 2015 2:32 PM
> To: dev
> Subject: Re: Request for feedback on work intent for non-equijoin support
>
> Yes, the theta join paper from Northeastern is a good place to start.
> There is also a presentation from the folks on YouTube, which is also very useful.
> I had a look at this issue as well earlier, and I had written up a rough proposal.  I had not organized the document well enough for sharing publicly, but in case you find it useful, I have attached it to wiki - https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
> It also includes a list of some of the changes that are needed (it is probably not comprehensive enough).
>
>
> On Wed, Apr 8, 2015 at 5:49 AM,  <An...@parc.com> wrote:
>> So, I'd like to get started on this. The description in the design doc and the theta join paper from Northeastern seem like a good place to start, to have a baseline that I can later use for the more specific join algorithms I want to try.
>>
>> I created a JIRA account, and my username is Andres.Quiroz
>>
>> Brock, since I'm completely new to this code, could you (or anyone else) please point me to the relevant modules to start learning and ramping up? Also, please let me know if I can contact you directly for discussing this specific topic, or if I should always send a message to the mailing list.
>>
>> Thank you,
>>
>> Andrés
>>
>> -----Original Message-----
>> From: Andres.Quiroz@parc.com [mailto:Andres.Quiroz@parc.com]
>> Sent: Thursday, April 02, 2015 9:07 AM
>> To: dev@hive.apache.org
>> Subject: RE: Request for feedback on work intent for non-equijoin
>> support
>>
>> This is a great pointer, Szehon and Brock, thank you. I will catch up with the material on theta joins and circle back.
>>
>> Andrés
>>
>> -----Original Message-----
>> From: Brock Noland [mailto:brock@apache.org]
>> Sent: Thursday, April 02, 2015 1:31 AM
>> To: dev@hive.apache.org
>> Subject: Re: Request for feedback on work intent for non-equijoin
>> support
>>
>> Nice, it'd be great if someone finally implemented this :)
>>
>> On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sz...@cloudera.com> wrote:
>>> From Hive side, there has been some thought on the subject here:
>>> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has
>>> some ideas but nobody has gotten around to giving it a try.  It might
>>> be of interest.
>>>
>>> Thanks
>>> Szehon
>>>
>>>
>>> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz
>>> <le...@gmail.com>
>>> wrote:
>>>
>>>> D'oh!  Thanks Chao.
>>>>
>>>> -- Lefty
>>>>
>>>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <ch...@cloudera.com> wrote:
>>>>
>>>> > Hey Lefty,
>>>> >
>>>> > You need to use the ftp protocol, not http.
>>>> > After clicking the link, you'll need to remove "http://" from the
>>>> address
>>>> > bar.
>>>> >
>>>> > Best,
>>>> > Chao
>>>> >
>>>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz
>>>> > <le...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > Andrés, I followed that link and got the dread 404 Not Found:
>>>> > >
>>>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>> > > was not found on this server."
>>>> > >
>>>> > > -- Lefty
>>>> > >
>>>> > > On Wed, Apr 1, 2015 at 7:23 PM, <An...@parc.com> wrote:
>>>> > >
>>>> > > > Dear Lefty,
>>>> > > >
>>>> > > > Thank you very much for pointing that out and for your initial
>>>> > pointers.
>>>> > > > Here is the missing link:
>>>> > > >
>>>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>>>> > > >
>>>> > > > Regards,
>>>> > > >
>>>> > > > Andrés
>>>> > > >
>>>> > > > -----Original Message-----
>>>> > > > From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
>>>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>>>> > > > To: dev@hive.apache.org
>>>> > > > Subject: Re: Request for feedback on work intent for
>>>> > > > non-equijoin
>>>> > support
>>>> > > >
>>>> > > > Hello Andres, the link to your paper is missing:
>>>> > > >
>>>> > > > In our preliminary work, which you can find here (pointer to
>>>> > > > the
>>>> paper)
>>>> > > ...
>>>> > > >
>>>> > > >
>>>> > > > You can find general information about contributing to Hive in
>>>> > > > the wiki: Resources for Contributors
>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>>>> > > > How to Contribute
>>>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>>>> > > >
>>>> > > > -- Lefty
>>>> > > >
>>>> > > > On Tue, Mar 31, 2015 at 10:42 PM, <An...@parc.com> wrote:
>>>> > > >
>>>> > > > >  Dear Hive development community members,
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > I am interested in learning more about the current support
>>>> > > > > for non-equijoins in Hive and/or other Hadoop SQL engines,
>>>> > > > > and in
>>>> getting
>>>> > > > > feedback about community interest in more extensive support
>>>> > > > > for
>>>> such
>>>> > a
>>>> > > > > feature. I intend to work on this challenge, assuming people
>>>> > > > > find
>>>> it
>>>> > > > > compelling, and I intend to contribute results to the community.
>>>> > Where
>>>> > > > > possible, it would be great to receive feedback and engage
>>>> > > > > in collaborations along the way (for a bit more context, see
>>>> > > > > the postscript of this message).
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > My initial goal is to support query conditions such as the
>>>> following:
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > A.x < B.y
>>>> > > > >
>>>> > > > > A.x in_range [B.y, B.z]
>>>> > > > >
>>>> > > > > distance(A.x, B.y) < D
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > where A and B are distinct tables/files. It is my
>>>> > > > > understanding
>>>> that
>>>> > > > > current support for performing non-equijoins like those
>>>> > > > > above is
>>>> > quite
>>>> > > > > limited, and where some forms are supported (like in
>>>> > > > > Cloudera's Impala), this support is based on doing a
>>>> > > > > potentially expensive
>>>> cross
>>>> > > > product join.
>>>> > > > > Depending on the data types involved, I believe that joins
>>>> > > > > with
>>>> these
>>>> > > > > conditions can be made to be tractable (at least on the
>>>> > > > > average)
>>>> with
>>>> > > > > join algorithms that exploit properties of the data types,
>>>> > > > > possibly with some pre-scanning of the data.
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > I am asking for feedback on the interest & need in the
>>>> > > > > community
>>>> for
>>>> > > > > this work, as well as any pointers to similar work. In
>>>> > > > > particular,
>>>> I
>>>> > > > > would appreciate any answers people could give on the
>>>> > > > > following
>>>> > > > questions:
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > - Is my understanding of the state of the art in Hive and
>>>> > > > > similar tools accurate? Are there groups currently working
>>>> > > > > on similar or related issues, or tools that already
>>>> > > > > accomplish some or all of
>>>> what
>>>> > I
>>>> > > > have proposed?
>>>> > > > >
>>>> > > > > - Is there significant value to the community in the support
>>>> > > > > of
>>>> such
>>>> > a
>>>> > > > > feature? In other words, are the manual workarounds
>>>> > > > > necessary
>>>> because
>>>> > > > > of the absence of non-equijoins such as these enough of a
>>>> > > > > pain to justify the work I propose?
>>>> > > > >
>>>> > > > > - Being aware that the potential pre-scanning adds to the
>>>> > > > > cost of
>>>> the
>>>> > > > > join, and that data could still blow-up in the worst case,
>>>> > > > > am I missing any other important considerations and
>>>> > > > > tradeoffs for this
>>>> > > > problem?
>>>> > > > >
>>>> > > > > - What would be a good avenue to contribute this feature to
>>>> > > > > the community (e.g. as a standalone tool on top of Hadoop,
>>>> > > > > or as a Hive extension or plugin)?
>>>> > > > >
>>>> > > > > - What is the best way to get started in working with the
>>>> community?
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > Thanks for your attention and any info you can provide!
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > Andres Quiroz
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > P.S. If you are interested in some context, and why/how I am
>>>> > > > > proposing to do this work, please read on.
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > I am part of a small project team at PARC working on the
>>>> > > > > general problems of data integration and automated ETL. We
>>>> > > > > have proposed a tool called HiperFuse that is designed to
>>>> > > > > accept declarative, high-level queries in order to produce
>>>> > > > > joined (fused) data sets from multiple heterogeneous raw data
>>>> > > > > sources. In our preliminary work, which you can find here
>>>> > > > > (pointer to the paper), we designed the architecture of the
>>>> > > > > tool and obtained some results separately on the problems of
>>>> > > > > automated data cleansing, data type inference, and query
>>>> > > > > planning. One of the planned prototype implementations of
>>>> > > > > HiperFuse relies on Hadoop MR, and because the declarative
>>>> > > > > language we proposed was closely related to SQL, we thought
>>>> > > > > that we could exploit the existing work in Hive and/or other
>>>> > > > > open-source tools for handling the SQL part and layer our
>>>> > > > > work on top of that. For example, the query given in the
>>>> > > > > paper could easily be expressed in SQL-like form with a
>>>> > > > > non-equijoin condition:
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > SELECT web_access_log.ip, census.income
>>>> > > > >
>>>> > > > > FROM web_access_log, ip2zip, census
>>>> > > > >
>>>> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low,
>>>> > > > > ip2zip.ip_high]
>>>> > > > >
>>>> > > > > AND ip2zip.zip = census.zip
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > As you can see, the first impasse that we hit in order to
>>>> > > > > bring the elements together to solve this query end-to-end
>>>> > > > > was the realization and performance of the non-equality join
>>>> > > > > in the query. The intent now is to tackle this problem in a
>>>> > > > > general sense and provide a solution for a wide range of
>>>> > > > > queries.
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > The work I propose to do would be based on three main
>>>> > > > > components within HiperFuse:
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > > > - Enhancements to the extensible data type framework in
>>>> > > > > HiperFuse that would categorize data types based on the
>>>> > > > > properties needed to support the join algorithms, in order
>>>> > > > > to write join-ready domain-specific data type libraries.
>>>> > > > >
>>>> > > > > - The join algorithms themselves, based on Hive or directly
>>>> > > > > on Hadoop MR.
>>>> > > > >
>>>> > > > > - A query planner, which would determine the right algorithm
>>>> > > > > to apply and automatically schedule any necessary
>>>> > > > > pre-scanning of the data.
>>>> > > > >
>>>> > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Best,
>>>> > Chao
>>>> >
>>>>
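
For readers following the quoted example, the in_range condition in the query above (web_access_log.ip within [ip2zip.ip_low, ip2zip.ip_high]) does not necessarily require a cross product. The following is only a minimal single-machine sketch, assuming the ranges in ip2zip never overlap; it is not how HiperFuse or Hive evaluates the query. The function names range_join and ip, and all sample rows, are made up for illustration.

```python
from bisect import bisect_right
from ipaddress import ip_address


def ip(text):
    """Convert a dotted IPv4 string to an integer so it can be compared."""
    return int(ip_address(text))


def range_join(points, ranges):
    """Yield (point, payload) for every point that falls inside a range.

    `ranges` is an iterable of (low, high, payload) tuples with inclusive
    bounds. ASSUMPTION: the ranges do not overlap, so at most one range
    can contain any given point.
    """
    ordered = sorted(ranges, key=lambda r: r[0])
    lows = [r[0] for r in ordered]
    for p in points:
        # Index of the last range whose lower bound is <= p.
        i = bisect_right(lows, p) - 1
        if i >= 0 and p <= ordered[i][1]:
            yield p, ordered[i][2]


# Hypothetical sample data standing in for ip2zip and web_access_log.
ip2zip = [
    (ip("10.0.0.0"), ip("10.0.0.255"), "94304"),
    (ip("10.0.1.0"), ip("10.0.1.255"), "10001"),
]
web_access_log = [ip("10.0.1.7"), ip("10.0.0.42"), ip("192.168.0.1")]

matches = list(range_join(web_access_log, ip2zip))
zips = [zip_code for _, zip_code in matches]
```

Sorting the ranges once and doing a binary search per log entry costs O((n + m) log m) comparisons instead of the n * m of a cross product. The remaining equijoin on zip can then be handled by any ordinary join.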

RE: Request for feedback on work intent for non-equijoin support

Posted by An...@parc.com.
This is certainly very helpful, thank you. Do you have any cycles to devote to this issue at the moment, or in the near future?


Re: Request for feedback on work intent for non-equijoin support

Posted by Thejas Nair <th...@gmail.com>.
Yes, the theta join paper from Northeastern is a good place to start.
There is also a presentation from those folks on YouTube, which is
very useful.
I looked at this issue earlier as well and wrote up a rough proposal.
I had not organized the document well enough to share it publicly,
but in case you find it useful, I have attached it to the
wiki - https://cwiki.apache.org/confluence/download/attachments/27362075/theta%20join%20proposal%20-%20thejas.pdf?version=1&modificationDate=1428517702954&api=v2
It also includes a list of some of the changes that are needed (the
list is probably not comprehensive).
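
As background for readers new to theta joins, one generic way to parallelize a join with an arbitrary condition is to tile the cross-product matrix into a grid of cells and send each record to every cell in one randomly chosen row or column. The sketch below simulates that idea in a single process. It is an illustration under stated assumptions and is not taken from the paper or the proposal mentioned above; the function name theta_join and its parameters grid_rows, grid_cols and seed are hypothetical.

```python
import random
from collections import defaultdict


def theta_join(left, right, theta, grid_rows=2, grid_cols=2, seed=0):
    """Join `left` and `right` on an arbitrary predicate `theta(a, b)`.

    Each grid cell plays the role of one reducer. A left record goes to
    all cells of one random row, a right record to all cells of one
    random column, so every (a, b) pair meets in exactly one cell.
    """
    rng = random.Random(seed)
    cells = defaultdict(lambda: ([], []))

    # "Map" phase: replicate each record along its row or column.
    for a in left:
        row = rng.randrange(grid_rows)
        for col in range(grid_cols):
            cells[(row, col)][0].append(a)
    for b in right:
        col = rng.randrange(grid_cols)
        for row in range(grid_rows):
            cells[(row, col)][1].append(b)

    # "Reduce" phase: each cell evaluates the predicate on its local pairs.
    result = []
    for lefts, rights in cells.values():
        result.extend((a, b) for a in lefts for b in rights if theta(a, b))
    return result


# Example with the condition A.x < B.y on made-up values.
A = [1, 5, 9]
B = [2, 6]
result = sorted(theta_join(A, B, lambda x, y: x < y))
```

The total comparison work is still the full cross product, but it is spread evenly across cells regardless of data skew. That is why this kind of scheme serves as a baseline, and why condition-specific algorithms that prune pairs can do better.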



RE: Request for feedback on work intent for non-equijoin support

Posted by An...@parc.com.
So, I'd like to get started on this. The description in the design doc and the theta join paper from Northeastern seem like a good place to start, since they would give me a baseline that I can later use for the more specific join algorithms I want to try.

I created a JIRA account, and my username is Andres.Quiroz.

Brock, since I'm completely new to this code, could you (or anyone else) please point me to the relevant modules to start learning and ramping up? Also, please let me know if I can contact you directly to discuss this specific topic, or if I should always send a message to the mailing list.

Thank you,

Andrés

-----Original Message-----
From: Andres.Quiroz@parc.com [mailto:Andres.Quiroz@parc.com] 
Sent: Thursday, April 02, 2015 9:07 AM
To: dev@hive.apache.org
Subject: RE: Request for feedback on work intent for non-equijoin support

This is a great pointer, Szehon and Brock, thank you. I will catch up with the material on theta joins and circle back.

Andrés

-----Original Message-----
From: Brock Noland [mailto:brock@apache.org]
Sent: Thursday, April 02, 2015 1:31 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for non-equijoin support

Nice, it'd be great if someone finally implemented this :)

On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sz...@cloudera.com> wrote:
> From Hive side, there has been some thought on the subject here:
> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has 
> some ideas but nobody has gotten around to giving it a try.  It might 
> be of interest.
>
> Thanks
> Szehon
>
>
> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz 
> <le...@gmail.com>
> wrote:
>
>> D'oh!  Thanks Chao.
>>
>> -- Lefty
>>
>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <ch...@cloudera.com> wrote:
>>
>> > Hey Lefty,
>> >
>> > You need to use the ftp protocol, not http.
>> > After clicking the link, you'll need to remove "http://" from the
>> > address bar.
>> >
>> > Best,
>> > Chao
>> >
>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz 
>> > <le...@gmail.com>
>> > wrote:
>> >
>> > > Andrés, I followed that link and got the dread 404 Not Found:
>> > >
>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > was not found on this server."
>> > >
>> > > -- Lefty
>> > >
>> > > On Wed, Apr 1, 2015 at 7:23 PM, <An...@parc.com> wrote:
>> > >
>> > > > Dear Lefty,
>> > > >
>> > > > Thank you very much for pointing that out and for your initial
>> > > > pointers. Here is the missing link:
>> > > >
>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > >
>> > > > Regards,
>> > > >
>> > > > Andrés
>> > > >
>> > > > -----Original Message-----
>> > > > From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>> > > > To: dev@hive.apache.org
>> > > > Subject: Re: Request for feedback on work intent for non-equijoin support
>> > > >
>> > > > Hello Andres, the link to your paper is missing:
>> > > >
>> > > > In our preliminary work, which you can find here (pointer to the paper) ...
>> > > >
>> > > >
>> > > > You can find general information about contributing to Hive in the wiki:
>> > > > Resources for Contributors
>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>> > > > How to Contribute
>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>> > > >
>> > > > -- Lefty
>> > > >
>> > > > On Tue, Mar 31, 2015 at 10:42 PM, <An...@parc.com> wrote:
>> > > >
>> > > > >  Dear Hive development community members,
>> > > > >
>> > > > >
>> > > > >
>> > > > > I am interested in learning more about the current support
>> > > > > for non-equijoins in Hive and/or other Hadoop SQL engines,
>> > > > > and in getting feedback about community interest in more
>> > > > > extensive support for such a feature. I intend to work on
>> > > > > this challenge, assuming people find it compelling, and I
>> > > > > intend to contribute results to the community. Where
>> > > > > possible, it would be great to receive feedback and engage in
>> > > > > collaborations along the way (for a bit more context, see the
>> > > > > postscript of this message).
>> > > > >
>> > > > >
>> > > > >
>> > > > > My initial goal is to support query conditions such as the
>> > > > > following:
>> > > > >
>> > > > >
>> > > > >
>> > > > > A.x < B.y
>> > > > >
>> > > > > A.x in_range [B.y, B.z]
>> > > > >
>> > > > > distance(A.x, B.y) < D
>> > > > >
>> > > > >
>> > > > >
>> > > > > where A and B are distinct tables/files. It is my
>> > > > > understanding that current support for performing
>> > > > > non-equijoins like those above is quite limited, and where
>> > > > > some forms are supported (like in Cloudera's Impala), this
>> > > > > support is based on doing a potentially expensive cross
>> > > > > product join. Depending on the data types involved, I believe
>> > > > > that joins with these conditions can be made to be tractable
>> > > > > (at least on the average) with join algorithms that exploit
>> > > > > properties of the data types, possibly with some pre-scanning
>> > > > > of the data.
>> > > > >
>> > > > >
>> > > > >
>> > > > > I am asking for feedback on the interest & need in the
>> > > > > community for this work, as well as any pointers to similar
>> > > > > work. In particular, I would appreciate any answers people
>> > > > > could give on the following questions:
>> > > > >
>> > > > >
>> > > > >
>> > > > > - Is my understanding of the state of the art in Hive and
>> > > > > similar tools accurate? Are there groups currently working on
>> > > > > similar or related issues, or tools that already accomplish
>> > > > > some or all of what I have proposed?
>> > > > >
>> > > > > - Is there significant value to the community in the support
>> > > > > of such a feature? In other words, are the manual workarounds
>> > > > > necessary because of the absence of non-equijoins such as
>> > > > > these enough of a pain to justify the work I propose?
>> > > > >
>> > > > > - Being aware that the potential pre-scanning adds to the
>> > > > > cost of the join, and that data could still blow-up in the
>> > > > > worst case, am I missing any other important considerations
>> > > > > and tradeoffs for this problem?
>> > > > >
>> > > > > - What would be a good avenue to contribute this feature to
>> > > > > the community (e.g. as a standalone tool on top of Hadoop, or
>> > > > > as a Hive extension or plugin)?
>> > > > >
>> > > > > - What is the best way to get started in working with the
>> > > > > community?
>> > > > >
>> > > > >
>> > > > >
>> > > > > Thanks for your attention and any info you can provide!
>> > > > >
>> > > > >
>> > > > >
>> > > > > Andres Quiroz
>> > > > >
>> > > > >
>> > > > >
>> > > > > P.S. If you are interested in some context, and why/how I am
>> > proposing
>> > > > > to do this work, please read on.
>> > > > >
>> > > > >
>> > > > >
>> > > > > I am part of a small project team at PARC working on the 
>> > > > > general problems of data integration and automated ETL. We 
>> > > > > have proposed a tool called HiperFuse that is designed to 
>> > > > > accept declarative, high-level queries in order to produce 
>> > > > > joined (fused) data sets
>> from
>> > > > > multiple heterogeneous raw data sources. In our preliminary 
>> > > > > work, which you can find here (pointer to the paper), we 
>> > > > > designed the architecture of the tool and obtained some 
>> > > > > results separately on
>> the
>> > > > > problems of automated data cleansing, data type inference, 
>> > > > > and
>> query
>> > > > > planning. One of the planned prototype implementations of 
>> > > > > HiperFuse relies on Hadoop MR, and because the declarative 
>> > > > > language we
>> proposed
>> > > > > was closely related to SQL, we thought that we could exploit 
>> > > > > the existing work in Hive and/or other open-source tools for 
>> > > > > handling
>> the
>> > > > > SQL part and layer our work on top of that. For example, the 
>> > > > > query given in the paper could easily be expressed in 
>> > > > > SQL-like form with
>> a
>> > > > > non-equijoin
>> > > > > condition:
>> > > > >
>> > > > >
>> > > > >
>> > > > > SELECT web_access_log.ip, census.income
>> > > > >
>> > > > > FROM web_access_log, ip2zip, census
>> > > > >
>> > > > > WHERE web_access_log.ip in_range [ip2zip.ip_low, 
>> > > > > ip2zip.ip_high]
>> > > > >
>> > > > > AND ip2zip.zip = census.zip
>> > > > >
>> > > > >
>> > > > >
>> > > > > As you can see, the first impasse that we hit in order to 
>> > > > > bring the elements together to solve this query end-to-end 
>> > > > > was the
>> realization
>> > > > > and performance of the non-equality join in the query. The 
>> > > > > intent
>> now
>> > > > > is to tackle this problem in a general sense and provide a 
>> > > > > solution for a wide range of queries.
>> > > > >
>> > > > >
>> > > > >
>> > > > > The work I propose to do would be based on three main 
>> > > > > components within
>> > > > > HiperFuse:
>> > > > >
>> > > > >
>> > > > >
>> > > > > - Enhancements to the extensible data type framework in 
>> > > > > HiperFuse
>> > that
>> > > > > would categorize data types based on the properties needed to
>> support
>> > > > > the join algorithms, in order to write join-ready 
>> > > > > domain-specific
>> > data
>> > > > > type libraries.
>> > > > >
>> > > > > - The join algorithms themselves, based on Hive or directly 
>> > > > > on
>> Hadoop
>> > > MR.
>> > > > >
>> > > > > - A query planner, which would determine the right algorithm 
>> > > > > to
>> apply
>> > > > > and automatically schedule any necessary pre-scanning of the data.
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Best,
>> > Chao
>> >
>>

RE: Request for feedback on work intent for non-equijoin support

Posted by An...@parc.com.
This is a great pointer, Szehon and Brock, thank you. I will catch up with the material on theta joins and circle back.

Andrés

-----Original Message-----
From: Brock Noland [mailto:brock@apache.org] 
Sent: Thursday, April 02, 2015 1:31 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for non-equijoin support

Nice, it'd be great if someone finally implemented this :)

On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sz...@cloudera.com> wrote:
> From Hive side, there has been some thought on the subject here:
> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has 
> some ideas but nobody has gotten around to giving it a try.  It might 
> be of interest.
>
> Thanks
> Szehon
>
>
> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz 
> <le...@gmail.com>
> wrote:
>
>> D'oh!  Thanks Chao.
>>
>> -- Lefty
>>
>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <ch...@cloudera.com> wrote:
>>
>> > Hey Lefty,
>> >
>> > You need to use the ftp protocol, not http.
>> > After clicking the link, you'll need to remove "http://" from the
>> > address bar.
>> >
>> > Best,
>> > Chao
>> >
>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz 
>> > <le...@gmail.com>
>> > wrote:
>> >
>> > > Andrés, I followed that link and got the dread 404 Not Found:
>> > >
>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf 
>> > > was not found on this server."
>> > >
>> > > -- Lefty
>> > >
>> > > On Wed, Apr 1, 2015 at 7:23 PM, <An...@parc.com> wrote:
>> > >
>> > > > Dear Lefty,
>> > > >
>> > > > Thank you very much for pointing that out and for your initial
>> > > > pointers. Here is the missing link:
>> > > >
>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > >
>> > > > Regards,
>> > > >
>> > > > Andrés
>> > > >
>> > > > -----Original Message-----
>> > > > From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>> > > > To: dev@hive.apache.org
>> > > > Subject: Re: Request for feedback on work intent for non-equijoin support
>> > > >
>> > > > Hello Andres, the link to your paper is missing:
>> > > >
>> > > > In our preliminary work, which you can find here (pointer to the paper) ...
>> > > >
>> > > >
>> > > > You can find general information about contributing to Hive in the wiki:
>> > > > Resources for Contributors
>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>> > > > How to Contribute
>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>> > > >
>> > > > -- Lefty
>> > > >
>> > --
>> > Best,
>> > Chao
>> >
>>

Re: Request for feedback on work intent for non-equijoin support

Posted by Brock Noland <br...@apache.org>.
Nice, it'd be great if someone finally implemented this :)

On Wed, Apr 1, 2015 at 10:10 PM, Szehon Ho <sz...@cloudera.com> wrote:
> From Hive side, there has been some thought on the subject here:
> https://cwiki.apache.org/confluence/display/Hive/Theta+Join, it has some
> ideas but nobody has gotten around to giving it a try.  It might be of
> interest.
>
> Thanks
> Szehon
>
>
> On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz <le...@gmail.com>
> wrote:
>
>> D'oh!  Thanks Chao.
>>
>> -- Lefty
>>
>> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <ch...@cloudera.com> wrote:
>>
>> > Hey Lefty,
>> >
>> > You need to use the ftp protocol, not http.
>> > After clicking the link, you'll need to remove "http://" from the
>> > address bar.
>> >
>> > Best,
>> > Chao
>> >
>> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz <le...@gmail.com>
>> > wrote:
>> >
>> > > Andrés, I followed that link and got the dread 404 Not Found:
>> > >
>> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
>> > > found on this server."
>> > >
>> > > -- Lefty
>> > >
>> > > On Wed, Apr 1, 2015 at 7:23 PM, <An...@parc.com> wrote:
>> > >
>> > > > Dear Lefty,
>> > > >
>> > > > Thank you very much for pointing that out and for your initial
>> > > > pointers. Here is the missing link:
>> > > >
>> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
>> > > >
>> > > > Regards,
>> > > >
>> > > > Andrés
>> > > >
>> > > > -----Original Message-----
>> > > > From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
>> > > > Sent: Wednesday, April 01, 2015 12:48 AM
>> > > > To: dev@hive.apache.org
>> > > > Subject: Re: Request for feedback on work intent for non-equijoin support
>> > > >
>> > > > Hello Andres, the link to your paper is missing:
>> > > >
>> > > > In our preliminary work, which you can find here (pointer to the paper) ...
>> > > >
>> > > >
>> > > > You can find general information about contributing to Hive in the wiki:
>> > > > Resources for Contributors
>> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
>> > > > How to Contribute
>> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
>> > > >
>> > > > -- Lefty
>> > > >
>> > --
>> > Best,
>> > Chao
>> >
>>

Re: Request for feedback on work intent for non-equijoin support

Posted by Szehon Ho <sz...@cloudera.com>.
From the Hive side, there has been some thought on the subject here:
https://cwiki.apache.org/confluence/display/Hive/Theta+Join. It has some
ideas, but nobody has gotten around to giving it a try.  It might be of
interest.

Thanks
Szehon
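
For readers new to the term: a theta join is a join on an arbitrary
comparison rather than on equality. As a rough illustration only (this is
not taken from the Theta Join wiki page, from Hive, or from HiperFuse), the
range condition in the original message, A.x in_range [B.y, B.z], can avoid
a full cross product when the intervals do not overlap: sort the intervals
once, then binary-search each probe value. The function name range_join,
the tuple layouts, and the non-overlap assumption are all hypothetical
choices made for this sketch.

```python
from bisect import bisect_right


def range_join(points, intervals):
    """Match each (key, payload) point to the (low, high, payload) interval containing key.

    Assumption (not from the thread): intervals are closed and do not overlap,
    as one would expect of an IP-range-to-zip lookup table.
    """
    # "Pre-scan" step: sort intervals by lower bound once.
    ordered = sorted(intervals, key=lambda iv: iv[0])
    lows = [iv[0] for iv in ordered]

    joined = []
    for key, left_payload in points:
        # Index of the last interval whose lower bound is <= key.
        i = bisect_right(lows, key) - 1
        if i < 0:
            continue  # key is below every interval
        low, high, right_payload = ordered[i]
        if key <= high:
            joined.append((key, left_payload, right_payload))
    return joined


# Toy stand-ins for web_access_log and ip2zip, with IPs as plain integers.
ip2zip = [(200, 299, "10001"), (100, 199, "94304")]
web_access_log = [(150, "GET /"), (250, "GET /a"), (350, "GET /b")]
print(range_join(web_access_log, ip2zip))
# [(150, 'GET /', '94304'), (250, 'GET /a', '10001')]
```

With n points and m intervals this costs roughly O((n + m) log m) instead
of the O(n * m) of a cross product. Overlapping intervals, or a predicate
such as distance(A.x, B.y) < D, would need a different structure (an
interval tree or a spatial index, for example).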


On Wed, Apr 1, 2015 at 10:05 PM, Lefty Leverenz <le...@gmail.com>
wrote:

> D'oh!  Thanks Chao.
>
> -- Lefty
>
> On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <ch...@cloudera.com> wrote:
>
> > Hey Lefty,
> >
> > You need to use the ftp protocol, not http.
> > After clicking the link, you'll need to remove "http://" from the
> > address bar.
> >
> > Best,
> > Chao
> >
> > On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz <le...@gmail.com>
> > wrote:
> >
> > > Andrés, I followed that link and got the dread 404 Not Found:
> > >
> > > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
> > > found on this server."
> > >
> > > -- Lefty
> > >
> > > On Wed, Apr 1, 2015 at 7:23 PM, <An...@parc.com> wrote:
> > >
> > > > Dear Lefty,
> > > >
> > > > Thank you very much for pointing that out and for your initial
> > > > pointers. Here is the missing link:
> > > >
> > > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
> > > >
> > > > Regards,
> > > >
> > > > Andrés
> > > >
> > > > -----Original Message-----
> > > > From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
> > > > Sent: Wednesday, April 01, 2015 12:48 AM
> > > > To: dev@hive.apache.org
> > > > Subject: Re: Request for feedback on work intent for non-equijoin support
> > > >
> > > > Hello Andres, the link to your paper is missing:
> > > >
> > > > In our preliminary work, which you can find here (pointer to the paper) ...
> > > >
> > > >
> > > > You can find general information about contributing to Hive in the wiki:
> > > > Resources for Contributors
> > > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
> > > > How to Contribute
> > > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
> > > >
> > > > -- Lefty
> > > >
> > --
> > Best,
> > Chao
> >
>

Re: Request for feedback on work intent for non-equijoin support

Posted by Lefty Leverenz <le...@gmail.com>.
D'oh!  Thanks Chao.

-- Lefty

On Thu, Apr 2, 2015 at 12:59 AM, Chao Sun <ch...@cloudera.com> wrote:

> Hey Lefty,
>
> You need to use the ftp protocol, not http.
> After clicking the link, you'll need to remove "http://" from the address
> bar.
>
> Best,
> Chao
>
> On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz <le...@gmail.com>
> wrote:
>
> > Andrés, I followed that link and got the dread 404 Not Found:
> >
> > "The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
> > found on this server."
> >
> > -- Lefty
> >
> > On Wed, Apr 1, 2015 at 7:23 PM, <An...@parc.com> wrote:
> >
> > > Dear Lefty,
> > >
> > > Thank you very much for pointing that out and for your initial
> > > pointers. Here is the missing link:
> > >
> > > ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf
> > >
> > > Regards,
> > >
> > > Andrés
> > >
> > > -----Original Message-----
> > > From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]
> > > Sent: Wednesday, April 01, 2015 12:48 AM
> > > To: dev@hive.apache.org
> > > Subject: Re: Request for feedback on work intent for non-equijoin support
> > >
> > > Hello Andres, the link to your paper is missing:
> > >
> > > In our preliminary work, which you can find here (pointer to the paper) ...
> > >
> > >
> > > You can find general information about contributing to Hive in the wiki:
> > > Resources for Contributors
> > > <https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
> > > How to Contribute
> > > <https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.
> > >
> > > -- Lefty
> > >
> > > On Tue, Mar 31, 2015 at 10:42 PM, <An...@parc.com> wrote:
> > >
> > > > Dear Hive development community members,
> > > >
> > > > I am interested in learning more about the current support for
> > > > non-equijoins in Hive and/or other Hadoop SQL engines, and in getting
> > > > feedback about community interest in more extensive support for such
> > > > a feature. I intend to work on this challenge, assuming people find
> > > > it compelling, and I intend to contribute results to the community.
> > > > Where possible, it would be great to receive feedback and engage in
> > > > collaborations along the way (for a bit more context, see the
> > > > postscript of this message).
> > > >
> > > > My initial goal is to support query conditions such as the following:
> > > >
> > > > A.x < B.y
> > > >
> > > > A.x in_range [B.y, B.z]
> > > >
> > > > distance(A.x, B.y) < D
> > > >
> > > > where A and B are distinct tables/files. It is my understanding that
> > > > current support for performing non-equijoins like those above is
> > > > quite limited, and where some forms are supported (like in Cloudera's
> > > > Impala), this support is based on doing a potentially expensive cross
> > > > product join. Depending on the data types involved, I believe that
> > > > joins with these conditions can be made to be tractable (at least on
> > > > the average) with join algorithms that exploit properties of the data
> > > > types, possibly with some pre-scanning of the data.
> > > >
> > > > I am asking for feedback on the interest & need in the community for
> > > > this work, as well as any pointers to similar work. In particular, I
> > > > would appreciate any answers people could give on the following
> > > > questions:
> > > >
> > > > - Is my understanding of the state of the art in Hive and similar
> > > > tools accurate? Are there groups currently working on similar or
> > > > related issues, or tools that already accomplish some or all of what
> > > > I have proposed?
> > > >
> > > > - Is there significant value to the community in the support of such
> > > > a feature? In other words, are the manual workarounds necessary
> > > > because of the absence of non-equijoins such as these enough of a
> > > > pain to justify the work I propose?
> > > >
> > > > - Being aware that the potential pre-scanning adds to the cost of the
> > > > join, and that data could still blow-up in the worst case, am I
> > > > missing any other important considerations and tradeoffs for this
> > > problem?
> > > >
> > > > - What would be a good avenue to contribute this feature to the
> > > > community (e.g. as a standalone tool on top of Hadoop, or as a Hive
> > > > extension or plugin)?
> > > >
> > > > - What is the best way to get started in working with the community?
> > > >
> > > >
> > > >
> > > > Thanks for your attention and any info you can provide!
> > > >
> > > >
> > > >
> > > > Andres Quiroz
> > > >
> > > >
> > > >
> > > > P.S. If you are interested in some context, and why/how I am
> proposing
> > > > to do this work, please read on.
> > > >
> > > >
> > > >
> > > > I am part of a small project team at PARC working on the general
> > > > problems of data integration and automated ETL. We have proposed a
> > > > tool called HiperFuse that is designed to accept declarative,
> > > > high-level queries in order to produce joined (fused) data sets from
> > > > multiple heterogeneous raw data sources. In our preliminary work,
> > > > which you can find here (pointer to the paper), we designed the
> > > > architecture of the tool and obtained some results separately on the
> > > > problems of automated data cleansing, data type inference, and query
> > > > planning. One of the planned prototype implementations of HiperFuse
> > > > relies on Hadoop MR, and because the declarative language we proposed
> > > > was closely related to SQL, we thought that we could exploit the
> > > > existing work in Hive and/or other open-source tools for handling the
> > > > SQL part and layer our work on top of that. For example, the query
> > > > given in the paper could easily be expressed in SQL-like form with a
> > > > non-equijoin
> > > > condition:
> > > >
> > > >
> > > >
> > > > SELECT web_access_log.ip, census.income
> > > >
> > > > FROM web_access_log, ip2zip, census
> > > >
> > > > WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
> > > >
> > > > AND ip2zip.zip = census.zip
> > > >
> > > >
> > > >
> > > > As you can see, the first impasse that we hit in order to bring the
> > > > elements together to solve this query end-to-end was the realization
> > > > and performance of the non-equality join in the query. The intent now
> > > > is to tackle this problem in a general sense and provide a solution
> > > > for a wide range of queries.
> > > >
> > > >
> > > >
> > > > The work I propose to do would be based on three main components
> > > > within
> > > > HiperFuse:
> > > >
> > > >
> > > >
> > > > - Enhancements to the extensible data type framework in HiperFuse
> that
> > > > would categorize data types based on the properties needed to support
> > > > the join algorithms, in order to write join-ready domain-specific
> data
> > > > type libraries.
> > > >
> > > > - The join algorithms themselves, based on Hive or directly on Hadoop
> > MR.
> > > >
> > > > - A query planner, which would determine the right algorithm to apply
> > > > and automatically schedule any necessary pre-scanning of the data.
> > > >
> > > >
> > > >
> > >
> >
>
>
>
> --
> Best,
> Chao
>

Re: Request for feedback on work intent for non-equijoin support

Posted by Chao Sun <ch...@cloudera.com>.
Hey Lefty,

You need to use the ftp protocol, not http.
After clicking the link, you'll need to remove "http://" from the address
bar.

Best,
Chao


Re: Request for feedback on work intent for non-equijoin support

Posted by Lefty Leverenz <le...@gmail.com>.
Andrés, I followed that link and got the dread 404 Not Found:

"The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not
found on this server."

-- Lefty


RE: Request for feedback on work intent for non-equijoin support

Posted by An...@parc.com.
Dear Lefty,

Thank you very much for pointing that out and for your initial pointers. Here is the missing link:

ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf

Regards,

Andrés


Re: Request for feedback on work intent for non-equijoin support

Posted by Lefty Leverenz <le...@gmail.com>.
Hello Andres, the link to your paper is missing:

In our preliminary work, which you can find here (pointer to the paper) ...


You can find general information about contributing to Hive in the wiki:
Resources for Contributors
<https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>,
How to Contribute
<https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.

-- Lefty
