You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@phoenix.apache.org by Rob Roland <ro...@simplymeasured.com> on 2014/05/14 23:30:38 UTC

MAP type?

Hi all,

I have a use-case that would benefit well from a MAP type, similar to how
Hive uses a map. I'd like to avoid using a JOIN, due to the amount of rows
on each side of the JOIN. In Hive's case, you would assign a column family
to the MAP field.

Essentially, I have an entity that I want to add quite a bit of dynamic
data that would be unknown at both initial query time and query time. A
contrived example is below:

I'd like to do something like this:

CREATE TABLE facebook_post (
id VARCHAR PRIMARY KEY,
post_body VARCHAR,
bitly_links MAP<VARCHAR, VARCHAR>
);

SELECT * FROM facebook_post;

As opposed to:

CREATE TABLE facebook_post (
id VARCHAR PRIMARY KEY,
post_body VARCHAR
);

CREATE TABLE facebook_post_links (
facebook_post_id VARCHAR,
bitly_id VARCHAR,
expanded_link VARCHAR,
CONSTRAINT pk PRIMARY KEY (facebook_post_id, bitly_id)
);

SELECT * FROM facebook_post AS fp
LEFT OUTER JOIN facebook_post_links AS fpl ON (fpl.facebook_post_id = fp.id
);

I realize that a MAP type is outside of the SQL standard, but it would be
excellent in my use-case. I could accomplish this with an ARRAY type, but
the expansion of bit.ly links happens as a post-processing step, and would
probably involve locking, as appending to an ARRAY has to happen
client-side.

I'd be willing to take a shot at making a MAP type, but I would want to
make sure this change would be welcomed into mainline Phoenix before
embarking on something like this.

Thanks,

Rob Roland

Re: MAP type?

Posted by James Taylor <ja...@apache.org>.
I think having JSON support would be great. If you're up for contributing
these set of built-in functions, I'm +1.

I think to get good performance, you'd need support for functional indexes
(PHOENIX-514). This would enable you to create an index like this (assuming
you have a book_store table with a column named json):
    CREATE INDEX json_book_idx ON book_store (json_extract(json,
'$.store.book'))
This would allow a query like this to be executed efficiently:
    SELECT json FROM book_store WHERE json_extract(json, '$.store.book') =
'Huckleberry Finn'
Another improvement would be if we had STRUCT support to optimize the
storage such that all the superfluous tag information isn't stored over and
over again (PHOENIX-477). We'd want to allow the STRUCT definition to be
supplied at query time (much like our dynamic columns) to handle use cases
where the JSON structure is not known in advance.
Another alternative would be to have a JSON type and optimize the storage
in other (warning: hand wavy) interesting ways (PHOENIX-628).

I think all this would be awesome - each part could be done independently,
in parallel, and phased in. It just needs someone to step up and own it.
Thanks,
James



On Thu, May 15, 2014 at 9:50 AM, Stephen Sprague <sp...@gmail.com> wrote:

> hate to bang on the json drum again but having a json_extract() function
> this or any other non-scalar datatype could be implemented using a varchar
> datatype.
>
> granted the syntax isn't as clean:
>
> column['key'] vs. json_extract(column,'$.key')
>
> but i think it gets the job done.
>
> presto uses json_extract_scalar() and json_extract() among others. cf.
> http://prestodb.io/docs/current/functions/json.html.  This could be used
> as a guide, perhaps.
>
>
>
>
> On Wed, May 14, 2014 at 2:30 PM, Rob Roland <ro...@simplymeasured.com>wrote:
>
>> Hi all,
>>
>> I have a use-case that would benefit well from a MAP type, similar to how
>> Hive uses a map. I'd like to avoid using a JOIN, due to the amount of rows
>> on each side of the JOIN. In Hive's case, you would assign a column family
>> to the MAP field.
>>
>> Essentially, I have an entity that I want to add quite a bit of dynamic
>> data that would be unknown at both initial query time and query time. A
>> contrived example is below:
>>
>> I'd like to do something like this:
>>
>> CREATE TABLE facebook_post (
>> id VARCHAR PRIMARY KEY,
>> post_body VARCHAR,
>>  bitly_links MAP<VARCHAR, VARCHAR>
>> );
>>
>> SELECT * FROM facebook_post;
>>
>> As opposed to:
>>
>> CREATE TABLE facebook_post (
>> id VARCHAR PRIMARY KEY,
>> post_body VARCHAR
>> );
>>
>> CREATE TABLE facebook_post_links (
>> facebook_post_id VARCHAR,
>> bitly_id VARCHAR,
>> expanded_link VARCHAR,
>>  CONSTRAINT pk PRIMARY KEY (facebook_post_id, bitly_id)
>> );
>>
>> SELECT * FROM facebook_post AS fp
>> LEFT OUTER JOIN facebook_post_links AS fpl ON (fpl.facebook_post_id =
>> fp.id);
>>
>> I realize that a MAP type is outside of the SQL standard, but it would be
>> excellent in my use-case. I could accomplish this with an ARRAY type, but
>> the expansion of bit.ly links happens as a post-processing step, and
>> would probably involve locking, as appending to an ARRAY has to happen
>> client-side.
>>
>> I'd be willing to take a shot at making a MAP type, but I would want to
>> make sure this change would be welcomed into mainline Phoenix before
>> embarking on something like this.
>>
>> Thanks,
>>
>> Rob Roland
>>
>
>

Re: MAP type?

Posted by Stephen Sprague <sp...@gmail.com>.
hate to bang on the json drum again but having a json_extract() function
this or any other non-scalar datatype could be implemented using a varchar
datatype.

granted the syntax isn't as clean:

column['key'] vs. json_extract(column,'$.key')

but i think it gets the job done.

presto uses json_extract_scalar() and json_extract() among others. cf.
http://prestodb.io/docs/current/functions/json.html.  This could be used as
a guide, perhaps.




On Wed, May 14, 2014 at 2:30 PM, Rob Roland <ro...@simplymeasured.com> wrote:

> Hi all,
>
> I have a use-case that would benefit well from a MAP type, similar to how
> Hive uses a map. I'd like to avoid using a JOIN, due to the amount of rows
> on each side of the JOIN. In Hive's case, you would assign a column family
> to the MAP field.
>
> Essentially, I have an entity that I want to add quite a bit of dynamic
> data that would be unknown at both initial query time and query time. A
> contrived example is below:
>
> I'd like to do something like this:
>
> CREATE TABLE facebook_post (
> id VARCHAR PRIMARY KEY,
> post_body VARCHAR,
>  bitly_links MAP<VARCHAR, VARCHAR>
> );
>
> SELECT * FROM facebook_post;
>
> As opposed to:
>
> CREATE TABLE facebook_post (
> id VARCHAR PRIMARY KEY,
> post_body VARCHAR
> );
>
> CREATE TABLE facebook_post_links (
> facebook_post_id VARCHAR,
> bitly_id VARCHAR,
> expanded_link VARCHAR,
>  CONSTRAINT pk PRIMARY KEY (facebook_post_id, bitly_id)
> );
>
> SELECT * FROM facebook_post AS fp
> LEFT OUTER JOIN facebook_post_links AS fpl ON (fpl.facebook_post_id =
> fp.id);
>
> I realize that a MAP type is outside of the SQL standard, but it would be
> excellent in my use-case. I could accomplish this with an ARRAY type, but
> the expansion of bit.ly links happens as a post-processing step, and
> would probably involve locking, as appending to an ARRAY has to happen
> client-side.
>
> I'd be willing to take a shot at making a MAP type, but I would want to
> make sure this change would be welcomed into mainline Phoenix before
> embarking on something like this.
>
> Thanks,
>
> Rob Roland
>

Re: MAP type?

Posted by Rob Roland <ro...@simplymeasured.com>.
I think we're going to end up implementing using an ARRAY, but I wanted to
support many concurrent update requests that can add to this array. Bit.ly
is just one example of this use-case in our app. It seems like we'll have
to limit concurrency at the update clients, or use locking, to avoid
last-write wins removing valid data.

Thanks,

Rob


On Thu, May 15, 2014 at 11:02 AM, James Taylor <ja...@apache.org>wrote:

> Hi Rob,
> Would two parallel arrays work (or a single array since it sounds like the
> raw and expanded bitly_links are both VARCHARs)? In Phoenix, an array must
> be submitted from the client in its entirety, so I don't think you'd need
> locking (if you're ok with last-update-wins semantics).
>
> Thanks,
> James
>
>
> On Wednesday, May 14, 2014, Rob Roland <ro...@simplymeasured.com> wrote:
>
>> Hi all,
>>
>> I have a use-case that would benefit well from a MAP type, similar to how
>> Hive uses a map. I'd like to avoid using a JOIN, due to the amount of rows
>> on each side of the JOIN. In Hive's case, you would assign a column family
>> to the MAP field.
>>
>> Essentially, I have an entity that I want to add quite a bit of dynamic
>> data that would be unknown at both initial query time and query time. A
>> contrived example is below:
>>
>> I'd like to do something like this:
>>
>> CREATE TABLE facebook_post (
>> id VARCHAR PRIMARY KEY,
>> post_body VARCHAR,
>>  bitly_links MAP<VARCHAR, VARCHAR>
>> );
>>
>> SELECT * FROM facebook_post;
>>
>> As opposed to:
>>
>> CREATE TABLE facebook_post (
>> id VARCHAR PRIMARY KEY,
>> post_body VARCHAR
>> );
>>
>> CREATE TABLE facebook_post_links (
>> facebook_post_id VARCHAR,
>> bitly_id VARCHAR,
>> expanded_link VARCHAR,
>>  CONSTRAINT pk PRIMARY KEY (facebook_post_id, bitly_id)
>> );
>>
>> SELECT * FROM facebook_post AS fp
>> LEFT OUTER JOIN facebook_post_links AS fpl ON (fpl.facebook_post_id =
>> fp.id);
>>
>> I realize that a MAP type is outside of the SQL standard, but it would be
>> excellent in my use-case. I could accomplish this with an ARRAY type, but
>> the expansion of bit.ly links happens as a post-processing step, and
>> would probably involve locking, as appending to an ARRAY has to happen
>> client-side.
>>
>> I'd be willing to take a shot at making a MAP type, but I would want to
>> make sure this change would be welcomed into mainline Phoenix before
>> embarking on something like this.
>>
>> Thanks,
>>
>> Rob Roland
>>
>

Re: MAP type?

Posted by James Taylor <ja...@apache.org>.
Hi Rob,
Would two parallel arrays work (or a single array since it sounds like the
raw and expanded bitly_links are both VARCHARs)? In Phoenix, an array must
be submitted from the client in its entirety, so I don't think you'd need
locking (if you're ok with last-update-wins semantics).

Thanks,
James

On Wednesday, May 14, 2014, Rob Roland <ro...@simplymeasured.com> wrote:

> Hi all,
>
> I have a use-case that would benefit well from a MAP type, similar to how
> Hive uses a map. I'd like to avoid using a JOIN, due to the amount of rows
> on each side of the JOIN. In Hive's case, you would assign a column family
> to the MAP field.
>
> Essentially, I have an entity that I want to add quite a bit of dynamic
> data that would be unknown at both initial query time and query time. A
> contrived example is below:
>
> I'd like to do something like this:
>
> CREATE TABLE facebook_post (
> id VARCHAR PRIMARY KEY,
> post_body VARCHAR,
>  bitly_links MAP<VARCHAR, VARCHAR>
> );
>
> SELECT * FROM facebook_post;
>
> As opposed to:
>
> CREATE TABLE facebook_post (
> id VARCHAR PRIMARY KEY,
> post_body VARCHAR
> );
>
> CREATE TABLE facebook_post_links (
> facebook_post_id VARCHAR,
> bitly_id VARCHAR,
> expanded_link VARCHAR,
>  CONSTRAINT pk PRIMARY KEY (facebook_post_id, bitly_id)
> );
>
> SELECT * FROM facebook_post AS fp
> LEFT OUTER JOIN facebook_post_links AS fpl ON (fpl.facebook_post_id =
> fp.id);
>
> I realize that a MAP type is outside of the SQL standard, but it would be
> excellent in my use-case. I could accomplish this with an ARRAY type, but
> the expansion of bit.ly links happens as a post-processing step, and
> would probably involve locking, as appending to an ARRAY has to happen
> client-side.
>
> I'd be willing to take a shot at making a MAP type, but I would want to
> make sure this change would be welcomed into mainline Phoenix before
> embarking on something like this.
>
> Thanks,
>
> Rob Roland
>