Posted to dev@asterixdb.apache.org by Wail Alkowaileet <wa...@gmail.com> on 2016/09/02 07:20:37 UTC

Indexing non-ADM data.

Hi Dev,

Over the last year or so I have become more involved in AsterixDB. However,
I'm 90% user and 10% developer (due to the nature of my work). I want to share
some of my (and my colleagues') experience with ADM, though some of this may
be obvious.

One of the challenges we face most of the time is indexing non-ADM data.
Most of our data is in either JSON or CSV format, which means that ADM's
richness is not usable.

For instance, when loading, I usually create an External (or Temporary)
Dataset, query/transform it, and then insert the result into my Internal
Dataset. This takes more time than a load because of the flush/merge
operations.

Another challenging case is the TwitterFeed example
<https://ci.apache.org/projects/asterixdb/feeds/tutorial.html>: the
*longitude* and *latitude* fields are not indexable, so I need to ETL into
another dataset to transform (lon,lat) into a point type.

It would be awesome if we could bridge non-ADM data to ADM types.


-- 

*Regards,*
Wail Alkowaileet

Re: Indexing non-ADM data.

Posted by Mike Carey <dt...@gmail.com>.
Wail,

Great inputs/requirements!  We should definitely think about how to 
address these.  One thing that could help with the second item would be 
"functional indexes" - supporting indexing on an expression rather than 
just base data - some systems (e.g., PostgreSQL) support that - not 
rocket science - and that could make data that's convertible to spatial 
data via a function call indexable spatially.  As for the first point - 
I'm not sure I "get it" - are external indexes not good enough?  Oh - 
wait - is the issue that we should offer per-object transformations 
during load?  (E.g., the ability to put a UDF on the load pipeline, like 
we do on the feed pipeline?)
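
To make the "functional index" idea concrete, here is a minimal sketch in
plain Python (not AsterixDB code). It indexes records on the result of an
expression instead of on a stored field. Everything in it is an assumption
made for illustration: the names to_point and ExpressionIndex, the field
names "longitude" and "latitude", and the simple grid used in place of a
real spatial index such as an R-tree.

```python
import math
from collections import defaultdict


def to_point(record):
    """Hypothetical expression: derive an (x, y) point from two raw fields."""
    return (float(record["longitude"]), float(record["latitude"]))


class ExpressionIndex:
    """Toy grid index keyed on expr(record) rather than on a stored field."""

    def __init__(self, expr, cell_size=1.0):
        self.expr = expr            # function applied to every inserted record
        self.cell_size = cell_size  # width/height of one grid cell
        self.cells = defaultdict(list)

    def _cell(self, point):
        x, y = point
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, record):
        # The base record is stored unchanged; only the index sees the point.
        point = self.expr(record)
        self.cells[self._cell(point)].append((point, record))

    def query_box(self, min_x, min_y, max_x, max_y):
        """Return records whose derived point lies inside the box (inclusive)."""
        cx0, cy0 = self._cell((min_x, min_y))
        cx1, cy1 = self._cell((max_x, max_y))
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for (x, y), record in self.cells.get((cx, cy), []):
                    if min_x <= x <= max_x and min_y <= y <= max_y:
                        hits.append(record)
        return hits


# Made-up sample records with plain numeric fields, as they might arrive as JSON.
tweets = [
    {"id": 1, "longitude": -117.84, "latitude": 33.64},
    {"id": 2, "longitude": 46.68, "latitude": 24.71},
]

index = ExpressionIndex(to_point)
for tweet in tweets:
    index.insert(tweet)

hits = index.query_box(-118.0, 33.0, -117.0, 34.0)
print([t["id"] for t in hits])
```

The point of the sketch is only that the conversion runs at index-maintenance
time, so the base data can stay in its original (lon, lat) form with no
separate ETL step.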

Thx!

Mike

