Posted to dev@asterixdb.apache.org by Wail Alkowaileet <wa...@gmail.com> on 2016/09/02 07:20:37 UTC
Indexing non-ADM data.
Hi Dev,
In the last year or so I have become more involved in AsterixDB. However, I'm
90% user and 10% developer (due to the nature of my work). I want to share
some of my (and my colleagues') experience with ADM, though some of it may
be obvious.
One of the challenges we face most often is indexing non-ADM data.
Most of our data is in either JSON or CSV format, which means that ADM's
richness is not usable.
For instance, when loading data, I usually create an External (or Temporary)
Dataset, query/transform it, and then insert the result into my Internal
Dataset. This takes more time than a plain load because of the flush/merge
operations.
Another challenging case is the TwitterFeed example
<https://ci.apache.org/projects/asterixdb/feeds/tutorial.html>: the
*longitude* and *latitude* fields are not indexable, so I need to ETL into
another dataset to transform (lon, lat) into a point type.
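To make the (lon, lat) conversion concrete, here is a minimal Python sketch of the per-record transformation described above. It is not AsterixDB code: the `Point` type, the `add_location` helper, and the `location` field name are hypothetical, and the sample tweet is made up for illustration. Only the *longitude* and *latitude* field names come from the message.

```python
from typing import Any, Dict, NamedTuple


class Point(NamedTuple):
    """Hypothetical stand-in for a point type: x is longitude, y is latitude."""
    x: float
    y: float


def add_location(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` whose longitude/latitude pair is replaced
    by a single 'location' Point. Records missing either field are copied
    unchanged."""
    lon = record.get("longitude")
    lat = record.get("latitude")
    if lon is None or lat is None:
        return dict(record)
    out = {k: v for k, v in record.items() if k not in ("longitude", "latitude")}
    out["location"] = Point(float(lon), float(lat))
    return out


# Made-up sample record, for illustration only.
tweet = {"id": "1", "longitude": -117.84, "latitude": 33.64}
converted = add_location(tweet)
```

Doing this as a separate ETL pass means every record is written twice, which is the cost the message is pointing at.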
It would be awesome if we could bridge non-ADM types to ADM types.
--
*Regards,*
Wail Alkowaileet
Re: Indexing non-ADM data.
Posted by Mike Carey <dt...@gmail.com>.
Wail,
Great inputs/requirements! We should definitely think about how to
address these. One thing that could help with the second item would be
"functional indexes": supporting indexing on an expression rather than
just on base data. Some systems (e.g., PostgreSQL) support that, and it is
not rocket science. That could make data that's convertible to spatial
data via a function call indexable spatially. As for the first point,
I'm not sure I "get it": are external indexes not good enough? Oh,
wait, is the issue that we should offer per-object transformations
during load? (E.g., the ability to put a UDF on the load pipeline, like
we do on the feed pipeline?)
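As a rough illustration of the "functional index" idea, the Python sketch below indexes records by the value of an expression rather than by a stored field. Everything in it is an assumption made for illustration: the `ExpressionIndex` class, the `grid_cell` expression, and the sample records are hypothetical, and a 1-degree grid bucket stands in for a real spatial index structure. It does not reflect how AsterixDB or PostgreSQL implement this.

```python
import math
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List


def grid_cell(record: Dict[str, Any]) -> tuple:
    """The indexed expression: the 1-degree grid cell that contains the
    record's (longitude, latitude)."""
    return (math.floor(record["longitude"]), math.floor(record["latitude"]))


class ExpressionIndex:
    """Toy index keyed by key_fn(record) instead of by a base field."""

    def __init__(self, key_fn: Callable[[Dict[str, Any]], Hashable]) -> None:
        self._key_fn = key_fn
        self._buckets: Dict[Hashable, List[Dict[str, Any]]] = defaultdict(list)

    def insert(self, record: Dict[str, Any]) -> None:
        # The expression is evaluated once, at insert time.
        self._buckets[self._key_fn(record)].append(record)

    def lookup(self, key: Hashable) -> List[Dict[str, Any]]:
        # .get avoids creating empty buckets for keys that were never inserted.
        return list(self._buckets.get(key, []))


# Made-up sample records, for illustration only.
tweets = [
    {"id": "1", "longitude": -117.84, "latitude": 33.64},
    {"id": "2", "longitude": -73.99, "latitude": 40.73},
]
index = ExpressionIndex(grid_cell)
for t in tweets:
    index.insert(t)

hits = index.lookup((-118, 33))
```

The records keep their original longitude and latitude fields. Only the index sees the derived value, so no second dataset is needed.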
Thx!
Mike