Posted to dev@asterixdb.apache.org by Sandeep Joshi <sa...@gmail.com> on 2016/02/14 08:17:21 UTC

external data set support

Can someone describe the level of support for External data sets and the
future roadmap ?

Let me divide the question into four broad issues:

1) Schema catalog : One would have to implement IMetadataProvider,
IDataSource, IDataSourceIndex and other related classes.  Is there any
functionality missing from the current schema implementation for external
data sets ?

One of the papers says that one should add comparators and hash functions
for any new data types introduced by the external data set.  Which
interface does one have to implement for that ?

2) Query optimization : There is no cost-based optimizer yet within
Algebricks, therefore there is no API to support retrieval and use of table
statistics from an external data source.

Is something planned in this regard ?

3) Data fetch and update : The VLDB'14 paper states that external data sets
are read-only, static and without indices, but the current codebase has
support for IExternalIndex and IIndexibleExternalDataSource, so presumably
I can fetch records from an external data source (base table scan as well
as index).

Can I write to an external data source ?

4) Hyracks runtime : For data retrieval, is it sufficient to implement the
interfaces within asterix.external.api or does one also have to add some
Hyracks operators which are constructed via contributeRuntimeOperator ?

-Sandeep

Re: external data set support

Posted by Wail Alkowaileet <wa...@gmail.com>.
> One of the papers says that one should add comparators and hash functions
> for any new data types introduced by the external data set.  Which
> interface does one have to implement for that ?

In addition to what Abdullah listed, I guess you also need to write a SerDe
(ISerializerDeserializer) for those types, as well as printers for the output
formats (JSON, LOSSLESS JSON, CSV and ADM) using IPrinterFactory and IPrinter.




-- 

*Regards,*
Wail Alkowaileet

Re: external data set support

Posted by Wail Alkowaileet <wa...@gmail.com>.
From a user perspective:

I think these small features would be really helpful:

About writing to external sources: currently I'm using Spark to write the
results to other data sources or in Parquet file format, or even to convert
CSV to JSON (to avoid Asterix CSV parser limitations). Round-tripping would
be really useful.

There is also reading from sources: I extended the FileBased adapter a bit
to read all the files in a folder. This seems much friendlier when reading a
large number of "hdfs"-like (part0000) files, since it's a bit tedious to
write out the path of every file. (It currently works for localfs.)
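
To illustrate the folder-expansion idea above, here is a minimal sketch using
only the standard java.nio.file API. It is not the actual adapter change; the
class name FolderPaths and the method expand are hypothetical, and the choice
to take only regular files directly inside the folder, in sorted order, is an
assumption made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: expands one folder into the list of files to read,
// so the user does not have to spell out every part file by hand.
final class FolderPaths {
    static List<Path> expand(Path folder) {
        // Files.list is not recursive; the stream must be closed after use.
        try (Stream<Path> entries = Files.list(folder)) {
            return entries
                    .filter(p -> Files.isRegularFile(p)) // skip sub-directories
                    .sorted()                            // stable, predictable order
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting list could then be handed to whatever mechanism currently takes
the explicit per-file paths.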

Also (in my free time) I'm working on the case of ingesting one large
file from the local/NC system. It seems the number of threads Asterix spawns
depends on the number of files being loaded, so in the case of one large
file there will be only one thread to parse it. I'm not sure about the other
way around (i.e. when we have several thousand files). Would it create
several thousand threads? It seems like it does, as my Eclipse debugger
crashed.
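
For contrast with the one-thread-per-file behaviour described above, here is
a minimal sketch of bounding the number of parser threads with a fixed-size
pool from java.util.concurrent. It says nothing about how Asterix actually
schedules its readers; BoundedIngest and parseAll are hypothetical names, and
the "parsing" is a placeholder that only records which worker handled which
file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BoundedIngest {
    // Processes every file, but never with more than maxThreads worker threads.
    // Returns a map from file name to the name of the thread that handled it.
    static Map<String, String> parseAll(List<String> files, int maxThreads) {
        Map<String, String> fileToWorker = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String file : files) {
                pending.add(pool.submit(() -> {
                    // Placeholder for real parsing work on this file.
                    fileToWorker.put(file, Thread.currentThread().getName());
                }));
            }
            for (Future<?> done : pending) {
                done.get(); // wait for completion and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return fileToWorker;
    }
}
```

With this shape, several thousand files queue up behind a handful of workers
instead of each getting its own thread. The single-large-file case is a
different problem (splitting one file), which this sketch does not address.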

Within Hyracks: if we can still maintain the parallelism of group by,
distinct by, order by and limit, this would be really awesome! I always try
to avoid them for performance purposes.




-- 

*Regards,*
Wail Alkowaileet

Re: external data set support

Posted by Mike Carey <dt...@gmail.com>.
Sandeep,

http://dl.acm.org/citation.cfm?id=2806428 is another useful paper to look at.
(This one covers the external data support in more detail than the earlier
papers.)

We would absolutely love to do some of the things that Abdullah raised here,
e.g., pushing more selection/projection into the accesses to file formats
(like Parquet) that support and would benefit from that.  What we have now
makes it possible to treat external files as queryable data, but there's lots
of room for improvement in terms of ultimate efficiency - and it would be cool
to get others working on that.  A bunch of that, as Abdullah says, doesn't
require cost-based optimization - just optimizer rules to push the pushable
criteria into the file access itself (as well as the runtime support to make
that pushing possible).

It would be cool to offer writing to external sources someday - but - there
are a lot of questions that would have to be answered first.  (Producing
results in various file formats would be a great first /
non-transaction-requiring step.)

Cheers,
Mike



Re: external data set support

Posted by Sandeep Joshi <sa...@gmail.com>.
Comments inline.

On Sun, Feb 14, 2016 at 1:14 PM, abdullah alamoudi <ba...@gmail.com>
wrote:

> 2) Query optimization : There is no cost-based optimizer yet within
> Algebricks, therefore there is no API to support retrieval and use of table
> statistics from an external data source.
>
> Is something planned in this regard ?
> Cost based optimizer for internal datasets is being worked on (@Ildar might
> add here). As for external data, unfortunately right now, we don't even
> employ some easy rule based optimizations. For example, we can utilize RC
> files structure to push project into data source operator but we don't do
> that yet. Another optimization that can be done is lazy deserialization of
> records but again we don't do that. There are plans to do all of these but
> we have man power shortage. You are welcome to give them a shot and we can
> assist.
>

I will get back on that...


>
>
> 3) Data fetch and update : The VLDB'14 paper states that external data sets
> are read-only, static and without indices, but the current codebase has
> support for IExternalIndex and IIndexibleExternalDataSource, so presumably
> I can fetch records from an external data source (base table scan as well
> as index).
> Yes, we can access external data through indexes. probably by the time the
> VLDB'14 paper was published, we didn't have this feature yet. You can check
> http://dl.acm.org/citation.cfm?id=2806428 which is about external data
> access and indexing.
>
>
Could you please add this paper to the Publications page ?

https://asterixdb.ics.uci.edu/publications.html

I was going by that information when I asked questions




Re: external data set support

Posted by abdullah alamoudi <ba...@gmail.com>.
Hi Sandeep,
Here are the answers as per my understanding of the questions:

> 1) Schema catalog : One would have to implement IMetadataProvider,
> IDataSource, IDataSourceIndex and other related classes.  Is there any
> functionality missing from the current schema implementation for external
> data sets ?

Schema information for external data already exists, and we use the
AqlMetadataProvider for both external and internal datasets.

> One of the papers says that one should add comparators and hash functions
> for any new data types introduced by the external data set.  Which
> interface does one have to implement for that ?

I am not sure which paper you're referring to, but for adding new data types
(regardless of whether they are used with internal or external datasets;
there is really no distinction) here is what needs to be done:
1. For complex types, one can simply define a type using the create type
statement.
2. For completely new types, one needs to implement at least {IAType,
IBinaryComparatorFactory, and IBinaryComparator}. I am not sure if that is
enough, but it is a starting point.
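
As a rough illustration of the comparator part of item 2, here is a minimal,
self-contained sketch. The ByteComparator and ByteComparatorFactory
interfaces below are hypothetical stand-ins that only mimic the general shape
of a comparator over byte ranges (array, start offset, length); they are not
the real Hyracks IBinaryComparator / IBinaryComparatorFactory, whose exact
signatures should be checked in the codebase. The "point" type and its 8-byte
layout (two big-endian ints, x then y) are also assumptions made for the
example.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a byte-range comparator; NOT the real Hyracks interface.
interface ByteComparator {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

// Hypothetical stand-in for the factory that creates comparators at runtime.
interface ByteComparatorFactory {
    ByteComparator createComparator();
}

// Orders an assumed "point" type serialized as two big-endian 32-bit ints (x, then y).
final class PointComparatorFactory implements ByteComparatorFactory {
    static final int POINT_LENGTH = 8;

    @Override
    public ByteComparator createComparator() {
        return (b1, s1, l1, b2, s2, l2) -> {
            if (l1 != POINT_LENGTH || l2 != POINT_LENGTH) {
                throw new IllegalArgumentException("expected 8-byte points");
            }
            // Compare directly on the serialized bytes, without building objects.
            ByteBuffer p1 = ByteBuffer.wrap(b1, s1, l1);
            ByteBuffer p2 = ByteBuffer.wrap(b2, s2, l2);
            int byX = Integer.compare(p1.getInt(), p2.getInt());
            return byX != 0 ? byX : Integer.compare(p1.getInt(), p2.getInt());
        };
    }

    // Example-only helper: serializes a point into the assumed layout.
    static byte[] encode(int x, int y) {
        return ByteBuffer.allocate(POINT_LENGTH).putInt(x).putInt(y).array();
    }
}
```

The point of the byte-range signature is that values can be compared in place
inside a larger frame, which is why the offsets and lengths are passed along
with the arrays.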

> 2) Query optimization : There is no cost-based optimizer yet within
> Algebricks, therefore there is no API to support retrieval and use of table
> statistics from an external data source.
>
> Is something planned in this regard ?

A cost-based optimizer for internal datasets is being worked on (@Ildar might
add more here). As for external data, unfortunately, right now we don't even
employ some easy rule-based optimizations. For example, we could use the RC
file structure to push projection into the data source operator, but we don't
do that yet. Another possible optimization is lazy deserialization of
records, but again we don't do that. There are plans to do all of these, but
we have a manpower shortage. You are welcome to give them a shot, and we can
assist.
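
To make the projection push-down idea concrete, here is a toy sketch. It is
not AsterixDB or RCFile code: ColumnarFile is a hypothetical in-memory
stand-in for a column-oriented file, and the columnsDecoded counter exists
only to show that a scan with a pushed-down projection touches just the
requested columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy column-oriented "file": each column is stored separately.
final class ColumnarFile {
    private final Map<String, List<String>> columns = new LinkedHashMap<>();
    private int columnsDecoded = 0;

    void addColumn(String name, List<String> values) {
        columns.put(name, values);
    }

    // Scan with the projection pushed into the source: only the requested
    // columns are read, instead of materializing full records and projecting later.
    List<Map<String, String>> scan(List<String> projection) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (columns.isEmpty()) {
            return rows;
        }
        int rowCount = columns.values().iterator().next().size();
        for (int r = 0; r < rowCount; r++) {
            rows.add(new LinkedHashMap<>());
        }
        for (String name : projection) {
            List<String> values = columns.get(name);
            if (values == null) {
                throw new IllegalArgumentException("unknown column: " + name);
            }
            columnsDecoded++; // one more column actually touched
            for (int r = 0; r < rowCount; r++) {
                rows.get(r).put(name, values.get(r));
            }
        }
        return rows;
    }

    int columnsDecoded() {
        return columnsDecoded;
    }
}
```

In a real system the saving comes from skipping the I/O and deserialization
of the unrequested columns; the optimizer rule's job is to notice which
fields the query uses and hand that list to the scan.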


> 3) Data fetch and update : The VLDB'14 paper states that external data sets
> are read-only, static and without indices, but the current codebase has
> support for IExternalIndex and IIndexibleExternalDataSource, so presumably
> I can fetch records from an external data source (base table scan as well
> as index).

Yes, we can access external data through indexes. This feature probably did
not exist yet when the VLDB'14 paper was published. You can check
http://dl.acm.org/citation.cfm?id=2806428 which is about external data
access and indexing.

> Can I write to an external data source ?

Right now, this is not supported because we can't provide the same
transactional guarantees that we can with internal datasets. This point
probably needs to be discussed with Mike before doing anything about it. I
believe we offer something else that can be used instead, which is writing
query results into files, but I am not sure.


> 4) Hyracks runtime : For data retrieval, is it sufficient to implement the
> interfaces within asterix.external.api or does one also have to add some
> Hyracks operators which are constructed via contributeRuntimeOperator ?

For data retrieval, one only needs to implement IExternalDataSourceFactory
along with IRecordReader<? extends T> or IInputStreamProvider (depending on
whether the source produces a stream or a set of records).

For data parsing, one only needs to implement IDataParserFactory along
with IRecordDataParser<T> or IStreamDataParser (depending on whether the
parsed data source produces a stream or a set of records).
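
The reader/parser split described above can be sketched as follows. The
interfaces SimpleRecordReader and SimpleRecordParser are hypothetical,
heavily simplified stand-ins; the real interfaces in asterix.external.api
(IRecordReader, IRecordDataParser and their factories) have different and
richer signatures that should be taken from the codebase. The "id,name" line
format and the UserRecord class are likewise assumptions for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in: produces raw records from a source.
interface SimpleRecordReader<T> extends AutoCloseable {
    boolean hasNext();
    T next();
    @Override
    void close();
}

// Hypothetical stand-in: turns one raw record into a typed value.
interface SimpleRecordParser<T, R> {
    R parse(T rawRecord);
}

// Retrieval side: each line of the character stream is one raw record.
final class LineRecordReader implements SimpleRecordReader<String> {
    private final BufferedReader in;
    private String nextLine;

    LineRecordReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    @Override
    public boolean hasNext() {
        try {
            if (nextLine == null) {
                nextLine = in.readLine(); // null at end of input
            }
            return nextLine != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String line = nextLine;
        nextLine = null;
        return line;
    }

    @Override
    public void close() {
        try {
            in.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class UserRecord {
    final int id;
    final String name;

    UserRecord(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Parsing side: knows the record format, but nothing about where records come from.
final class UserParser implements SimpleRecordParser<String, UserRecord> {
    @Override
    public UserRecord parse(String rawRecord) {
        String[] fields = rawRecord.split(",", 2);
        return new UserRecord(Integer.parseInt(fields[0].trim()), fields[1].trim());
    }
}

final class ExternalScanDemo {
    // Wires a reader and a parser together, as a scan over the source would.
    static List<UserRecord> scan(Reader source) {
        List<UserRecord> out = new ArrayList<>();
        try (LineRecordReader reader = new LineRecordReader(source)) {
            UserParser parser = new UserParser();
            while (reader.hasNext()) {
                out.add(parser.parse(reader.next()));
            }
        }
        return out;
    }
}
```

Keeping retrieval and parsing separate is what lets the same parser be reused
across sources (local files, HDFS, a socket) and the same reader be reused
across formats.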

Let me know if I can provide more information.
Cheers,
Abdullah.

P.S,
Thanks for doing your work before asking. This is a great sign :)

Amoudi, Abdullah.
