Posted to dev@asterixdb.apache.org by Xikui Wang <xi...@uci.edu> on 2016/03/04 20:47:30 UTC

Do we have a method to append local files to existed dataset?

Hi,

I want to import data from multiple adm files into the same dataset. Merging
them together and then loading from localfs is a viable solution, but
this may become a problem when the number of files becomes too large. Is
there a way to append an adm file to an existing dataset?

Thank you.

Best,
Xikui

Re: Do we have a method to append local files to existed dataset?

Posted by Mike Carey <dt...@gmail.com>.

+1

On 3/4/16 2:37 PM, Yingyi Bu wrote:
> Sounds good:-)
> I'm thinking that we can simply use the same HDFS adapter for localfs
> files.  The HDFS API always works for local files. (The only thing needs to
> be done is to change to another URL prefix for local files.) In this way,
> we don't need to worry about how to split a super-large local file:-)
>
> Best,
> Yingyi

Re: Do we have a method to append local files to existed dataset?

Posted by Yingyi Bu <bu...@gmail.com>.

Sounds good:-)
I'm thinking that we can simply use the same HDFS adapter for localfs
files.  The HDFS API always works for local files. (The only thing that needs
to be done is to switch to a different URL prefix for local files.) This way,
we don't need to worry about how to split a super-large local file:-)

Best,
Yingyi

On Fri, Mar 4, 2016 at 2:31 PM, Mike Carey <dt...@gmail.com> wrote:

> It would be nice to have the parallelism of loading be
> dataset-property-determined rather than number-of-input-files determined
> (e.g., min(number of partitions, number of input files)) and then have the
> leaves of the load job each handle a delegated list of files.  How hard
> would that be?  :-)

Re: Do we have a method to append local files to existed dataset?

Posted by Mike Carey <dt...@gmail.com>.

:-)  Thx!!

On 3/5/16 2:12 AM, abdullah alamoudi wrote:
> Not hard at all. (about 5 minutes of work).
>
> Will create a change for it.

Re: Do we have a method to append local files to existed dataset?

Posted by abdullah alamoudi <ba...@gmail.com>.

Not hard at all. (about 5 minutes of work).

Will create a change for it.

On Sat, Mar 5, 2016 at 1:31 AM, Mike Carey <dt...@gmail.com> wrote:

> It would be nice to have the parallelism of loading be
> dataset-property-determined rather than number-of-input-files determined
> (e.g., min(number of partitions, number of input files)) and then have the
> leaves of the load job each handle a delegated list of files.  How hard
> would that be?  :-)

Re: Do we have a method to append local files to existed dataset?

Posted by Mike Carey <dt...@gmail.com>.

It would be nice to have the parallelism of loading be 
dataset-property-determined rather than number-of-input-files determined 
(e.g., min(number of partitions, number of input files)) and then have 
the leaves of the load job each handle a delegated list of files.  How 
hard would that be?  :-)
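
A minimal sketch of the delegation idea above, in Python rather than in AsterixDB's own code. The function name `delegate_files` and the round-robin assignment are assumptions made for illustration; the thread only specifies the degree of parallelism, min(number of partitions, number of input files), and that each leaf handles a delegated list of files.

```python
from typing import List


def delegate_files(files: List[str], num_partitions: int) -> List[List[str]]:
    """Split input files across min(num_partitions, len(files)) loaders.

    Hypothetical helper: each returned list is the set of files that one
    leaf of the load job would process, one file after another.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Degree of parallelism is bounded by the dataset's partitions,
    # not by the number of input files.
    degree = min(num_partitions, len(files))
    # Round-robin assignment (an assumption) keeps the lists similar in size.
    return [files[i::degree] for i in range(degree)]
```

With 1000 input files and 8 partitions, this yields 8 lists of 125 files each instead of 1000 simultaneous loaders; with 3 files and 8 partitions, it yields 3 lists of one file each.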

On 3/4/16 2:04 PM, Young-Seok Kim wrote:
> That makes sense.
>
> Cheers,
> Young-Seok
>
> On Fri, Mar 4, 2016 at 1:48 PM, Yingyi Bu <bu...@gmail.com> wrote:
>
>> Young-Seok,
>>
>> That works when the number of local files is relatively small.
>> However, when the number of localfs files is 1000,  the 1000 files will be
>> loaded in parallel simultaneously, which will exhaust all system resources.
>> Loading from HDFS doesn't have the problem because the 1000 (or more) file
>> splits will be queued into each parallel loader.
>>
>> Best,
>> Yingyi
>>
>>
>> On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <ki...@gmail.com> wrote:
>>
>>> You can also load multiple adm files into a same dataset with a single AQL
>>> as follows:
>>>
>>> load dataset Tweets
>>> using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
>>> (("path"=
>>> "130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
>>> 130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
>>> 130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
>>> 130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
>>> 130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
>>> 130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
>>> 130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
>>> 130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
>>> ("format"="adm"));
>>>
>>> The above AQL loads 8 adm files into a single dataset named Tweets.
>>>
>>> Cheers,
>>> Young-Seok
>>>
>>> On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <xi...@uci.edu> wrote:
>>>
>>>> Hi Yingyi,
>>>>
>>>> Thanks for your reply. I think the external dataset with scan query is a
>>>> good solution.
>>>> I will try that. Thank you.
>>>>
>>>> Best,
>>>> Xikui
>>>>
>>>> On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <bu...@gmail.com> wrote:
>>>>
>>>>> Xikui,
>>>>>
>>>>> If the number of localfs files is too large,  a solution could be to put
>>>>> your files on HDFS and then load it.  Loading from HDFS always has a fixed
>>>>> degree of parallelism regardless of the number of files.
>>>>>
>>>>>>> I am wondering is there a way to append adm file to existed dataset?
>>>>> You can create an external dataset and then write an insert statement where
>>>>> the body is a scan query. AsterixDB doesn't load any data into its own
>>>>> storage for an external dataset but just keeps file paths.
>>>>> Here is a manual for external datasets:
>>>>> https://ci.apache.org/projects/asterixdb/aql/externaldata.html
>>>>>
>>>>> Best,
>>>>> Yingyi
>>>>>
>>>>>
>>>>> On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <xi...@uci.edu> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I want to import data from multiple adm files into a same dataset. Merging
>>>>>> them together and then loading from localfs can be a viable solution, but
>>>>>> this may become a problem when the number become too large. I am wondering
>>>>>> is there a way to append adm file to existed dataset?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Best,
>>>>>> Xikui
>>>>>>

Re: Do we have a method to append local files to existed dataset?

Posted by abdullah alamoudi <ba...@gmail.com>.

You shouldn't get that error even if you're using the localfs. I will
double check that.

On Sat, Mar 5, 2016 at 2:41 AM, Xikui Wang <xi...@uci.edu> wrote:

> Hi,
>
> @Young-Seok, Thanks for noticing. This is quite convenient for loading
> small batch files.
>
> @Yingyi, Thanks for pointing out the limitations. I tried with my datasets
> (700 x 50MB per file),
> and it drained all system resources as you expected. Actually the mechanism
> that you mentioned
> HDFS like localfs is what I am looking for. That would be useful for
> standalone users. Or maybe we just
> don't care standalone users since they are too small. :)
>
> @abdullah, I tried directory path, but it doesn't go through. It raises '
> xxx is a directory error'. I guess it's
> because I am using localfs?
>
> Best,
> Xikui

Re: Do we have a method to append local files to existed dataset?

Posted by Xikui Wang <xi...@uci.edu>.

Hi,

@Young-Seok, thanks for noticing. This is quite convenient for loading
a small batch of files.

@Yingyi, thanks for pointing out the limitations. I tried it with my dataset
(700 files, 50MB per file), and it drained all system resources, as you
expected. Actually, the HDFS-like mechanism for localfs that you mentioned is
what I am looking for. That would be useful for standalone users. Or maybe we
just don't care about standalone users since they are too small. :)

@abdullah, I tried a directory path, but it doesn't go through. It raises an
'xxx is a directory' error. I guess it's because I am using localfs?

Best,
Xikui
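
Until a directory path works with localfs, one possible workaround is to generate the comma-separated "path" value from a directory listing. The sketch below is only an illustration: the helper name `build_path_value` and the `.adm` suffix filter are assumptions, and the `host:///absolute/path` form with a plain comma separator is inferred from the load statement quoted in this thread, not checked against the adapter. As noted earlier in the thread, listing many files this way still starts one loader per file.

```python
import os


def build_path_value(host: str, directory: str, suffix: str = ".adm") -> str:
    """Build a comma-separated "path" value for the files in one directory.

    Hypothetical helper: emits entries of the form host:///abs/path/file.adm,
    following the shape of the load statement quoted in this thread.
    """
    base = os.path.abspath(directory)
    names = sorted(
        name
        for name in os.listdir(base)
        # Keep regular files with the expected suffix; skip subdirectories.
        if name.endswith(suffix) and os.path.isfile(os.path.join(base, name))
    )
    # The absolute path starts with "/", so host + "://" + path gives host:///...
    return ",".join(f"{host}://{os.path.join(base, name)}" for name in names)
```

For example, `build_path_value("127.0.0.1", "/data/tweets")` would return one `127.0.0.1:///data/tweets/<file>.adm` entry per adm file in that directory, joined by commas; the host and directory here are placeholders.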

On Fri, Mar 4, 2016 at 2:28 PM, abdullah alamoudi <ba...@gmail.com>
wrote:

> You can however specify the directory in the path parameter and not the
> individual files and they will be processed sequentially (or 1 thread per
> specified path).

Re: Do we have a method to append local files to existed dataset?

Posted by abdullah alamoudi <ba...@gmail.com>.

You can, however, specify the directory in the path parameter instead of the
individual files, and the files will be processed sequentially (or 1 thread
per specified path).

On Sat, Mar 5, 2016 at 1:04 AM, Young-Seok Kim <ki...@gmail.com> wrote:

> That makes sense.
>
> Cheers,
> Young-Seok

Re: Do we have a method to append local files to existed dataset?

Posted by Young-Seok Kim <ki...@gmail.com>.

That makes sense.

Cheers,
Young-Seok

On Fri, Mar 4, 2016 at 1:48 PM, Yingyi Bu <bu...@gmail.com> wrote:

> Young-Seok,
>
> That works when the number of local files is relatively small.
> However, when the number of localfs files is 1000,  the 1000 files will be
> loaded in parallel simultaneously, which will exhaust all system resources.
> Loading from HDFS doesn't have the problem because the 1000 (or more) file
> splits will be queued into each parallel loader.
>
> Best,
> Yingyi

Re: Do we have a method to append local files to existed dataset?

Posted by Yingyi Bu <bu...@gmail.com>.

Young-Seok,

That works when the number of local files is relatively small.
However, when the number of localfs files is, say, 1000, all 1000 files will
be loaded in parallel at the same time, which will exhaust system resources.
Loading from HDFS doesn't have this problem because the 1000 (or more) file
splits will be queued into each parallel loader.

Best,
Yingyi


On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <ki...@gmail.com> wrote:

> You can also load multiple adm files into a same dataset with a single AQL
> as follows:
>
> load dataset Tweets
> using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
> (("path"=
> "130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
> 130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
> 130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
> 130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
> 130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
> 130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
> 130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
> 130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
> ("format"="adm"));
>
>
> The above AQL loads 8 adm files into a single dataset named Tweets.
>
>
> Cheers,
>
> Young-Seok

Re: Do we have a method to append local files to existed dataset?

Posted by Young-Seok Kim <ki...@gmail.com>.

You can also load multiple adm files into the same dataset with a single AQL
statement, as follows:

load dataset Tweets
using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
(("path"=
"130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
("format"="adm"));

The above AQL statement loads 8 adm files into a single dataset named Tweets.


Cheers,

Young-Seok

On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <xi...@uci.edu> wrote:

> Hi Yingyi,
>
> Thanks for your reply. I think the external dataset with scan query is a
> good solution.
> I will try that. Thank you.
>
> Best,
> Xikui

Re: Do we have a method to append local files to existed dataset?

Posted by Xikui Wang <xi...@uci.edu>.

Hi Yingyi,

Thanks for your reply. I think the external dataset with scan query is a
good solution.
I will try that. Thank you.

Best,
Xikui

On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <bu...@gmail.com> wrote:

> Xikui,
>
> If the number of localfs files is too large,  a solution could be to put
> your files on HDFS and then load it.  Loading from HDFS always has a fixed
> degree of parallelism regardless of the number of files.
>
> >> I am wondering is there a way to append adm file to existed dataset?
> You can create an external dataset and then write an insert statement where
> the body is a scan query. AsterixDB doesn't load any data into its own
> storage for an external dataset but just keeps file paths.
> Here is a manual for external datasets:
> https://ci.apache.org/projects/asterixdb/aql/externaldata.html
>
> Best,
> Yingyi

Re: Do we have a method to append local files to existed dataset?

Posted by Yingyi Bu <bu...@gmail.com>.

Xikui,

If the number of localfs files is too large, a solution could be to put
your files on HDFS and then load them.  Loading from HDFS always has a fixed
degree of parallelism regardless of the number of files.
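
For illustration, such an HDFS load might look roughly like the statement
below. This is only a sketch: HOST:PORT and the directory are placeholders,
Tweets is assumed to be an already-created dataset, and the parameter names
are the ones the external data manual lists for the hdfs adapter, so please
check them against your AsterixDB version.

    // Load every adm file under one HDFS directory into Tweets.
    // HOST:PORT and /path/to/adm/files are placeholders (assumptions).
    load dataset Tweets
    using hdfs
    (("hdfs"="hdfs://HOST:PORT"),
     ("path"="/path/to/adm/files"),
     ("input-format"="text-input-format"),
     ("format"="adm"));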

>> I am wondering is there a way to append adm file to existed dataset?
You can create an external dataset and then write an insert statement where
the body is a scan query. AsterixDB doesn't load any data into its own
storage for an external dataset but just keeps file paths.
Here is a manual for external datasets:
https://ci.apache.org/projects/asterixdb/aql/externaldata.html
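
A minimal sketch of that approach, assuming a datatype TweetType and a
target dataset Tweets already exist. NewTweets, TweetType and the file path
are hypothetical names used only for illustration, and the exact syntax may
differ between AsterixDB versions:

    // External dataset over the new adm file; AsterixDB only records
    // the path here, no data is copied yet.
    create external dataset NewTweets(TweetType)
    using localfs
    (("path"="127.0.0.1:///path/to/new_tweets.adm"),("format"="adm"));

    // Append the external records to the existing dataset via a scan query.
    insert into dataset Tweets (
      for $t in dataset NewTweets
      return $t
    );

The same pair of statements can be repeated with a different path for each
additional file that needs to be appended.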

Best,
Yingyi


On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <xi...@uci.edu> wrote:

> Hi,
>
> I want to import data from multiple adm files into a same dataset. Merging
> them together and then loading from localfs can be a viable solution, but
> this may become a problem when the number become too large. I am wondering
> is there a way to append adm file to existed dataset?
>
> Thank you.
>
> Best,
> Xikui
>