Posted to dev@vxquery.apache.org by Efi <ef...@gmail.com> on 2015/05/17 22:15:14 UTC

[#131]Supporting Hadoop data and cluster management

Hello everyone,

This is my update on what I have been doing this last week:

Created an XMLInputFormat Java class with the functionalities that Hamza 
described in the issue [1]. The class reads from blocks located in HDFS 
and returns complete items according to a specified XML tag.
I also tested this class in a standalone Hadoop cluster with XML files 
of various sizes, the smallest being a single file of 400 MB and the 
largest a collection of 5 files totalling 6.1 GB.
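
Roughly, the item-extraction logic looks like the following. This is a 
simplified, illustrative sketch (the class and names are made up, not the 
actual VXQuery code): scan the block for the configured start tag and keep 
reading until the matching end tag, so only complete items are returned.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/**
 * Illustrative sketch only: given a character stream positioned inside an
 * HDFS block, return the next complete item delimited by <tag ...> ... </tag>.
 */
public class TagItemReader {

    private final String startTagPrefix; // e.g. "<book" (attributes may follow)
    private final String endTag;         // e.g. "</book>"

    public TagItemReader(String tagName) {
        this.startTagPrefix = "<" + tagName;
        this.endTag = "</" + tagName + ">";
    }

    /** Reads characters until a full item is found, or returns null at EOF. */
    public String nextItem(BufferedReader in) throws IOException {
        StringBuilder window = new StringBuilder();
        int itemStart = -1;
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            if (itemStart < 0) {
                // A real reader must also check that the prefix is followed by
                // whitespace or '>', so "<book" does not match "<bookstore".
                int idx = window.indexOf(startTagPrefix);
                if (idx >= 0) {
                    itemStart = idx;
                }
            } else {
                int end = window.indexOf(endTag, itemStart);
                if (end >= 0) {
                    return window.substring(itemStart, end + endTag.length());
                }
            }
        }
        return null; // no complete item left in this stream
    }

    public static void main(String[] args) throws IOException {
        String xml = "<catalog><book name=\"a\">x</book><book>y</book></catalog>";
        TagItemReader reader = new TagItemReader("book");
        BufferedReader in = new BufferedReader(new StringReader(xml));
        String item;
        while ((item = reader.nextItem(in)) != null) {
            System.out.println(item);
        }
    }
}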

This week I will create another implementation of the XMLInputFormat 
with a different way of reading and delivering files, the way I 
described in the same issue, and I will test both solutions in a 
standalone and a small Hadoop cluster (5-6 nodes).

You can see this week's results here [2]. I will keep updating this file 
with the results of the other tests.

Best regards,
Efi

[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2] 
https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing


Re: [Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
Thank you, Eldon, for the suggestion; I also think this is a good idea! I 
will have more news on that matter in the following days.

Efi

On 04/06/2015 08:50 μμ, Eldon Carman wrote:
> We have a set of JUnit tests to validate VXQuery. I think it would be a
> good idea to add test cases that validate the HDFS code your adding to the
> code base. Take a look at the vxquery-xtest sub-project. The VXQuery
> Catalog holds all the vxquery test cases [1]. You could add a new HDFS test
> group to this list catalog.
>
> 1.
> https://github.com/apache/vxquery/blob/master/vxquery-xtest/src/test/resources/VXQueryCatalog.xml
>
> On Thu, Jun 4, 2015 at 10:26 AM, Efi <ef...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> This week Preston and Steven helped me with the vxquery code and
>> specifically where my parser and two more functionalities will fit in the
>> code.
>>
>> Along with the hdfs parallel parser that I have been working on these past
>> weeks,two more methods will be implemented.They will both read whole files
>> from hdfs and not just blocks.The one will read all the files located in a
>> directory in hdfs and the other will read a single document.
>>
>> The reading of files from a directory is completed and for the next week I
>> will focus on testing it and implementing/testing the second method,
>> reading of a single document.
>>
>> Best regards,
>> Efi
>>


[Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
Hello everyone,

The update for this week consists of two parts. The first is the 
CollectionWithTagRule, which is about reading the blocks from HDFS using 
the XMLInputFormat class. This rule informs the parser that it needs to 
read its data in blocks from HDFS and passes some additional information 
that is needed in order to read the items correctly. I made one change in 
the XMLInputFormat class: the class reads a block from HDFS and looks for 
the starting and closing tags that the user specified in the query. Until 
now I did not take into account that the opening tag may contain more 
information regarding the item, for example:
<book name="something">
...
...
</book>

but I was only looking for tags like:
<book>
...
...
</book>

I changed that to take into account that the opening tag may contain 
additional information and to include it in the returned item.
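
As a rough illustration of that change (hypothetical code, not the actual 
patch), the start-tag check has to accept an optional attribute list 
instead of only the literal tag:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartTagMatch {
    public static void main(String[] args) {
        // Accepts both "<book>" and "<book name=\"something\">", but not
        // "<bookstore>"; the whole opening tag is kept so the attributes
        // end up in the returned item.
        Pattern startTag = Pattern.compile("<book(\\s[^>]*)?>");

        for (String s : new String[] {"<book>", "<book name=\"something\">", "<bookstore>"}) {
            Matcher m = startTag.matcher(s);
            System.out.println(s + " -> " + m.matches());
        }
    }
}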

The second part of the update is about the YARN applications, Slider and 
Twill, that I tested this week and my conclusions about which can be used 
better with VXQuery.
    - Slider: Requires mostly configuration files and Python scripts for 
the application to work, which I find very good and generic because, with 
small changes to the configuration, you can reuse the same work in similar 
projects.
    - Twill: Requires ZooKeeper installed along with YARN in order to 
work. This approach mostly needs changes in the code of the project 
you want to use with Twill.

Based on these, I find Slider, yet again, the better candidate. Still, if 
anyone has more experience with either of these systems, I would welcome 
some feedback on my observations and, of course, on which one is best.

Thank you,
Efi

[Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
Hello everyone,

A lot happened this week. First of all, we had a discussion with Steven, 
Preston and Ian about the YARN implementation. Ian referenced 3 projects 
that help you use YARN cluster management with your application; these 
are Slider, Twill and Kitten. Slider seems to be the most promising and 
stable of the three, so this is the one I started learning and testing 
first. We discussed the limitations and the benefits, and we believe it is 
better to use one of them instead of writing our own classes that will use 
YARN, like Apache Flink does. As I said, I started testing them and, of 
course, if they do not work out the way we want, I will work on creating a 
YARN connector for VXQuery as planned at first.

Regarding the parallel reading from HDFS, Preston explained to me a 
lot about the collection rules and how to implement the rule for 
CollectionWithTag, which is what the user will include in the query to 
enable the parallelization of the data from HDFS. The rule is almost 
ready and mostly needs testing, especially with a real distributed 
cluster. For these tests I set up a distributed HDFS cluster of 3 VMs, 
one master and two slaves, and I will run the tests on them this week.

Any insights and thoughts on these subjects are more than welcome!

Cheers,
Efi

Re: [Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
Greetings everyone,

     This week the implementation of the MiniDFSCluster that will run 
the HDFS tests for VXQuery is completed. In the cluster.properties 
configuration file of the VXQuery server I added another property that 
gives the path to the configuration file of HDFS. By default this 
value is set to the configuration folder of the MiniDFSCluster. Users 
that want to run their queries on their own DFS cluster will have to 
change that value to the configuration path of their HDFS cluster.
     This functionality along with some minor changes in the code will 
be added in my next pull request.
     Also, the split scheduler is added to my VXQuery codebase and I 
am currently trying to make the XMLParser parse the data blocks from 
HDFS. It needs more work, since the parser expects well-formed XML 
documents and the blocks returned from HDFS are just parts of the 
complete file. This is the part that I will focus on completing for the 
next week.
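
For anyone who wants to picture the setup, here is a minimal sketch of a 
JUnit test backed by a MiniDFSCluster. The names and paths are illustrative 
(this is not the actual VXQuery test code), and it assumes a Hadoop 2.x 
style MiniDFSCluster.Builder with the hadoop-hdfs test jar on the classpath:

import static org.junit.Assert.assertTrue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class HdfsReadTest {

    private static MiniDFSCluster cluster;
    private static FileSystem fs;

    @BeforeClass
    public static void startCluster() throws Exception {
        Configuration conf = new Configuration();
        // The temporary cluster's configuration folder is what the new
        // cluster.properties entry points to by default.
        cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
        fs = cluster.getFileSystem();
    }

    @AfterClass
    public static void stopCluster() {
        if (cluster != null) {
            cluster.shutdown();
        }
    }

    @Test
    public void queryInputIsVisibleInHdfs() throws Exception {
        Path src = new Path("src/test/resources/books.xml"); // illustrative path
        Path dst = new Path("/test/books.xml");
        fs.copyFromLocalFile(src, dst);
        assertTrue(fs.exists(dst));
    }
}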

Thank you,
Efi

Re: [Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
Hello everyone,

     This week's update is about the changes that I mentioned in my last 
update. The JUnit test is not completed yet; I am using a MiniDFSCluster 
implementation for the tests, but I haven't managed to get it to work 
correctly yet. I believe the problems are trivial and have not reported 
them in the ticket so far. I will create a ticket if I continue to 
receive the same errors.

     About the input splits, I have implemented a scheduler that maps 
which split should be processed by which node according to the split's 
location and the number of splits and nodes. I need to test this as well 
before I commit it.
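
Roughly, the idea is the following (a simplified, hypothetical sketch; the 
committed version will differ): assign each split to a node that holds one 
of its replicas when possible, and fall back to round robin otherwise.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitScheduler {

    /**
     * splitLocations.get(i) holds the hostnames that store split i;
     * returns a mapping from node name to the indexes of the splits it reads.
     */
    public static Map<String, List<Integer>> schedule(List<String[]> splitLocations,
                                                      List<String> nodes) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String node : nodes) {
            assignment.put(node, new ArrayList<Integer>());
        }
        int rr = 0; // round-robin cursor for splits with no local node
        for (int i = 0; i < splitLocations.size(); i++) {
            String chosen = null;
            for (String host : splitLocations.get(i)) {
                if (assignment.containsKey(host)) {
                    chosen = host; // a replica lives on one of our nodes
                    break;
                }
            }
            if (chosen == null) {
                chosen = nodes.get(rr++ % nodes.size()); // no local replica: spread evenly
            }
            assignment.get(chosen).add(i);
        }
        return assignment;
    }
}
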
That's all for this week.

Best regards,
Efi

On 25/06/2015 07:30 μμ, Efi wrote:
> Thank you Eldon, that's was very helpful and I had completely 
> overlooked it when I first setup up my eclipse for vxquery.
>
> This week I continue working on reading blocks from HDFS, I used some 
> of the hyracks-hdfs-core classes and methods and I was able to get the 
> splits of input files from HDFS without having to use a Map function.I 
> will continue working on how to distribute and read correctly the 
> splits between the nodes of the vxquery cluster.
>
> I will also do some changes to the JUnit tests for HDFS.They will 
> start a temporary dfs cluster in order to run the tests instead of 
> just failing when the user does not have an HDFS cluster.
>
> Cheers,
> Efi
>
> On 16/06/2015 08:42 μμ, Eldon Carman wrote:
>> Looks good. One quick comment, take a look at our code format and style
>> guidelines. You can set up eclipse to format your code for you using our
>> sister project's code format profile [1].
>>
>> [1] http://vxquery.apache.org/development_eclipse_setup.html
>>
>> On Sat, Jun 13, 2015 at 11:03 AM, Michael Carey <mj...@ics.uci.edu> 
>> wrote:
>>
>>> Very cool!!
>>>
>>>
>>> On 6/13/15 9:38 AM, Efi wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> The reading of a single document and a collection of documents from 
>>>> HDFS
>>>> is completed and tested.New JUnit tests are added in the xtest 
>>>> project,
>>>> they are just copies of the aggregate tests, that I changed a bit 
>>>> to run
>>>> for the collection reading from HDFS.
>>>>
>>>> I added another option in the xtest in order for the HDFS tests to run
>>>> successfully.It is a boolean option called /hdfs/ and it enables 
>>>> the tests
>>>> for HDFS to run.
>>>>
>>>> You can view these in the branch /hdfs2_read/ in my github fork of
>>>> vxquery. [1]
>>>>
>>>> I will continue with the parallel reading from HDFS.
>>>>
>>>> Best Regards,
>>>> Efi
>>>>
>>>> [1] https://github.com/efikalti/vxquery/tree/hdfs2_read
>>>>
>>>> On 04/06/2015 08:50 μμ, Eldon Carman wrote:
>>>>
>>>>> We have a set of JUnit tests to validate VXQuery. I think it would 
>>>>> be a
>>>>> good idea to add test cases that validate the HDFS code your 
>>>>> adding to
>>>>> the
>>>>> code base. Take a look at the vxquery-xtest sub-project. The VXQuery
>>>>> Catalog holds all the vxquery test cases [1]. You could add a new 
>>>>> HDFS
>>>>> test
>>>>> group to this list catalog.
>>>>>
>>>>> 1.
>>>>>
>>>>> https://github.com/apache/vxquery/blob/master/vxquery-xtest/src/test/resources/VXQueryCatalog.xml 
>>>>>
>>>>>
>>>>> On Thu, Jun 4, 2015 at 10:26 AM, Efi <ef...@gmail.com> wrote:
>>>>>
>>>>>   Hello everyone,
>>>>>> This week Preston and Steven helped me with the vxquery code and
>>>>>> specifically where my parser and two more functionalities will 
>>>>>> fit in
>>>>>> the
>>>>>> code.
>>>>>>
>>>>>> Along with the hdfs parallel parser that I have been working on 
>>>>>> these
>>>>>> past
>>>>>> weeks,two more methods will be implemented.They will both read whole
>>>>>> files
>>>>>> from hdfs and not just blocks.The one will read all the files 
>>>>>> located
>>>>>> in a
>>>>>> directory in hdfs and the other will read a single document.
>>>>>>
>>>>>> The reading of files from a directory is completed and for the next
>>>>>> week I
>>>>>> will focus on testing it and implementing/testing the second method,
>>>>>> reading of a single document.
>>>>>>
>>>>>> Best regards,
>>>>>> Efi
>>>>>>
>>>>>>
>>>>
>


[Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
Thank you Eldon, that was very helpful and I had completely overlooked 
it when I first set up my Eclipse for VXQuery.

This week I continued working on reading blocks from HDFS. I used some of 
the hyracks-hdfs-core classes and methods and was able to get the 
splits of input files from HDFS without having to use a Map function. I 
will continue working on how to distribute and correctly read the splits 
between the nodes of the VXQuery cluster.
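
For reference, asking an InputFormat for the splits directly, without 
running a MapReduce job, looks roughly like this with the classic mapred 
API (the hyracks-hdfs-core wrappers differ in detail; the path and 
namenode address here are illustrative):

import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class ListSplits {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // illustrative namenode
        FileInputFormat.setInputPaths(conf, new Path("/data/books")); // illustrative dir

        TextInputFormat format = new TextInputFormat();
        format.configure(conf);

        // Roughly one split per block; each split knows which hosts store it,
        // which is what a locality-aware scheduler can use to place readers.
        InputSplit[] splits = format.getSplits(conf, 1);
        for (InputSplit split : splits) {
            System.out.println(split + " on " + Arrays.toString(split.getLocations()));
        }
    }
}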

I will also make some changes to the JUnit tests for HDFS. They will start 
a temporary DFS cluster in order to run the tests, instead of just 
failing when the user does not have an HDFS cluster.

Cheers,
Efi

On 16/06/2015 08:42 μμ, Eldon Carman wrote:
> Looks good. One quick comment, take a look at our code format and style
> guidelines. You can set up eclipse to format your code for you using our
> sister project's code format profile [1].
>
> [1] http://vxquery.apache.org/development_eclipse_setup.html
>
> On Sat, Jun 13, 2015 at 11:03 AM, Michael Carey <mj...@ics.uci.edu> wrote:
>
>> Very cool!!
>>
>>
>> On 6/13/15 9:38 AM, Efi wrote:
>>
>>> Hello everyone,
>>>
>>> The reading of a single document and a collection of documents from HDFS
>>> is completed and tested.New JUnit tests are added in the xtest project,
>>> they are just copies of the aggregate tests, that I changed a bit to run
>>> for the collection reading from HDFS.
>>>
>>> I added another option in the xtest in order for the HDFS tests to run
>>> successfully.It is a boolean option called /hdfs/ and it enables the tests
>>> for HDFS to run.
>>>
>>> You can view these in the branch /hdfs2_read/ in my github fork of
>>> vxquery. [1]
>>>
>>> I will continue with the parallel reading from HDFS.
>>>
>>> Best Regards,
>>> Efi
>>>
>>> [1] https://github.com/efikalti/vxquery/tree/hdfs2_read
>>>
>>> On 04/06/2015 08:50 μμ, Eldon Carman wrote:
>>>
>>>> We have a set of JUnit tests to validate VXQuery. I think it would be a
>>>> good idea to add test cases that validate the HDFS code your adding to
>>>> the
>>>> code base. Take a look at the vxquery-xtest sub-project. The VXQuery
>>>> Catalog holds all the vxquery test cases [1]. You could add a new HDFS
>>>> test
>>>> group to this list catalog.
>>>>
>>>> 1.
>>>>
>>>> https://github.com/apache/vxquery/blob/master/vxquery-xtest/src/test/resources/VXQueryCatalog.xml
>>>>
>>>> On Thu, Jun 4, 2015 at 10:26 AM, Efi <ef...@gmail.com> wrote:
>>>>
>>>>   Hello everyone,
>>>>> This week Preston and Steven helped me with the vxquery code and
>>>>> specifically where my parser and two more functionalities will fit in
>>>>> the
>>>>> code.
>>>>>
>>>>> Along with the hdfs parallel parser that I have been working on these
>>>>> past
>>>>> weeks,two more methods will be implemented.They will both read whole
>>>>> files
>>>>> from hdfs and not just blocks.The one will read all the files located
>>>>> in a
>>>>> directory in hdfs and the other will read a single document.
>>>>>
>>>>> The reading of files from a directory is completed and for the next
>>>>> week I
>>>>> will focus on testing it and implementing/testing the second method,
>>>>> reading of a single document.
>>>>>
>>>>> Best regards,
>>>>> Efi
>>>>>
>>>>>
>>>


Re: [Supporting Hadoop data and cluster management] weekly update

Posted by Eldon Carman <ec...@ucr.edu>.
Looks good. One quick comment: take a look at our code format and style
guidelines. You can set up Eclipse to format your code for you using our
sister project's code format profile [1].

[1] http://vxquery.apache.org/development_eclipse_setup.html

On Sat, Jun 13, 2015 at 11:03 AM, Michael Carey <mj...@ics.uci.edu> wrote:

> Very cool!!
>
>
> On 6/13/15 9:38 AM, Efi wrote:
>
>> Hello everyone,
>>
>> The reading of a single document and a collection of documents from HDFS
>> is completed and tested.New JUnit tests are added in the xtest project,
>> they are just copies of the aggregate tests, that I changed a bit to run
>> for the collection reading from HDFS.
>>
>> I added another option in the xtest in order for the HDFS tests to run
>> successfully.It is a boolean option called /hdfs/ and it enables the tests
>> for HDFS to run.
>>
>> You can view these in the branch /hdfs2_read/ in my github fork of
>> vxquery. [1]
>>
>> I will continue with the parallel reading from HDFS.
>>
>> Best Regards,
>> Efi
>>
>> [1] https://github.com/efikalti/vxquery/tree/hdfs2_read
>>
>> On 04/06/2015 08:50 μμ, Eldon Carman wrote:
>>
>>> We have a set of JUnit tests to validate VXQuery. I think it would be a
>>> good idea to add test cases that validate the HDFS code your adding to
>>> the
>>> code base. Take a look at the vxquery-xtest sub-project. The VXQuery
>>> Catalog holds all the vxquery test cases [1]. You could add a new HDFS
>>> test
>>> group to this list catalog.
>>>
>>> 1.
>>>
>>> https://github.com/apache/vxquery/blob/master/vxquery-xtest/src/test/resources/VXQueryCatalog.xml
>>>
>>> On Thu, Jun 4, 2015 at 10:26 AM, Efi <ef...@gmail.com> wrote:
>>>
>>>  Hello everyone,
>>>>
>>>> This week Preston and Steven helped me with the vxquery code and
>>>> specifically where my parser and two more functionalities will fit in
>>>> the
>>>> code.
>>>>
>>>> Along with the hdfs parallel parser that I have been working on these
>>>> past
>>>> weeks,two more methods will be implemented.They will both read whole
>>>> files
>>>> from hdfs and not just blocks.The one will read all the files located
>>>> in a
>>>> directory in hdfs and the other will read a single document.
>>>>
>>>> The reading of files from a directory is completed and for the next
>>>> week I
>>>> will focus on testing it and implementing/testing the second method,
>>>> reading of a single document.
>>>>
>>>> Best regards,
>>>> Efi
>>>>
>>>>
>>
>>
>

Re: [Supporting Hadoop data and cluster management] weekly update

Posted by Michael Carey <mj...@ics.uci.edu>.
Very cool!!

On 6/13/15 9:38 AM, Efi wrote:
> Hello everyone,
>
> The reading of a single document and a collection of documents from 
> HDFS is completed and tested.New JUnit tests are added in the xtest 
> project, they are just copies of the aggregate tests, that I changed a 
> bit to run for the collection reading from HDFS.
>
> I added another option in the xtest in order for the HDFS tests to run 
> successfully.It is a boolean option called /hdfs/ and it enables the 
> tests for HDFS to run.
>
> You can view these in the branch /hdfs2_read/ in my github fork of 
> vxquery. [1]
>
> I will continue with the parallel reading from HDFS.
>
> Best Regards,
> Efi
>
> [1] https://github.com/efikalti/vxquery/tree/hdfs2_read
>
> On 04/06/2015 08:50 μμ, Eldon Carman wrote:
>> We have a set of JUnit tests to validate VXQuery. I think it would be a
>> good idea to add test cases that validate the HDFS code your adding 
>> to the
>> code base. Take a look at the vxquery-xtest sub-project. The VXQuery
>> Catalog holds all the vxquery test cases [1]. You could add a new 
>> HDFS test
>> group to this list catalog.
>>
>> 1.
>> https://github.com/apache/vxquery/blob/master/vxquery-xtest/src/test/resources/VXQueryCatalog.xml 
>>
>>
>> On Thu, Jun 4, 2015 at 10:26 AM, Efi <ef...@gmail.com> wrote:
>>
>>> Hello everyone,
>>>
>>> This week Preston and Steven helped me with the vxquery code and
>>> specifically where my parser and two more functionalities will fit 
>>> in the
>>> code.
>>>
>>> Along with the hdfs parallel parser that I have been working on 
>>> these past
>>> weeks,two more methods will be implemented.They will both read whole 
>>> files
>>> from hdfs and not just blocks.The one will read all the files 
>>> located in a
>>> directory in hdfs and the other will read a single document.
>>>
>>> The reading of files from a directory is completed and for the next 
>>> week I
>>> will focus on testing it and implementing/testing the second method,
>>> reading of a single document.
>>>
>>> Best regards,
>>> Efi
>>>
>
>


Re: [Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
Hello everyone,

The reading of a single document and a collection of documents from HDFS 
is completed and tested. New JUnit tests are added in the xtest project; 
they are just copies of the aggregate tests that I changed a bit to run 
for the collection reading from HDFS.

I added another option in xtest in order for the HDFS tests to run 
successfully. It is a boolean option called /hdfs/ and it enables the 
tests for HDFS to run.
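
A small sketch of one way such a switch can be honoured in a JUnit test, 
assuming it is surfaced as a system property named hdfs (the actual xtest 
wiring may differ):

import static org.junit.Assume.assumeTrue;

import org.junit.Before;
import org.junit.Test;

public class HdfsCollectionTest {

    @Before
    public void requireHdfsOptIn() {
        // Skip (rather than fail) when the HDFS tests were not enabled,
        // e.g. run with -Dhdfs=true to turn them on.
        assumeTrue(Boolean.getBoolean("hdfs"));
    }

    @Test
    public void collectionFromHdfs() {
        // ... the copied aggregate test, adapted to read its collection from HDFS ...
    }
}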

You can view these in the branch /hdfs2_read/ in my GitHub fork of 
VXQuery [1].

I will continue with the parallel reading from HDFS.

Best Regards,
Efi

[1] https://github.com/efikalti/vxquery/tree/hdfs2_read

On 04/06/2015 08:50 μμ, Eldon Carman wrote:
> We have a set of JUnit tests to validate VXQuery. I think it would be a
> good idea to add test cases that validate the HDFS code your adding to the
> code base. Take a look at the vxquery-xtest sub-project. The VXQuery
> Catalog holds all the vxquery test cases [1]. You could add a new HDFS test
> group to this list catalog.
>
> 1.
> https://github.com/apache/vxquery/blob/master/vxquery-xtest/src/test/resources/VXQueryCatalog.xml
>
> On Thu, Jun 4, 2015 at 10:26 AM, Efi <ef...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> This week Preston and Steven helped me with the vxquery code and
>> specifically where my parser and two more functionalities will fit in the
>> code.
>>
>> Along with the hdfs parallel parser that I have been working on these past
>> weeks,two more methods will be implemented.They will both read whole files
>> from hdfs and not just blocks.The one will read all the files located in a
>> directory in hdfs and the other will read a single document.
>>
>> The reading of files from a directory is completed and for the next week I
>> will focus on testing it and implementing/testing the second method,
>> reading of a single document.
>>
>> Best regards,
>> Efi
>>


Re: [Supporting Hadoop data and cluster management] weekly update

Posted by Eldon Carman <ec...@ucr.edu>.
We have a set of JUnit tests to validate VXQuery. I think it would be a
good idea to add test cases that validate the HDFS code you're adding to the
code base. Take a look at the vxquery-xtest sub-project. The VXQuery
Catalog holds all the VXQuery test cases [1]. You could add a new HDFS test
group to this catalog.

1.
https://github.com/apache/vxquery/blob/master/vxquery-xtest/src/test/resources/VXQueryCatalog.xml

On Thu, Jun 4, 2015 at 10:26 AM, Efi <ef...@gmail.com> wrote:

> Hello everyone,
>
> This week Preston and Steven helped me with the vxquery code and
> specifically where my parser and two more functionalities will fit in the
> code.
>
> Along with the hdfs parallel parser that I have been working on these past
> weeks,two more methods will be implemented.They will both read whole files
> from hdfs and not just blocks.The one will read all the files located in a
> directory in hdfs and the other will read a single document.
>
> The reading of files from a directory is completed and for the next week I
> will focus on testing it and implementing/testing the second method,
> reading of a single document.
>
> Best regards,
> Efi
>

[Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
Hello everyone,

This week Preston and Steven helped me with the VXQuery code, and 
specifically with where my parser and two more functionalities will fit in 
the code.

Along with the HDFS parallel parser that I have been working on these 
past weeks, two more methods will be implemented. They will both read 
whole files from HDFS and not just blocks. One will read all the 
files located in a directory in HDFS and the other will read a single 
document.

The reading of files from a directory is completed, and for the next week 
I will focus on testing it and on implementing/testing the second method, 
reading a single document.

Best regards,
Efi

[Supporting Hadoop data and cluster management] weekly update

Posted by Efi <ef...@gmail.com>.
For this week I studied the VXQuery and Hyracks code in detail, in order 
to add my parser to the project.

I will continue working on adding my code to VXQuery and will try to 
implement some tests for it as well. I am also looking into ways to use 
the Hyracks HDFS code for the HDFS parser.

Thank you,
Efi



Re: [#131] Supporting Hadoop data and cluster management

Posted by Efi <ef...@gmail.com>.
If I understand correctly, the problem you did not understand is the one 
about the block assignment to mappers.

I believe this is Hadoop functionality: the number of mappers it 
assigns for a job is equal to the available CPU cores. If the number of 
blocks is less than the number of mappers, the same block will be 
assigned to more than one mapper for parsing. The problem is that the 
mappers in the same machine share the allowed memory, which means the 
more mappers, the less memory for each one of them.

For example, in a machine with 4 cores, Hadoop will assign 4 mappers. If 
our input has only two blocks, block1 and block2, they will be given to 
all mappers for parsing. So mapper1 and mapper2 will both get block1, and 
mapper3 and mapper4 will get block2. So the available memory of this node 
will be distributed among 4 mappers that will parse 2 blocks. I want to 
make it so that there are only 2 mappers, in order to get more 
memory for each one.
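
For what it's worth, with the classic mapred API the number of map tasks 
can at least be hinted at, along these lines (illustrative only; whether 
the hint is honoured depends on the InputFormat, and the input path is 
made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MapperHint {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileInputFormat.setInputPaths(conf, new Path("/data/books")); // illustrative dir

        int numBlocks = 2; // e.g. block1 and block2 from the example above
        // Ask for one map task per block so the per-mapper memory is not
        // split further; this is only a hint, not a guarantee.
        conf.setNumMapTasks(numBlocks);
    }
}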

As an alternative solution, I am also looking at the Hyracks code for 
hdfs and hdfs2, in order to use that instead of the MapReduce framework 
for reading the blocks.

I hope I answered your question.

Best regards,
Efi

On 25/05/2015 07:55 πμ, Till Westmann wrote:
>
> On 22 May 2015, at 3:26, Efi wrote:
>
>> Thank you for the recursively tag check, Steven told me about it 
>> yesterday as well.I hadnt thought of it so far but I will think of 
>> ways to implement it for these methods so it does not create problems.
>>
>> My question was not exactly that, I was considering if the query 
>> engine could parse data that have complete elements but miss other 
>> tags from greater elements.
>> For example, one data that comes from either of these methods can 
>> look like this:
>>
>> <books>
>> ....
>> <book>
>> ...
>> </book>
>>
>> And another one like this:
>>
>> <book>
>> ....
>> </book>
>> ...
>> </books>
>>
>> The query is about data inside the element book, will these work with 
>> the query engine?
>
> I would hope so. I assume, that everything before the fist <book> and 
> between a </book> and the next <book> should be ignored. And 
> everything between a <book> and a </book> is probably parsed and 
> passed to the query engine.
> Does that make sense?
>
>> About your answer for the scenario where a block does not contain the 
>> tags in question, it can mean two things.It is not part of the 
>> element we want to work with,so we simply ignore it, or it is part of 
>> the element but the starting and ending tags are in previous/next 
>> blocks. So this block contains only part of the body that we want.In 
>> that case it will be parsed only by the readers that are assigned to 
>> read the block that contains the starting tag of this element.
>
> Yes, that sounds right.
>
>> On that note, I am currently working on a way to assign only one 
>> reader to each block, because hdfs assigns readers according to the 
>> available cores of the CPUs you use.That means the same block can be 
>> assigned to more than one readers and in our case that can lead to 
>> memory problems.
>
> I'm not sure I fully understand the current design. Could you explain 
> in a little more detail in which case you see which problem coming up 
> (I can imagine a number of problems with memory ...)?
>
> Cheers,
> Till
>
>> On 22/05/2015 06:53 πμ, Till Westmann wrote:
>>> (1) I agree that [1] looks better (thanks for the diagrams - we 
>>> should add them to the docs!).
>>> (2) I think that it’s ok to have the restriction, that the given tag
>>>   (a) identifies the root element of the elements that we want to 
>>> work with and
>>>   (b) is not used recursively (and I would check this condition and 
>>> fail if it doesn’t hold).
>>>
>>> If we have a few really big nodes in the file, we anyway do not have 
>>> a way to process them in parallel, so the chosen tags should split 
>>> the document into a large number of smaller pieces for VXQuery to 
>>> work well.
>>>
>>> Wrt. to the question what happens if we start reading a block that 
>>> does not contain the tag(s) in question (I think that that’s the 
>>> last question - please correct me if I’m wrong) it would probably be 
>>> read without producing any nodes that will be processed by the query 
>>> engine. So the effort to do that would be wasted, but I would expect 
>>> that the block would then be parsed again as the continuation of 
>>> another block that contained a start tag.
>>>
>>> Till
>>>
>>>> On May 21, 2015, at 2:59 PM, Steven Jacobs <sj...@ucr.edu> wrote:
>>>>
>>>> This seems correct to me. Since our objective in implementing HDFS 
>>>> is to
>>>> deal with very large XML files, I think we should avoid any size
>>>> limitations. Regarding the tags, does anyone have any thoughts on 
>>>> this? In
>>>> the case of searching for all elements with a given name regardless of
>>>> depth, this method will work fine, but if we want a specific path, 
>>>> we could
>>>> end up opening lots of Blocks to guarantee path correctness, the 
>>>> entire
>>>> file in fact.
>>>> Steven
>>>>
>>>> On Thu, May 21, 2015 at 10:20 AM, Efi <ef...@gmail.com> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> For this week the two different methods for reading complete items
>>>>> according to a specific tag are completed and tested in standalone 
>>>>> hdfs
>>>>> deployment.In detail what each method does:
>>>>>
>>>>> The first method, I call it One Buffer Method, reads a block, 
>>>>> saves it in
>>>>> a buffer, and continues reading from the other blocks until it 
>>>>> finds a
>>>>> specific closing tag.It shows good results and good times in the 
>>>>> tests.
>>>>>
>>>>> The second method, called Shared File Method, reads only the complete
>>>>> items contained in the block and the incomplete items from the 
>>>>> start and
>>>>> end of the block are send to a shared file in the hdfs Distributed 
>>>>> Cache.
>>>>> Now this method could work only for relatively small inputs, since 
>>>>> the
>>>>> Distributed Cache is limited and in the case of hundreds/thousands of
>>>>> blocks the shared file can exceed the limit.
>>>>>
>>>>> I took the liberty of creating diagrams that show in example what 
>>>>> each
>>>>> method does.
>>>>> [1] One Buffer Method
>>>>> [2] Shared File Method
>>>>>
>>>>> Every insight and feedback is more than welcome about these two 
>>>>> methods.In
>>>>> my opinion the One Buffer method is simpler and more effective 
>>>>> since it can
>>>>> be used for both small and large datasets.
>>>>>
>>>>> There is also a question, can the parser work on data that are 
>>>>> missing
>>>>> some tags?For example the first and last tag of the xml file that are
>>>>> located in different blocks.
>>>>>
>>>>> Best regards,
>>>>> Efi
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing 
>>>>>
>>>>>
>>>>> [2]
>>>>> https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing 
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 05/19/2015 12:43 AM, Michael Carey wrote:
>>>>>
>>>>>> +1 Sounds great!
>>>>>>
>>>>>> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>>>>>>
>>>>>>> Great work!
>>>>>>> Steven
>>>>>>>
>>>>>>> On Sun, May 17, 2015 at 1:15 PM, Efi <ef...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hello everyone,
>>>>>>>> This is my update on what I have been doing this last week:
>>>>>>>>
>>>>>>>> Created an XMLInputFormat java class with the functionalities 
>>>>>>>> that Hamza
>>>>>>>> described in the issue [1] .The class reads from blocks located 
>>>>>>>> in HDFS
>>>>>>>> and
>>>>>>>> returns complete items according to a specified xml tag.
>>>>>>>> I also tested this class in a standalone hadoop cluster with 
>>>>>>>> xml files
>>>>>>>> of
>>>>>>>> various sizes, the smallest being a single file of 400 MB and the
>>>>>>>> largest a
>>>>>>>> collection of 5 files totalling 6.1 GB.
>>>>>>>>
>>>>>>>> This week I will create another implementation of the 
>>>>>>>> XMLInputFormat
>>>>>>>> with
>>>>>>>> a different way of reading and delivering files, the way I 
>>>>>>>> described in
>>>>>>>> the
>>>>>>>> same issue and I will test both solutions in a standalone and a 
>>>>>>>> small
>>>>>>>> hadoop cluster (5-6 nodes).
>>>>>>>>
>>>>>>>> You can see this week's results here [2] .I will keep updating 
>>>>>>>> this file
>>>>>>>> about the other tests.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Efi
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>>>>>>> [2]
>>>>>>>>
>>>>>>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>


Re: [#131] Supporting Hadoop data and cluster management

Posted by Till Westmann <ti...@apache.org>.
On 22 May 2015, at 3:26, Efi wrote:

> Thank you for the recursively tag check, Steven told me about it 
> yesterday as well.I hadnt thought of it so far but I will think of 
> ways to implement it for these methods so it does not create problems.
>
> My question was not exactly that, I was considering if the query 
> engine could parse data that have complete elements but miss other 
> tags from greater elements.
> For example, one data that comes from either of these methods can look 
> like this:
>
> <books>
> ....
> <book>
> ...
> </book>
>
> And another one like this:
>
> <book>
> ....
> </book>
> ...
> </books>
>
> The query is about data inside the element book, will these work with 
> the query engine?

I would hope so. I assume that everything before the first <book> and 
between a </book> and the next <book> should be ignored. And everything 
between a <book> and a </book> is probably parsed and passed to the 
query engine.
Does that make sense?

> About your answer for the scenario where a block does not contain the 
> tags in question, it can mean two things.It is not part of the element 
> we want to work with,so we simply ignore it, or it is part of the 
> element but the starting and ending tags are in previous/next blocks. 
> So this block contains only part of the body that we want.In that case 
> it will be parsed only by the readers that are assigned to read the 
> block that contains the starting tag of this element.

Yes, that sounds right.

> On that note, I am currently working on a way to assign only one 
> reader to each block, because hdfs assigns readers according to the 
> available cores of the CPUs you use.That means the same block can be 
> assigned to more than one readers and in our case that can lead to 
> memory problems.

I'm not sure I fully understand the current design. Could you explain in 
a little more detail in which case you see which problem coming up (I 
can imagine a number of problems with memory ...)?

Cheers,
Till

> On 22/05/2015 06:53 πμ, Till Westmann wrote:
>> (1) I agree that [1] looks better (thanks for the diagrams - we 
>> should add them to the docs!).
>> (2) I think that it’s ok to have the restriction, that the given 
>> tag
>>   (a) identifies the root element of the elements that we want to 
>> work with and
>>   (b) is not used recursively (and I would check this condition and 
>> fail if it doesn’t hold).
>>
>> If we have a few really big nodes in the file, we anyway do not have 
>> a way to process them in parallel, so the chosen tags should split 
>> the document into a large number of smaller pieces for VXQuery to 
>> work well.
>>
>> Wrt. to the question what happens if we start reading a block that 
>> does not contain the tag(s) in question (I think that that’s the 
>> last question - please correct me if I’m wrong) it would probably 
>> be read without producing any nodes that will be processed by the 
>> query engine. So the effort to do that would be wasted, but I would 
>> expect that the block would then be parsed again as the continuation 
>> of another block that contained a start tag.
>>
>> Till
>>
>>> On May 21, 2015, at 2:59 PM, Steven Jacobs <sj...@ucr.edu> wrote:
>>>
>>> This seems correct to me. Since our objective in implementing HDFS 
>>> is to
>>> deal with very large XML files, I think we should avoid any size
>>> limitations. Regarding the tags, does anyone have any thoughts on 
>>> this? In
>>> the case of searching for all elements with a given name regardless 
>>> of
>>> depth, this method will work fine, but if we want a specific path, 
>>> we could
>>> end up opening lots of Blocks to guarantee path correctness, the 
>>> entire
>>> file in fact.
>>> Steven
>>>
>>> On Thu, May 21, 2015 at 10:20 AM, Efi <ef...@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> For this week the two different methods for reading complete items
>>>> according to a specific tag are completed and tested in standalone 
>>>> hdfs
>>>> deployment.In detail what each method does:
>>>>
>>>> The first method, I call it One Buffer Method, reads a block, saves 
>>>> it in
>>>> a buffer, and continues reading from the other blocks until it 
>>>> finds a
>>>> specific closing tag.It shows good results and good times in the 
>>>> tests.
>>>>
>>>> The second method, called Shared File Method, reads only the 
>>>> complete
>>>> items contained in the block and the incomplete items from the 
>>>> start and
>>>> end of the block are send to a shared file in the hdfs Distributed 
>>>> Cache.
>>>> Now this method could work only for relatively small inputs, since 
>>>> the
>>>> Distributed Cache is limited and in the case of hundreds/thousands 
>>>> of
>>>> blocks the shared file can exceed the limit.
>>>>
>>>> I took the liberty of creating diagrams that show in example what 
>>>> each
>>>> method does.
>>>> [1] One Buffer Method
>>>> [2] Shared File Method
>>>>
>>>> Every insight and feedback is more than welcome about these two 
>>>> methods.In
>>>> my opinion the One Buffer method is simpler and more effective 
>>>> since it can
>>>> be used for both small and large datasets.
>>>>
>>>> There is also a question, can the parser work on data that are 
>>>> missing
>>>> some tags?For example the first and last tag of the xml file that 
>>>> are
>>>> located in different blocks.
>>>>
>>>> Best regards,
>>>> Efi
>>>>
>>>> [1]
>>>> https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
>>>>
>>>> [2]
>>>> https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
>>>>
>>>>
>>>>
>>>>
>>>> On 05/19/2015 12:43 AM, Michael Carey wrote:
>>>>
>>>>> +1 Sounds great!
>>>>>
>>>>> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>>>>>
>>>>>> Great work!
>>>>>> Steven
>>>>>>
>>>>>> On Sun, May 17, 2015 at 1:15 PM, Efi <ef...@gmail.com> wrote:
>>>>>>
>>>>>> Hello everyone,
>>>>>>> This is my update on what I have been doing this last week:
>>>>>>>
>>>>>>> Created an XMLInputFormat java class with the functionalities 
>>>>>>> that Hamza
>>>>>>> described in the issue [1] .The class reads from blocks located 
>>>>>>> in HDFS
>>>>>>> and
>>>>>>> returns complete items according to a specified xml tag.
>>>>>>> I also tested this class in a standalone hadoop cluster with xml 
>>>>>>> files
>>>>>>> of
>>>>>>> various sizes, the smallest being a single file of 400 MB and 
>>>>>>> the
>>>>>>> largest a
>>>>>>> collection of 5 files totalling 6.1 GB.
>>>>>>>
>>>>>>> This week I will create another implementation of the 
>>>>>>> XMLInputFormat
>>>>>>> with
>>>>>>> a different way of reading and delivering files, the way I 
>>>>>>> described in
>>>>>>> the
>>>>>>> same issue and I will test both solutions in a standalone and a 
>>>>>>> small
>>>>>>> hadoop cluster (5-6 nodes).
>>>>>>>
>>>>>>> You can see this week's results here [2] .I will keep updating 
>>>>>>> this file
>>>>>>> about the other tests.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Efi
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>>>>>> [2]
>>>>>>>
>>>>>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>

Re: [#131]Supporting Hadoop data and cluster management

Posted by Efi <ef...@gmail.com>.
Thank you for the recursive tag check; Steven told me about it 
yesterday as well. I hadn't thought of it so far, but I will think of ways 
to implement it for these methods so it does not create problems.

My question was not exactly that; I was asking whether the query engine 
could parse data that have complete elements but are missing other tags 
from enclosing elements.
For example, data that comes from either of these methods can look 
like this:

<books>
....
<book>
...
</book>

And another one like this:

<book>
....
</book>
...
</books>

The query is about data inside the book element; will these work with 
the query engine?

About your answer for the scenario where a block does not contain the 
tags in question, it can mean two things. Either it is not part of the 
element we want to work with, so we simply ignore it, or it is part of the 
element but the starting and ending tags are in previous/next blocks, so 
this block contains only part of the body that we want. In that case it 
will be parsed only by the readers that are assigned to read the block 
that contains the starting tag of this element.

On that note, I am currently working on a way to assign only one reader 
to each block, because HDFS assigns readers according to the available 
CPU cores you use. That means the same block can be assigned to 
more than one reader, and in our case that can lead to memory problems.

Efi

On 22/05/2015 06:53 πμ, Till Westmann wrote:
> (1) I agree that [1] looks better (thanks for the diagrams - we should add them to the docs!).
> (2) I think that it’s ok to have the restriction, that the given tag
>       (a) identifies the root element of the elements that we want to work with and
>       (b) is not used recursively (and I would check this condition and fail if it doesn’t hold).
>
> If we have a few really big nodes in the file, we anyway do not have a way to process them in parallel, so the chosen tags should split the document into a large number of smaller pieces for VXQuery to work well.
>
> Wrt. to the question what happens if we start reading a block that does not contain the tag(s) in question (I think that that’s the last question - please correct me if I’m wrong) it would probably be read without producing any nodes that will be processed by the query engine. So the effort to do that would be wasted, but I would expect that the block would then be parsed again as the continuation of another block that contained a start tag.
>
> Till
>
>> On May 21, 2015, at 2:59 PM, Steven Jacobs <sj...@ucr.edu> wrote:
>>
>> This seems correct to me. Since our objective in implementing HDFS is to
>> deal with very large XML files, I think we should avoid any size
>> limitations. Regarding the tags, does anyone have any thoughts on this? In
>> the case of searching for all elements with a given name regardless of
>> depth, this method will work fine, but if we want a specific path, we could
>> end up opening lots of Blocks to guarantee path correctness, the entire
>> file in fact.
>> Steven
>>
>> On Thu, May 21, 2015 at 10:20 AM, Efi <ef...@gmail.com> wrote:
>>
>>> Hello everyone,
>>>
>>> For this week the two different methods for reading complete items
>>> according to a specific tag are completed and tested in standalone hdfs
>>> deployment.In detail what each method does:
>>>
>>> The first method, I call it One Buffer Method, reads a block, saves it in
>>> a buffer, and continues reading from the other blocks until it finds a
>>> specific closing tag.It shows good results and good times in the tests.
>>>
>>> The second method, called Shared File Method, reads only the complete
>>> items contained in the block and the incomplete items from the start and
>>> end of the block are send to a shared file in the hdfs Distributed Cache.
>>> Now this method could work only for relatively small inputs, since the
>>> Distributed Cache is limited and in the case of hundreds/thousands of
>>> blocks the shared file can exceed the limit.
>>>
>>> I took the liberty of creating diagrams that show in example what each
>>> method does.
>>> [1] One Buffer Method
>>> [2] Shared File Method
>>>
>>> Every insight and feedback is more than welcome about these two methods.In
>>> my opinion the One Buffer method is simpler and more effective since it can
>>> be used for both small and large datasets.
>>>
>>> There is also a question, can the parser work on data that are missing
>>> some tags?For example the first and last tag of the xml file that are
>>> located in different blocks.
>>>
>>> Best regards,
>>> Efi
>>>
>>> [1]
>>> https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
>>>
>>> [2]
>>> https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
>>>
>>>
>>>
>>>
>>> On 05/19/2015 12:43 AM, Michael Carey wrote:
>>>
>>>> +1 Sounds great!
>>>>
>>>> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>>>>
>>>>> Great work!
>>>>> Steven
>>>>>
>>>>> On Sun, May 17, 2015 at 1:15 PM, Efi <ef...@gmail.com> wrote:
>>>>>
>>>>> Hello everyone,
>>>>>> This is my update on what I have been doing this last week:
>>>>>>
>>>>>> Created an XMLInputFormat java class with the functionalities that Hamza
>>>>>> described in the issue [1] .The class reads from blocks located in HDFS
>>>>>> and
>>>>>> returns complete items according to a specified xml tag.
>>>>>> I also tested this class in a standalone hadoop cluster with xml files
>>>>>> of
>>>>>> various sizes, the smallest being a single file of 400 MB and the
>>>>>> largest a
>>>>>> collection of 5 files totalling 6.1 GB.
>>>>>>
>>>>>> This week I will create another implementation of the XMLInputFormat
>>>>>> with
>>>>>> a different way of reading and delivering files, the way I described in
>>>>>> the
>>>>>> same issue and I will test both solutions in a standalone and a small
>>>>>> hadoop cluster (5-6 nodes).
>>>>>>
>>>>>> You can see this week's results here [2] .I will keep updating this file
>>>>>> about the other tests.
>>>>>>
>>>>>> Best regards,
>>>>>> Efi
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>>>>> [2]
>>>>>>
>>>>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
>>>>>>
>>>>>>
>>>>>>
>>>>


Re: [#131]Supporting Hadoop data and cluster management

Posted by Till Westmann <ti...@apache.org>.
(1) I agree that [1] looks better (thanks for the diagrams - we should add them to the docs!).
(2) I think that it’s ok to have the restriction, that the given tag
     (a) identifies the root element of the elements that we want to work with and
     (b) is not used recursively (and I would check this condition and fail if it doesn’t hold).

If we have a few really big nodes in the file, we anyway do not have a way to process them in parallel, so the chosen tags should split the document into a large number of smaller pieces for VXQuery to work well. 

Wrt. the question of what happens if we start reading a block that does not contain the tag(s) in question (I think that that’s the last question - please correct me if I’m wrong): it would probably be read without producing any nodes that will be processed by the query engine. So the effort to do that would be wasted, but I would expect that the block would then be parsed again as the continuation of another block that contained a start tag.

Till

> On May 21, 2015, at 2:59 PM, Steven Jacobs <sj...@ucr.edu> wrote:
> 
> This seems correct to me. Since our objective in implementing HDFS is to
> deal with very large XML files, I think we should avoid any size
> limitations. Regarding the tags, does anyone have any thoughts on this? In
> the case of searching for all elements with a given name regardless of
> depth, this method will work fine, but if we want a specific path, we could
> end up opening lots of Blocks to guarantee path correctness, the entire
> file in fact.
> Steven
> 
> On Thu, May 21, 2015 at 10:20 AM, Efi <ef...@gmail.com> wrote:
> 
>> Hello everyone,
>> 
>> For this week the two different methods for reading complete items
>> according to a specific tag are completed and tested in standalone hdfs
>> deployment.In detail what each method does:
>> 
>> The first method, I call it One Buffer Method, reads a block, saves it in
>> a buffer, and continues reading from the other blocks until it finds a
>> specific closing tag.It shows good results and good times in the tests.
>> 
>> The second method, called Shared File Method, reads only the complete
>> items contained in the block and the incomplete items from the start and
>> end of the block are send to a shared file in the hdfs Distributed Cache.
>> Now this method could work only for relatively small inputs, since the
>> Distributed Cache is limited and in the case of hundreds/thousands of
>> blocks the shared file can exceed the limit.
>> 
>> I took the liberty of creating diagrams that show in example what each
>> method does.
>> [1] One Buffer Method
>> [2] Shared File Method
>> 
>> Every insight and feedback is more than welcome about these two methods.In
>> my opinion the One Buffer method is simpler and more effective since it can
>> be used for both small and large datasets.
>> 
>> There is also a question, can the parser work on data that are missing
>> some tags?For example the first and last tag of the xml file that are
>> located in different blocks.
>> 
>> Best regards,
>> Efi
>> 
>> [1]
>> https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
>> 
>> [2]
>> https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
>> 
>> 
>> 
>> 
>> On 05/19/2015 12:43 AM, Michael Carey wrote:
>> 
>>> +1 Sounds great!
>>> 
>>> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>>> 
>>>> Great work!
>>>> Steven
>>>> 
>>>> On Sun, May 17, 2015 at 1:15 PM, Efi <ef...@gmail.com> wrote:
>>>> 
>>>> Hello everyone,
>>>>> 
>>>>> This is my update on what I have been doing this last week:
>>>>> 
>>>>> Created an XMLInputFormat java class with the functionalities that Hamza
>>>>> described in the issue [1] .The class reads from blocks located in HDFS
>>>>> and
>>>>> returns complete items according to a specified xml tag.
>>>>> I also tested this class in a standalone hadoop cluster with xml files
>>>>> of
>>>>> various sizes, the smallest being a single file of 400 MB and the
>>>>> largest a
>>>>> collection of 5 files totalling 6.1 GB.
>>>>> 
>>>>> This week I will create another implementation of the XMLInputFormat
>>>>> with
>>>>> a different way of reading and delivering files, the way I described in
>>>>> the
>>>>> same issue and I will test both solutions in a standalone and a small
>>>>> hadoop cluster (5-6 nodes).
>>>>> 
>>>>> You can see this week's results here [2] .I will keep updating this file
>>>>> about the other tests.
>>>>> 
>>>>> Best regards,
>>>>> Efi
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>>>> [2]
>>>>> 
>>>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 


Re: [#131]Supporting Hadoop data and cluster management

Posted by Steven Jacobs <sj...@ucr.edu>.
This seems correct to me. Since our objective in implementing HDFS is to
deal with very large XML files, I think we should avoid any size
limitations. Regarding the tags, does anyone have any thoughts on this? In
the case of searching for all elements with a given name regardless of
depth, this method will work fine, but if we want a specific path, we could
end up opening lots of Blocks to guarantee path correctness, the entire
file in fact.
Steven

On Thu, May 21, 2015 at 10:20 AM, Efi <ef...@gmail.com> wrote:

> Hello everyone,
>
> For this week the two different methods for reading complete items
> according to a specific tag are completed and tested in standalone hdfs
> deployment.In detail what each method does:
>
> The first method, I call it One Buffer Method, reads a block, saves it in
> a buffer, and continues reading from the other blocks until it finds a
> specific closing tag.It shows good results and good times in the tests.
>
> The second method, called Shared File Method, reads only the complete
> items contained in the block and the incomplete items from the start and
> end of the block are send to a shared file in the hdfs Distributed Cache.
> Now this method could work only for relatively small inputs, since the
> Distributed Cache is limited and in the case of hundreds/thousands of
> blocks the shared file can exceed the limit.
>
> I took the liberty of creating diagrams that show in example what each
> method does.
> [1] One Buffer Method
> [2] Shared File Method
>
> Every insight and feedback is more than welcome about these two methods.In
> my opinion the One Buffer method is simpler and more effective since it can
> be used for both small and large datasets.
>
> There is also a question, can the parser work on data that are missing
> some tags?For example the first and last tag of the xml file that are
> located in different blocks.
>
> Best regards,
> Efi
>
> [1]
> https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
>
> [2]
> https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
>
>
>
>
> On 05/19/2015 12:43 AM, Michael Carey wrote:
>
>> +1 Sounds great!
>>
>> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>>
>>> Great work!
>>> Steven
>>>
>>> On Sun, May 17, 2015 at 1:15 PM, Efi <ef...@gmail.com> wrote:
>>>
>>>  Hello everyone,
>>>>
>>>> This is my update on what I have been doing this last week:
>>>>
>>>> Created an XMLInputFormat java class with the functionalities that Hamza
>>>> described in the issue [1] .The class reads from blocks located in HDFS
>>>> and
>>>> returns complete items according to a specified xml tag.
>>>> I also tested this class in a standalone hadoop cluster with xml files
>>>> of
>>>> various sizes, the smallest being a single file of 400 MB and the
>>>> largest a
>>>> collection of 5 files totalling 6.1 GB.
>>>>
>>>> This week I will create another implementation of the XMLInputFormat
>>>> with
>>>> a different way of reading and delivering files, the way I described in
>>>> the
>>>> same issue and I will test both solutions in a standalone and a small
>>>> hadoop cluster (5-6 nodes).
>>>>
>>>> You can see this week's results here [2] .I will keep updating this file
>>>> about the other tests.
>>>>
>>>> Best regards,
>>>> Efi
>>>>
>>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>>> [2]
>>>>
>>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
>>>>
>>>>
>>>>
>>
>>
>

Re: [#131]Supporting Hadoop data and cluster management

Posted by Efi <ef...@gmail.com>.
Hello everyone,

For this week, the two different methods for reading complete items 
according to a specific tag are completed and tested in a standalone HDFS 
deployment. In detail, here is what each method does:

The first method, which I call the One Buffer Method, reads a block, saves 
it in a buffer, and continues reading from the following blocks until it 
finds the specific closing tag. It shows good results and good times in 
the tests.

The second method, called the Shared File Method, reads only the complete 
items contained in the block; the incomplete items from the start and 
end of the block are sent to a shared file in the HDFS Distributed 
Cache. Now, this method could work only for relatively small inputs, 
since the Distributed Cache is limited and, in the case of 
hundreds/thousands of blocks, the shared file can exceed the limit.
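
To make the One Buffer idea concrete, here is a simplified sketch of the 
block-boundary handling (illustrative only, not the tested implementation): 
buffer the reader's own block, then keep reading from the following 
block(s) just far enough to reach the closing tag of the item that is 
still open.

import java.io.IOException;
import java.io.InputStream;

public class OneBufferSketch {

    public static String readBlockAndTail(InputStream in, long blockLength,
                                          String closingTag) throws IOException {
        StringBuilder buffer = new StringBuilder();
        long read = 0;
        int c;
        while ((c = in.read()) != -1) {
            buffer.append((char) c);
            read++;
            // Once past our own block, stop at the first completed closing tag.
            // (A real reader also tracks whether an item is actually open, and
            // a non-first reader skips the partial item at its block start.)
            if (read >= blockLength && endsWith(buffer, closingTag)) {
                break;
            }
        }
        return buffer.toString();
    }

    private static boolean endsWith(StringBuilder sb, String suffix) {
        int start = sb.length() - suffix.length();
        return start >= 0 && sb.indexOf(suffix, start) == start;
    }
}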

I took the liberty of creating diagrams that show by example what each 
method does.
[1] One Buffer Method
[2] Shared File Method

Any insight and feedback about these two methods is more than welcome. In 
my opinion the One Buffer Method is simpler and more effective, since it 
can be used for both small and large datasets.

There is also a question: can the parser work on data that are missing 
some tags? For example, the first and last tags of the XML file, which are 
located in different blocks.

Best regards,
Efi

[1] 
https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing

[2] 
https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing



On 05/19/2015 12:43 AM, Michael Carey wrote:
> +1 Sounds great!
>
> On 5/18/15 8:33 AM, Steven Jacobs wrote:
>> Great work!
>> Steven
>>
>> On Sun, May 17, 2015 at 1:15 PM, Efi <ef...@gmail.com> wrote:
>>
>>> Hello everyone,
>>>
>>> This is my update on what I have been doing this last week:
>>>
>>> Created an XMLInputFormat java class with the functionalities that 
>>> Hamza
>>> described in the issue [1] .The class reads from blocks located in 
>>> HDFS and
>>> returns complete items according to a specified xml tag.
>>> I also tested this class in a standalone hadoop cluster with xml 
>>> files of
>>> various sizes, the smallest being a single file of 400 MB and the 
>>> largest a
>>> collection of 5 files totalling 6.1 GB.
>>>
>>> This week I will create another implementation of the XMLInputFormat 
>>> with
>>> a different way of reading and delivering files, the way I described 
>>> in the
>>> same issue and I will test both solutions in a standalone and a small
>>> hadoop cluster (5-6 nodes).
>>>
>>> You can see this week's results here [2] .I will keep updating this 
>>> file
>>> about the other tests.
>>>
>>> Best regards,
>>> Efi
>>>
>>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>>> [2]
>>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing 
>>>
>>>
>>>
>
>


Re: [#131]Supporting Hadoop data and cluster management

Posted by Michael Carey <mj...@ics.uci.edu>.
+1 Sounds great!

On 5/18/15 8:33 AM, Steven Jacobs wrote:
> Great work!
> Steven
>
> On Sun, May 17, 2015 at 1:15 PM, Efi <ef...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> This is my update on what I have been doing this last week:
>>
>> Created an XMLInputFormat java class with the functionalities that Hamza
>> described in the issue [1] .The class reads from blocks located in HDFS and
>> returns complete items according to a specified xml tag.
>> I also tested this class in a standalone hadoop cluster with xml files of
>> various sizes, the smallest being a single file of 400 MB and the largest a
>> collection of 5 files totalling 6.1 GB.
>>
>> This week I will create another implementation of the XMLInputFormat with
>> a different way of reading and delivering files, the way I described in the
>> same issue and I will test both solutions in a standalone and a small
>> hadoop cluster (5-6 nodes).
>>
>> You can see this week's results here [2] .I will keep updating this file
>> about the other tests.
>>
>> Best regards,
>> Efi
>>
>> [1] https://issues.apache.org/jira/browse/VXQUERY-131
>> [2]
>> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
>>
>>


Re: [#131]Supporting Hadoop data and cluster management

Posted by Steven Jacobs <sj...@ucr.edu>.
Great work!
Steven

On Sun, May 17, 2015 at 1:15 PM, Efi <ef...@gmail.com> wrote:

> Hello everyone,
>
> This is my update on what I have been doing this last week:
>
> Created an XMLInputFormat java class with the functionalities that Hamza
> described in the issue [1] .The class reads from blocks located in HDFS and
> returns complete items according to a specified xml tag.
> I also tested this class in a standalone hadoop cluster with xml files of
> various sizes, the smallest being a single file of 400 MB and the largest a
> collection of 5 files totalling 6.1 GB.
>
> This week I will create another implementation of the XMLInputFormat with
> a different way of reading and delivering files, the way I described in the
> same issue and I will test both solutions in a standalone and a small
> hadoop cluster (5-6 nodes).
>
> You can see this week's results here [2] .I will keep updating this file
> about the other tests.
>
> Best regards,
> Efi
>
> [1] https://issues.apache.org/jira/browse/VXQUERY-131
> [2]
> https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
>
>