Posted to dev@vxquery.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2016/05/18 21:59:13 UTC

[jira] [Commented] (VXQUERY-131) Supporting Hadoop and Yarn

    [ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289932#comment-15289932 ] 

ASF subversion and git services commented on VXQUERY-131:
---------------------------------------------------------

Commit 3fc2d6c2f74083513f5a45ccdef2bde8ad23d9a8 in vxquery's branch refs/heads/master from [~sjaco002]
[ https://git-wip-us.apache.org/repos/asf?p=vxquery.git;h=3fc2d6c ]

VXQUERY-131: Support for reading HDFS XML files.

1. User can choose between local and HDFS file systems for data input
2. New collection-with-tag function that reads from HDFS by blocks
3. Collection and Document can both read HDFS
4. Custom InputFormatClass for XML in HDFS
5. Parsing of data from HDFS as whole files
6. Unit tests for HDFS
7. MiniDFS cluster for unit tests
8. Documentation
9. Allow user to pass HDFS config folder as a param to queries
Author: Efi Kaltirimidou github: efikalti
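
Item 2 above (reading tagged elements "by blocks") can be illustrated with a minimal, self-contained sketch. This is not the code from the commit: the class name TagSplitSketch, the method elementsInSplit, and the use of an in-memory String with character offsets in place of HDFS block byte ranges are all assumptions made for illustration. It shows one common convention for splitting XML across blocks: a reader owns every element whose opening tag starts inside its split, and it reads past the split end to finish the last element, so no element is lost or read twice. Nested elements with the same tag name and self-closing tags are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class TagSplitSketch {

    // Hypothetical helper: returns every <tag>...</tag> element whose opening
    // tag starts in [splitStart, splitEnd). An element that begins inside the
    // split but ends after splitEnd is still returned whole.
    static List<String> elementsInSplit(String data, int splitStart, int splitEnd, String tag) {
        String open = "<" + tag;
        String close = "</" + tag + ">";
        List<String> out = new ArrayList<>();
        int pos = splitStart;
        while (pos < splitEnd) {
            int begin = data.indexOf(open, pos);
            if (begin < 0 || begin >= splitEnd) {
                break; // no further opening tag belongs to this split
            }
            int after = begin + open.length();
            // Skip longer names such as <books> when looking for <book>.
            if (after >= data.length()
                    || (data.charAt(after) != '>' && !Character.isWhitespace(data.charAt(after)))) {
                pos = after;
                continue;
            }
            int end = data.indexOf(close, after);
            if (end < 0) {
                break; // unterminated element: stop rather than guess
            }
            end += close.length();
            out.add(data.substring(begin, end));
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<books><book id=\"1\">A</book><book id=\"2\">B</book>"
                + "<book id=\"3\">C</book></books>";
        int mid = xml.length() / 2; // falls in the middle of the second <book>
        System.out.println(elementsInSplit(xml, 0, mid, "book"));
        System.out.println(elementsInSplit(xml, mid, xml.length(), "book"));
    }
}
```

In this example the split boundary falls inside the second book element. The first call returns it whole and the second call skips it, so the two splits together yield each element exactly once.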


> Supporting Hadoop and Yarn
> --------------------------
>
>                 Key: VXQUERY-131
>                 URL: https://issues.apache.org/jira/browse/VXQUERY-131
>             Project: VXQuery
>          Issue Type: Improvement
>            Reporter: Preston Carman
>            Assignee: Steven Jacobs
>              Labels: gsoc, gsoc2015, hadoop, java, mentor, xml
>
> Many organizations support Hadoop. It would be nice to be able to read data from this source. The project will include creating a strategy (with the mentor's guidance) for reading XML data from HDFS and implementing it. When connecting VXQuery to HDFS, the strategy may need to consider how to read sections of an XML file. 
> We could use YARN as our cluster manager. Apache Hadoop YARN (Yet Another Resource Negotiator) would be a good cluster management tool for VXQuery. If VXQuery can read data from HDFS, then why not also manage the cluster with a tool provided by Hadoop? The solution would replace the current custom Python scripts for cluster management.
> Goal
> - Read XML from HDFS
> - Manage cluster with YARN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)