Posted to issues@drill.apache.org by "Magnus Pierre (JIRA)" <ji...@apache.org> on 2015/11/05 11:04:27 UTC

[jira] [Commented] (DRILL-3878) Support XML Querying (selects/projections, no writing)

    [ https://issues.apache.org/jira/browse/DRILL-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991442#comment-14991442 ] 

Magnus Pierre commented on DRILL-3878:
--------------------------------------

Hello,
I have a simple implementation of a format converter that converts XML to JSON and runs it through Drill's JSONRecordReader, which works fine for the test data I have available. The concept works well and the performance is decent, but the converter builds the complete JSON document in memory before handing it over to the JSONRecordReader, which is an issue for larger documents. Currently I am using a home-grown SAX parser that builds the JSON document using the org.json classes. However, there are DOM variants that can also do XSD validation and so on. To plug directly into JSONRecordReader without duplicating its code, embeddedInfo, hadoopPath, and stream either need to be changed from private to protected, or getters and setters need to be provided.
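
To make the approach concrete, here is a minimal sketch of a SAX-based XML-to-JSON conversion. It is not the actual converter: the class name XmlToJsonSketch, the "@" prefix for attributes, and the "#text" key are illustrative assumptions, and plain java.util maps stand in for the org.json classes so the example has no external dependencies. Like the converter described above, it holds the complete document in memory before producing any JSON.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: SAX events -> nested maps -> JSON text.
class XmlToJsonSketch {

    /** Collects SAX events into nested maps; the whole tree stays in memory. */
    static class TreeHandler extends DefaultHandler {
        final Map<String, Object> root = new LinkedHashMap<>();
        private final Deque<Map<String, Object>> nodes = new ArrayDeque<>();
        private final Deque<StringBuilder> text = new ArrayDeque<>();

        @Override public void startDocument() { nodes.push(root); }

        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            Map<String, Object> node = new LinkedHashMap<>();
            for (int i = 0; i < atts.getLength(); i++) {
                node.put("@" + atts.getQName(i), atts.getValue(i)); // assumed attribute convention
            }
            nodes.push(node);
            text.push(new StringBuilder());
        }

        @Override public void characters(char[] ch, int start, int length) {
            if (!text.isEmpty()) text.peek().append(ch, start, length);
        }

        @Override public void endElement(String uri, String local, String qName) {
            Map<String, Object> node = nodes.pop();
            String body = text.pop().toString().trim();
            Object value = body;                 // element with no attributes or children -> string
            if (!node.isEmpty()) {
                if (!body.isEmpty()) node.put("#text", body);
                value = node;
            }
            add(nodes.peek(), qName, value);
        }

        /** Repeated sibling names are collected into a list, i.e. a JSON array. */
        @SuppressWarnings("unchecked")
        private static void add(Map<String, Object> parent, String key, Object value) {
            Object existing = parent.get(key);
            if (existing == null) {
                parent.put(key, value);
            } else if (existing instanceof List) {
                ((List<Object>) existing).add(value);
            } else {
                List<Object> list = new ArrayList<>();
                list.add(existing);
                list.add(value);
                parent.put(key, list);
            }
        }
    }

    /** Parses the XML and returns the JSON text; wraps parser failures in an unchecked exception. */
    static String convert(String xml) {
        TreeHandler handler = new TreeHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        StringBuilder out = new StringBuilder();
        write(handler.root, out);
        return out.toString();
    }

    private static void write(Object value, StringBuilder out) {
        if (value instanceof Map) {
            out.append('{');
            String sep = "";
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                out.append(sep);
                quote(e.getKey().toString(), out);
                out.append(':');
                write(e.getValue(), out);
                sep = ",";
            }
            out.append('}');
        } else if (value instanceof List) {
            out.append('[');
            String sep = "";
            for (Object item : (List<?>) value) {
                out.append(sep);
                write(item, out);
                sep = ",";
            }
            out.append(']');
        } else {
            quote(value.toString(), out);
        }
    }

    private static void quote(String s, StringBuilder out) {
        out.append('"');
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') out.append('\\').append(c);
            else if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        out.append('"');
    }

    public static void main(String[] args) {
        System.out.println(convert("<book id=\"1\"><title>A</title></book>"));
    }
}
```

Because the tree is only serialized once the end of the document is reached, memory use grows with document size, which is the limitation mentioned above for larger documents.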

Regarding XSDs, I am considering whether the dfs configuration could take an additional option per workspace for the XML file type: a list/array of XSDs. Any document in that workspace would have to adhere to the listed XSDs; otherwise it would not be considered by Drill.
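
As a rough illustration of the XSD check, the sketch below uses the standard javax.xml.validation API to decide whether a document adheres to a list of schemas. The class name XsdFilterSketch and the method conformsTo are hypothetical, and the schemas are passed as strings only to keep the example self-contained; how the list would actually be read from a workspace configuration is left open.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical sketch: accept a document only if it validates against the given XSDs.
class XsdFilterSketch {

    /** Returns true when the XML validates against the combined schemas, false otherwise. */
    static boolean conformsTo(String xml, List<String> xsds) {
        Schema schema;
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Source[] sources = xsds.stream()
                    .map(xsd -> new StreamSource(new StringReader(xsd)))
                    .toArray(Source[]::new);
            schema = factory.newSchema(sources);
        } catch (SAXException e) {
            // A broken schema is a configuration error, not a non-conforming document.
            throw new IllegalArgumentException("invalid XSD", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document does not adhere to the schemas
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xs:element name=\"book\"><xs:complexType><xs:sequence>"
                + "<xs:element name=\"title\" type=\"xs:string\"/>"
                + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        System.out.println(conformsTo("<book><title>A</title></book>", List.of(xsd)));
        System.out.println(conformsTo("<book><author>A</author></book>", List.of(xsd)));
    }
}
```

With a check like this, a reader could skip any file for which conformsTo returns false, which matches the idea that non-conforming documents are simply not considered.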

I will fill in the design document, but I believe adding the information in the JIRA itself makes it more visible to other people in the community.

Best regards,
Magnus

> Support XML Querying (selects/projections, no writing)
> ------------------------------------------------------
>
>                 Key: DRILL-3878
>                 URL: https://issues.apache.org/jira/browse/DRILL-3878
>             Project: Apache Drill
>          Issue Type: New Feature
>    Affects Versions: Future
>            Reporter: Edmon Begoli
>              Labels: features
>             Fix For: Future
>
>   Original Estimate: 3,360h
>  Remaining Estimate: 3,360h
>
> Support querying of XML documents (as read-only selects;
> writing should be implemented as a different feature that brings its own set of challenges).
> To be considered: reading of trivial, schema-less XML documents, DTD-oriented ones, and schema-defined ones.
> Also, we should consider direct querying vs. using converter tools to change the representation from XML to JSON, CSV, etc.
> Design and Implementation discussion, notes, ideas and implementation suggestions should be captured here:
> https://docs.google.com/document/d/1oS-cObSaTlAmuW_XghDLmHbBEorLl0z-axaHnjy7vg0/edit?usp=sharing 
> (no vandalism, please)


