Posted to issues@drill.apache.org by "Jacques Nadeau (JIRA)" <ji...@apache.org> on 2015/11/17 22:31:11 UTC

[jira] [Comment Edited] (DRILL-3423) Add New HTTPD format plugin

    [ https://issues.apache.org/jira/browse/DRILL-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009557#comment-15009557 ] 

Jacques Nadeau edited comment on DRILL-3423 at 11/17/15 9:30 PM:
-----------------------------------------------------------------

Here is my alternative proposal: 

With the log format above: 
"%h %t \"%r\" %>s %b \"%{Referer}i\""

I propose that a user gets the following fields (in order):

remote_host (varchar)
request_receive_time (drill timestamp)
request_method (varchar)
request_uri (varchar)
response_status (int)
response_bytes (bigint)
header_referer
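
To make the field list above concrete, here is a rough sketch in Python (not Drill code) of how one line in that log format could be split into the proposed columns. The regular expression, the `parse_line` helper, and the sample line are all made up for illustration; the actual plugin would rely on the logparser library rather than a hand-written regex.

```python
import re
from datetime import datetime

# Hypothetical regex for the format "%h %t \"%r\" %>s %b \"%{Referer}i\"".
LOG_LINE = re.compile(
    r'(?P<remote_host>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request_method>\S+) (?P<request_uri>\S+)(?: \S+)?" '
    r'(?P<response_status>\d{3}) '
    r'(?P<response_bytes>\d+|-) '
    r'"(?P<header_referer>[^"]*)"'
)

def parse_line(line):
    """Return the proposed fields, in order, or None if the line does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    raw_bytes = m.group("response_bytes")
    return {
        "remote_host": m.group("remote_host"),
        # One timestamp value instead of many separate date-part fields.
        "request_receive_time": datetime.strptime(
            m.group("time"), "%d/%b/%Y:%H:%M:%S %z"
        ),
        "request_method": m.group("request_method"),
        "request_uri": m.group("request_uri"),
        "response_status": int(m.group("response_status")),
        # %b logs "-" when no bytes were sent, so treat it as 0.
        "response_bytes": 0 if raw_bytes == "-" else int(raw_bytes),
        "header_referer": m.group("header_referer"),
    }

# Invented sample line, for illustration only.
sample = (
    '127.0.0.1 [17/Nov/2015:21:30:00 +0000] '
    '"GET /index.html?a=1 HTTP/1.1" 200 1234 "http://example.com/"'
)
row = parse_line(sample)
```

Note that the sketch drops the protocol part of `%r`, since the field list above names only the method and the URI.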

Additionally, I think we should provide two new functions: 

parse_url(varchar url)
parse_url_query(varchar querystring, varchar pairDelimiter, varchar keyValueDelimiter)

parse_url(varchar) would provide an output of map type similar to: 
{code}
{
  protocol: ...,
  user: ...,
  password: ...,
  host: ...,
  port: ...,
  path: ...,
  query: ...,
  fragment: ...
}
{code}
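
As an illustration of the map shape above, here is a minimal Python sketch built on the standard library's `urllib.parse.urlsplit`. It only shows the intended output; it is not a proposed Drill implementation, and returning `None` for absent components is an assumption of this sketch.

```python
from urllib.parse import urlsplit

def parse_url(url):
    """Split a URL into the map described above (illustrative sketch only)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme or None,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path or None,
        "query": parts.query or None,
        "fragment": parts.fragment or None,
    }

full = parse_url("https://bob:secret@example.com:8443/a/b?x=1&y=2#top")
# A request_uri from a log line is usually relative, so most keys come back empty.
relative = parse_url("/index.html?a=1")
```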

parse_url_query(...) would return an array of key/value pairs:
{code}
[
  {key: "...", value: "..."},
  {key: "...", value: "..."},
  {key: "...", value: "..."},
  {key: "...", value: "..."}
]
{code}
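
A small Python sketch of the behaviour described for parse_url_query follows. The handling of a key without a value (mapped to `None`), the skipping of empty segments, and the absence of percent-decoding are assumptions made for this example, not part of the proposal.

```python
def parse_url_query(querystring, pair_delimiter, key_value_delimiter):
    """Return a list of {key, value} maps, one per pair (illustrative sketch only)."""
    pairs = []
    for segment in querystring.split(pair_delimiter):
        if not segment:
            # Skip empty segments such as the one produced by a trailing delimiter.
            continue
        key, sep, value = segment.partition(key_value_delimiter)
        # Assumption: a key with no key/value delimiter gets a null value.
        pairs.append({"key": key, "value": value if sep else None})
    return pairs

pairs = parse_url_query("a=1&b=2&flag", "&", "=")
```

Taking the two delimiters as explicit arguments lets the same function handle non-standard query strings, for example ones that separate pairs with `;`.
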
In response to your proposal: I don't think it makes sense to return many fields for a date field. Drill already provides functionality to get parts of a date. I also don't think it makes sense to prefix a field with its datatype; we don't do that anywhere else in Drill. We should also expose parsing as an optional behavior in Drill. Note also that my proposal substantially reduces the number of fields exposed to the user. I think this proposal has much better usability in the context of SQL.

If you want to take advantage of the underlying format's capabilities, you can treat that as a pushdown of a particular function (date part or the URL parsing functions above).






> Add New HTTPD format plugin
> ---------------------------
>
>                 Key: DRILL-3423
>                 URL: https://issues.apache.org/jira/browse/DRILL-3423
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - Other
>            Reporter: Jacques Nadeau
>            Assignee: Jim Scott
>             Fix For: 1.4.0
>
>
> Add an HTTPD logparser-based format plugin.  The author has been kind enough to move the logparser project to be released under the Apache License.  It can be found here:
> <dependency>
>     <groupId>nl.basjes.parse.httpdlog</groupId>
>     <artifactId>httpdlog-parser</artifactId>
>     <version>2.0</version>
> </dependency>
>  


