Posted to dev@hive.apache.org by "Suresh Antony (JIRA)" <ji...@apache.org> on 2009/06/16 02:56:07 UTC

[jira] Created: (HIVE-563) UDF for parsing the URL

UDF for parsing the URL
-----------------------

                 Key: HIVE-563
                 URL: https://issues.apache.org/jira/browse/HIVE-563
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Server Infrastructure
            Reporter: Suresh Antony
            Assignee: Suresh Antony


Need a UDF to extract the parts of a URL from a URL string.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-563) UDF for parsing the URL

Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Antony updated HIVE-563:
-------------------------------

    Attachment: patch_563.txt.1

 UDF to extract specific parts from URL
 parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') will return 'facebook.com'
 parse_url('http://facebook.com/path/p1.php?query=1', 'PATH') will return '/path/p1.php'
 parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY') will return 'query=1'
 parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'REF') will return 'Ref'
 parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'PROTOCOL') will return 'http'
 Possible values are HOST,PATH,QUERY,REF,PROTOCOL,AUTHORITY,FILE,USERINFO
 You can also get the value of a particular key in QUERY using the syntax QUERY:<KEY_NAME>, e.g. QUERY:k1.
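The part names listed above line up with the getters on java.net.URL. The following is a minimal sketch of that mapping, not the code from the attached patch: the class name ParseUrlSketch and the method name parseUrl are hypothetical, and returning null for a malformed URL or an unknown part name is an assumption. The QUERY:<KEY_NAME> form is left out here.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch; names and null-handling are assumptions, not the patch.
final class ParseUrlSketch {
    private ParseUrlSketch() {}

    /** Returns the requested part of urlStr, or null if the URL or part name is not usable. */
    static String parseUrl(String urlStr, String partToExtract) {
        if (urlStr == null || partToExtract == null) {
            return null;
        }
        URL url;
        try {
            url = new URL(urlStr);
        } catch (MalformedURLException e) {
            return null; // assumption: malformed input yields null
        }
        // Each part name corresponds to one java.net.URL getter.
        switch (partToExtract) {
            case "HOST":      return url.getHost();
            case "PATH":      return url.getPath();
            case "QUERY":     return url.getQuery();
            case "REF":       return url.getRef();
            case "PROTOCOL":  return url.getProtocol();
            case "AUTHORITY": return url.getAuthority();
            case "FILE":      return url.getFile();
            case "USERINFO":  return url.getUserInfo();
            default:          return null; // assumption: unknown part yields null
        }
    }
}
```

Note that FILE, as defined by java.net.URL, is the path plus the query string, and USERINFO is null when the URL carries no user information.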

> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>         Attachments: patch_563.txt
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Commented: (HIVE-563) UDF for parsing the URL

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719889#action_12719889 ] 

Zheng Shao commented on HIVE-563:
---------------------------------

@patch_563.txt:
Inadvertent change to HiveHistory.java?

Can you test all possible values for HOST,PATH,QUERY,REF,PROTOCOL,AUTHORITY,FILE,USERINFO?
The test case not only serves as a test case but also an example for users.


> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>         Attachments: patch_563.txt
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Updated: (HIVE-563) UDF for parsing the URL

Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Antony updated HIVE-563:
-------------------------------

    Attachment: patch_563.txt

parse_url -- UDF
 Format:
  parse_url(url, URL_PART_NAME)
Possible URL parts are: HOST,PATH,QUERY,REF,PROTOCOL,AUTHORITY,FILE,USERINFO
Example:
parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') will return 'facebook.com'
parse_url('http://facebook.com/path/p1.php?query=1', 'PATH') will return '/path/p1.php'

Definition of parts can be obtained from:
http://www.j2ee.me/j2se/1.4.2/docs/api/java/net/URL.html

> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>         Attachments: patch_563.txt
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Updated: (HIVE-563) UDF for parsing the URL

Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Antony updated HIVE-563:
-------------------------------

    Attachment: patch_563.txt.2

Removed String.split().
Added a second evaluate method where the user can specify the 'QUERY' key as a separate argument.
e.g.:
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k2')
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1')
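A rough sketch of what a pair of overloaded evaluate methods could look like, with the three-argument form taking the query key separately. This is not the code from patch_563.txt.2: the class name ParseUrlWithKey is hypothetical, only a few parts are handled to keep it short, and the regular-expression lookup is an assumption chosen for brevity (the patch only states that String.split() was removed, not what replaced it).

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the patch's implementation.
final class ParseUrlWithKey {
    private ParseUrlWithKey() {}

    /** Two-argument form: returns a whole URL part, or null. */
    static String evaluate(String urlStr, String partToExtract) {
        if (urlStr == null || partToExtract == null) {
            return null;
        }
        URL url;
        try {
            url = new URL(urlStr);
        } catch (MalformedURLException e) {
            return null;
        }
        switch (partToExtract) {
            case "HOST":  return url.getHost();
            case "PATH":  return url.getPath();
            case "QUERY": return url.getQuery();
            case "REF":   return url.getRef();
            default:      return null;
        }
    }

    /** Three-argument form: returns the value of one key inside the QUERY part. */
    static String evaluate(String urlStr, String partToExtract, String key) {
        if (key == null || !"QUERY".equals(partToExtract)) {
            return null; // assumption: a key is only meaningful for QUERY
        }
        String query = evaluate(urlStr, "QUERY");
        if (query == null) {
            return null;
        }
        // Match "key=value" at the start of the query or right after an '&'.
        Pattern p = Pattern.compile("(?:^|&)" + Pattern.quote(key) + "=([^&]*)");
        Matcher m = p.matcher(query);
        return m.find() ? m.group(1) : null;
    }
}
```

With the example URL above, the key 'k2' yields 'v2' and 'k1' yields 'v1'. Compiling the pattern on every call is wasteful; a real UDF would likely cache it per key.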


> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>         Attachments: patch_563.txt, patch_563.txt.1, patch_563.txt.2
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Updated: (HIVE-563) UDF for parsing the URL

Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Antony updated HIVE-563:
-------------------------------

    Attachment:     (was: patch_563.txt.1)

> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>         Attachments: patch_563.txt
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Resolved: (HIVE-563) UDF for parsing the URL

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao resolved HIVE-563.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.4.0
     Release Note: HIVE-563. UDF for parsing the URL: parse_url. (Suresh Antony via zshao)
     Hadoop Flags: [Reviewed]

> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>             Fix For: 0.4.0
>
>         Attachments: patch_563.txt, patch_563.txt.1, patch_563.txt.2
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Updated: (HIVE-563) UDF for parsing the URL

Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Antony updated HIVE-563:
-------------------------------

    Attachment: patch_563.txt.1

 * UDF to extract specific parts from URL
 * parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') will return 'facebook.com'
 * parse_url('http://facebook.com/path/p1.php?query=1', 'PATH') will return '/path/p1.php'
 * parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY') will return 'query=1'
 * parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'REF') will return 'Ref'
 * parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'PROTOCOL') will return 'http'
 * Possible values are HOST,PATH,QUERY,REF,PROTOCOL,AUTHORITY,FILE,USERINFO
 * You can also get the value of a particular key in QUERY using the syntax QUERY:<KEY_NAME>, e.g. QUERY:k1.

> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>         Attachments: patch_563.txt, patch_563.txt.1
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Commented: (HIVE-563) UDF for parsing the URL

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719907#action_12719907 ] 

Raghotham Murthy commented on HIVE-563:
---------------------------------------

It looks like there are several string comparisons for each evaluate call. One possibility is to construct a static hashmap from partToExtract to an integer, populated when evaluate is called the first time. Then the rest of evaluate can just be a switch statement on that integer. Also, you can parse out the QUERY:<key> during the first call to evaluate into a static variable and then use that for the rest of the calls.

There are also a few typos: 
'partToExtarct' should be partToExtract
missing spaces : catch(Exception e){
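The hashmap-plus-switch idea from the comment above can be sketched roughly as follows. The class name PartLookupSketch and the integer codes are hypothetical, and the map is filled in a static initializer rather than lazily on the first evaluate call, which is a simplification of what the comment describes.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the suggestion; not code from any attached patch.
final class PartLookupSketch {
    private static final int HOST = 0, PATH = 1, QUERY = 2, REF = 3,
            PROTOCOL = 4, AUTHORITY = 5, FILE = 6, USERINFO = 7;

    // Part name -> integer code, built once when the class is loaded.
    private static final Map<String, Integer> PART_CODES = new HashMap<>();
    static {
        PART_CODES.put("HOST", HOST);
        PART_CODES.put("PATH", PATH);
        PART_CODES.put("QUERY", QUERY);
        PART_CODES.put("REF", REF);
        PART_CODES.put("PROTOCOL", PROTOCOL);
        PART_CODES.put("AUTHORITY", AUTHORITY);
        PART_CODES.put("FILE", FILE);
        PART_CODES.put("USERINFO", USERINFO);
    }

    private PartLookupSketch() {}

    static String evaluate(String urlStr, String partToExtract) {
        // One map lookup replaces the chain of string comparisons.
        Integer code = PART_CODES.get(partToExtract);
        if (urlStr == null || code == null) {
            return null;
        }
        URL url;
        try {
            url = new URL(urlStr);
        } catch (MalformedURLException e) {
            return null;
        }
        switch (code) {
            case HOST:      return url.getHost();
            case PATH:      return url.getPath();
            case QUERY:     return url.getQuery();
            case REF:       return url.getRef();
            case PROTOCOL:  return url.getProtocol();
            case AUTHORITY: return url.getAuthority();
            case FILE:      return url.getFile();
            case USERINFO:  return url.getUserInfo();
            default:        return null;
        }
    }
}
```

The lookup still hashes the part name on every call, so the saving is in avoiding up to eight sequential equals() checks, not in avoiding string work entirely.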


> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>         Attachments: patch_563.txt, patch_563.txt.1
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Commented: (HIVE-563) UDF for parsing the URL

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721552#action_12721552 ] 

Zheng Shao commented on HIVE-563:
---------------------------------

Committed. Thanks Suresh!

> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>             Fix For: 0.4.0
>
>         Attachments: patch_563.txt, patch_563.txt.1, patch_563.txt.2
>
>
> Needs a udf to extract the parts of url from url string. 



[jira] Commented: (HIVE-563) UDF for parsing the URL

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719926#action_12719926 ] 

Zheng Shao commented on HIVE-563:
---------------------------------

Agree with Raghu. While the String comparisons are still OK (I think moving to the static hashmap will definitely help, but it's optional), calling "String.split()" is a big performance hit (this is part of the reason that scripting languages are somewhat slower: people like to use String.split() in those languages).

Can we cache "partToExtract" from the last call, and avoid doing String.split() again if "partToExtract" didn't change (which is the normal case)?
Can we do a loop through the query string instead of calling String.split?
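Both suggestions can be sketched together: a single pass over the query string using indexOf instead of String.split(), and a small cache that re-parses the key only when partToExtract changes between calls. The class name QueryKeyScanner and both method names are hypothetical, and this is an illustration of the idea rather than what was eventually committed.

```java
// Hypothetical sketch; not code from any attached patch.
final class QueryKeyScanner {
    private String lastPart; // partToExtract seen on the previous call
    private String lastKey;  // key parsed from "QUERY:<key>", or null

    /** Returns the key from "QUERY:<key>", re-parsing only when the argument changes. */
    String keyFor(String partToExtract) {
        if (partToExtract == null) {
            return null;
        }
        if (!partToExtract.equals(lastPart)) {
            lastPart = partToExtract;
            lastKey = partToExtract.startsWith("QUERY:")
                    ? partToExtract.substring("QUERY:".length())
                    : null;
        }
        return lastKey;
    }

    /** Scans a query string such as "k1=v1&k2=v2" for key without allocating an array. */
    static String valueOf(String query, String key) {
        if (query == null || key == null) {
            return null;
        }
        final int len = query.length();
        int start = 0;
        while (start < len) {
            int end = query.indexOf('&', start);
            if (end < 0) {
                end = len; // last pair runs to the end of the string
            }
            int eq = query.indexOf('=', start);
            // The pair matches when the text before '=' is exactly the key.
            if (eq >= 0 && eq < end
                    && eq - start == key.length()
                    && query.regionMatches(start, key, 0, key.length())) {
                return query.substring(eq + 1, end);
            }
            start = end + 1;
        }
        return null;
    }
}
```

The only allocation on a hit is the returned substring, whereas String.split() builds an array and a string for every pair in the query.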


> UDF for parsing the URL
> -----------------------
>
>                 Key: HIVE-563
>                 URL: https://issues.apache.org/jira/browse/HIVE-563
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Server Infrastructure
>            Reporter: Suresh Antony
>            Assignee: Suresh Antony
>         Attachments: patch_563.txt, patch_563.txt.1
>
>
> Needs a udf to extract the parts of url from url string. 
