Posted to dev@hive.apache.org by "Suresh Antony (JIRA)" <ji...@apache.org> on 2009/06/16 02:56:07 UTC
[jira] Created: (HIVE-563) UDF for parsing the URL
UDF for parsing the URL
-----------------------
Key: HIVE-563
URL: https://issues.apache.org/jira/browse/HIVE-563
Project: Hadoop Hive
Issue Type: New Feature
Components: Server Infrastructure
Reporter: Suresh Antony
Assignee: Suresh Antony
Need a UDF to extract the parts of a URL from a URL string.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-563) UDF for parsing the URL
Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suresh Antony updated HIVE-563:
-------------------------------
Attachment: patch_563.txt.1
UDF to extract specific parts from URL
parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') will return 'facebook.com'
parse_url('http://facebook.com/path/p1.php?query=1', 'PATH') will return '/path/p1.php'
parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY') will return 'query=1'
parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'REF') will return 'Ref'
parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'PROTOCOL') will return 'http'
Possible values are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
You can also get the value of a particular key in QUERY using the syntax QUERY:<KEY_NAME>, e.g. QUERY:k1.
[jira] Commented: (HIVE-563) UDF for parsing the URL
Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719889#action_12719889 ]
Zheng Shao commented on HIVE-563:
---------------------------------
@patch_563.txt:
Inadvertent change to HiveHistory.java?
Can you test all possible values for HOST,PATH,QUERY,REF,PROTOCOL,AUTHORITY,FILE,USERINFO?
The test case serves not only as a test but also as an example for users.
[jira] Updated: (HIVE-563) UDF for parsing the URL
Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suresh Antony updated HIVE-563:
-------------------------------
Attachment: patch_563.txt
parse_url -- udf
Format:
parse_url(url, URL_PART_NAME)
Possible URL parts are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO
Example:
parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') will return 'facebook.com'
parse_url('http://facebook.com/path/p1.php?query=1', 'PATH') will return '/path/p1.php'
Definition of parts can be obtained from:
http://www.j2ee.me/j2se/1.4.2/docs/api/java/net/URL.html
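A minimal sketch of how the part names might map onto java.net.URL getters, assuming the UDF delegates to that class as the linked Javadoc suggests. The class name UrlPartSketch and the method extract are hypothetical and are not taken from the patch.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration only; not the code from patch_563.txt.
class UrlPartSketch {

    // Returns the requested part, or null if the URL is malformed,
    // the part name is unknown, or the URL has no such part.
    static String extract(String urlString, String part) {
        if (urlString == null || part == null) {
            return null;
        }
        URL url;
        try {
            // new URL(String) is deprecated in recent JDKs but still available.
            url = new URL(urlString);
        } catch (MalformedURLException e) {
            return null;
        }
        switch (part) {
            case "HOST":      return url.getHost();
            case "PATH":      return url.getPath();
            case "QUERY":     return url.getQuery();
            case "REF":       return url.getRef();
            case "PROTOCOL":  return url.getProtocol();
            case "AUTHORITY": return url.getAuthority();
            case "FILE":      return url.getFile();
            case "USERINFO":  return url.getUserInfo();
            default:          return null;
        }
    }

    public static void main(String[] args) {
        String u = "http://facebook.com/path/p1.php?query=1#Ref";
        String[] parts = {"HOST", "PATH", "QUERY", "REF", "PROTOCOL",
                          "AUTHORITY", "FILE", "USERINFO"};
        for (String p : parts) {
            System.out.println(p + " -> " + extract(u, p));
        }
    }
}
```

Note that java.net.URL.getPath() keeps the leading slash, and getFile() is the path plus the query string.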
[jira] Updated: (HIVE-563) UDF for parsing the URL
Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suresh Antony updated HIVE-563:
-------------------------------
Attachment: patch_563.txt.2
Removed String.split().
Added a second evaluate method where the user can specify the QUERY key as a separate argument, e.g.:
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k2')
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1')
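A rough sketch of what a three-argument form could look like, assuming a regular expression is used in place of String.split(). The class name QueryKeySketch is hypothetical and the patch may well do this differently. Under this sketch the two calls above would yield 'v2' and 'v1'.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code from patch_563.txt.2.
class QueryKeySketch {

    // Three-argument form: returns the value of `key` inside the query string,
    // or null if the part is not QUERY, the URL is malformed, or the key is absent.
    static String evaluate(String urlString, String partToExtract, String key) {
        if (urlString == null || key == null || !"QUERY".equals(partToExtract)) {
            return null;
        }
        String query;
        try {
            query = new URL(urlString).getQuery();
        } catch (MalformedURLException e) {
            return null;
        }
        if (query == null) {
            return null;
        }
        // Match "key=value" at the start of the query or right after '&'.
        // Pattern.quote keeps regex metacharacters in the key literal.
        Pattern p = Pattern.compile("(?:^|&)" + Pattern.quote(key) + "=([^&]*)");
        Matcher m = p.matcher(query);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String u = "http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1";
        System.out.println(evaluate(u, "QUERY", "k2"));
        System.out.println(evaluate(u, "QUERY", "k1"));
    }
}
```

Compiling the pattern on every call is itself wasteful; a real implementation would presumably reuse it while the key stays the same.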
[jira] Updated: (HIVE-563) UDF for parsing the URL
Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suresh Antony updated HIVE-563:
-------------------------------
Attachment: (was: patch_563.txt.1)
[jira] Resolved: (HIVE-563) UDF for parsing the URL
Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Shao resolved HIVE-563.
-----------------------------
Resolution: Fixed
Fix Version/s: 0.4.0
Release Note: HIVE-563. UDF for parsing the URL: parse_url. (Suresh Antony via zshao)
Hadoop Flags: [Reviewed]
[jira] Updated: (HIVE-563) UDF for parsing the URL
Posted by "Suresh Antony (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suresh Antony updated HIVE-563:
-------------------------------
Attachment: patch_563.txt.1
* UDF to extract specific parts from URL
* parse_url('http://facebook.com/path/p1.php?query=1', 'HOST') will return 'facebook.com'
* parse_url('http://facebook.com/path/p1.php?query=1', 'PATH') will return '/path/p1.php'
* parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY') will return 'query=1'
* parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'REF') will return 'Ref'
* parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'PROTOCOL') will return 'http'
* Possible values are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
* You can also get the value of a particular key in QUERY using the syntax QUERY:<KEY_NAME>, e.g. QUERY:k1.
[jira] Commented: (HIVE-563) UDF for parsing the URL
Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719907#action_12719907 ]
Raghotham Murthy commented on HIVE-563:
---------------------------------------
Looks like there are several string comparisons for each evaluate call. One possibility is to construct a static hashmap from partToExtract to an integer, populated the first time evaluate is called. The rest of evaluate can then just be a switch statement on that integer. Also, you can parse out the QUERY:<key> during the first call to evaluate, store it in a static variable, and then use that for the rest of the calls.
There are also a few typos:
'partToExtarct' should be partToExtract
missing spaces : catch(Exception e){
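The static-hashmap suggestion above might look roughly like the following. This is only a sketch under assumptions: the class name PartLookupSketch, the integer constants, and the map name PART_CODES are hypothetical, and the map is filled in a static initializer rather than on the first evaluate call as suggested.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the code from the patch.
class PartLookupSketch {
    private static final int HOST = 0, PATH = 1, QUERY = 2, REF = 3,
            PROTOCOL = 4, AUTHORITY = 5, FILE = 6, USERINFO = 7;

    // One hash lookup per call replaces a chain of String comparisons.
    private static final Map<String, Integer> PART_CODES = new HashMap<>();
    static {
        PART_CODES.put("HOST", HOST);
        PART_CODES.put("PATH", PATH);
        PART_CODES.put("QUERY", QUERY);
        PART_CODES.put("REF", REF);
        PART_CODES.put("PROTOCOL", PROTOCOL);
        PART_CODES.put("AUTHORITY", AUTHORITY);
        PART_CODES.put("FILE", FILE);
        PART_CODES.put("USERINFO", USERINFO);
    }

    static String evaluate(String urlString, String partToExtract) {
        if (urlString == null || partToExtract == null) {
            return null;
        }
        Integer code = PART_CODES.get(partToExtract);
        if (code == null) {
            return null; // unknown part name
        }
        URL url;
        try {
            url = new URL(urlString);
        } catch (MalformedURLException e) {
            return null;
        }
        switch (code.intValue()) {
            case HOST:      return url.getHost();
            case PATH:      return url.getPath();
            case QUERY:     return url.getQuery();
            case REF:       return url.getRef();
            case PROTOCOL:  return url.getProtocol();
            case AUTHORITY: return url.getAuthority();
            case FILE:      return url.getFile();
            case USERINFO:  return url.getUserInfo();
            default:        return null;
        }
    }
}
```

An enum with a switch would serve the same purpose; plain integers are used here only to stay close to the wording of the comment.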
[jira] Commented: (HIVE-563) UDF for parsing the URL
Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721552#action_12721552 ]
Zheng Shao commented on HIVE-563:
---------------------------------
Committed. Thanks Suresh!
[jira] Commented: (HIVE-563) UDF for parsing the URL
Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719926#action_12719926 ]
Zheng Shao commented on HIVE-563:
---------------------------------
Agree with Raghu. The String comparisons are still OK (I think moving to the static hashmap will definitely help, but it's optional), whereas calling String.split() is a really big performance hit. This is part of the reason that scripting languages are somewhat slower: people like to use String.split() in those languages.
Can we cache "partToExtract" from the last call, and avoid doing String.split again if "partToExtract" didn't change (which is the normal case)?
Can we do a loop through the query string instead of calling String.split?
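Both ideas, caching the last partToExtract and scanning the query string with a loop, can be sketched as follows. The class name CachedQueryLookupSketch and the helper findValue are hypothetical, and for brevity the sketch handles only the QUERY:<key> form.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration only; not the code from the patch.
class CachedQueryLookupSketch {
    private static final String PREFIX = "QUERY:";

    private String lastPartToExtract; // spec seen on the previous call
    private String lastKey;           // key parsed from "QUERY:<key>", or null

    String evaluate(String urlString, String partToExtract) {
        if (urlString == null || partToExtract == null) {
            return null;
        }
        // Re-parse the spec only when it differs from the previous call.
        if (!partToExtract.equals(lastPartToExtract)) {
            lastPartToExtract = partToExtract;
            lastKey = partToExtract.startsWith(PREFIX)
                    ? partToExtract.substring(PREFIX.length())
                    : null;
        }
        if (lastKey == null) {
            return null; // this sketch covers only the QUERY:<key> form
        }
        String query;
        try {
            query = new URL(urlString).getQuery();
        } catch (MalformedURLException e) {
            return null;
        }
        return query == null ? null : findValue(query, lastKey);
    }

    // Walks "k1=v1&k2=v2" one pair at a time using indexOf, so no
    // intermediate String[] is allocated as String.split would do.
    static String findValue(String query, String key) {
        int start = 0;
        while (start < query.length()) {
            int end = query.indexOf('&', start);
            if (end < 0) {
                end = query.length();
            }
            int eq = query.indexOf('=', start);
            if (eq >= 0 && eq < end
                    && eq - start == key.length()
                    && query.regionMatches(start, key, 0, key.length())) {
                return query.substring(eq + 1, end);
            }
            start = end + 1;
        }
        return null;
    }
}
```

The cache is kept in instance fields here rather than static ones, on the assumption that one instance is reused across the rows of a query.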