You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Sanjay Subramanian <sa...@yahoo.com> on 2015/04/20 19:41:22 UTC

Using Hive as a file comparison and grep-ping tool

hey guys
As data wranglers and programmers we often need quick tools. One such tool I need almost everyday is one that greps a file based on contents of another file. One can write this in perl, python but since I am already using hadoop ecosystem extensively, I said why not do this in Hive ? 
Perhaps you guys already know this and have better solutions....nevertheless :-) here goes...

Best regards
sanjay(Hive super-fan)
I just posted this on my bloghttps://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/
In case the blog URL does not work for any reason, here is the logic
Using Hive as a file comparison and grep-ping tool==================================================1. Logon to your linux terminal where u run Hive queries from
2. Create a database called "myutils" in Hive   Create two hive tables myutils.file1 and myutils.file2 in Hive    - each of these tables will have a partition called "fn"     ----> fn is short for "filename"    - each of these tables will have just one column called "ln" ----> ln is short for "line"   An easy script to help do that would be as follows     for r in 1 2 ; do hive -e "CREATE DATABASE IF NOT EXISTS myutils; USE myutils; DROP TABLE IF EXISTS file${r}; CREATE EXTERNAL TABLE IF NOT EXISTS file${r}(ln STRING) PARTITIONED BY (fn STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"; done
3. Create a permanent base location folder in HDFS   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/   USECASE 1 : ===========Search if a bunch of IP addresses exist in another file containing (larger) bunch of IPs
[1] registeredIPs.txt    10.456.34.90    123.675.654.1    21.87.657.456    234.109.34.234        visitorIPs.txt    10.456.34.90    12.367.54.23    218.7.657.456    23.4.109.3   [2] Output which IPs in File1 are present in File2
[3] Put each file in a separate HDFS location        hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt    hdfs dfs -put VisitorIPs.txt  /workspace/myutils/filecomparator/file1/visitorIPs.txt
    hdfs dfs -put registeredIPs.txt  /workspace/myutils/filecomparator/file1/registeredIPs.txt    hdfs dfs -put visitorIPs.txt  /workspace/myutils/filecomparator/file1/visitorIPs.txt
[4] Add partition to  myutils.file1    For simplicity keep the partition names identical to the file names themselves      hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION(ln='registeredIPs.txt') LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt'"
    hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION(ln='visitorIPs.txt') LOCATION '/workspace/myutils/filecomparator/file2/visitorIPs.txt'"    [5] Check that partitions can be accesd by Hive
    # This should give u the same answer as    # wc -l registeredIPs.txt    hive -e "select count(*) from myutils.file1 where fn='registeredIPs.txt'"
    # This should give u the same answer as    # wc -l visitorIPs.txt    hive -e "select count(*) from myutils.file2 where fn='visitorIPs.txt'"
[6] Count the number of IPs in registeredIPs.txt that are in visitorIPs.txt
# This dumps to a local file systemhive -e "SELECT f1.ln FROM (SELECT ln FROM utils.file1 WHERE fn='registeredIPs.txt') f1  JOIN (select ln from myutils.file2 WHERE fn='visitorIPs.txt') f2  ON trim(f1.ln) = trim(f2.ln)" > ./registered_in_visitors_list.txt
# This dumps to a new "internally-managed-by-hive" table # Make sure u already dont have some valuable hive table called "myutils.registered_in_visitors_list" - else this will overwrite that hive table with the results of this hive query hive -e "USE myutils; DROP TABLE IF EXITS registered_in_visitors_list; CREATE TABLE if not exists registered_in_visitors_list AS SELECT f1.ln FROM (select ln FROM utils.file1 WHERE fn='registeredIPs.txt') f1  JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2  ON trim(f1.ln) = trim(f2.ln)"
# This dumps to a directory on HDFS# Make sure u already dont have some valuable directory called "registered_in_visitors_list" - else this will overwrite that director and all its contents with the results of this hive query hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list' SELECT f1.ln FROM (select ln FROM utils.file1 WHERE fn='registeredIPs.txt') f1  JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2  ON trim(f1.ln) = trim(f2.ln)"