Posted to user@hive.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2016/01/21 16:15:55 UTC
File search by hashes in Hadoop
Hi all,
Apologies for the nature of this question.
Someone asked me whether it is possible to perform file search by hashes in
Hadoop.
I think he may mean wildcard searches in HDFS.
Does anyone know what file search by hash means in Hadoop?
regards,
Mich
RE: File search by hashes in Hadoop
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks Ritesh
I see there are two options here
1. Use UNIX-like commands on HDFS to find the relevant files:
hdfs dfs -ls -R |grep sales
drwxr-xr-x - hduser supergroup 0 2015-12-27 06:02 sales
-rw-r--r-- 2 hduser supergroup 0 2015-12-27 06:02 sales/_SUCCESS
2. Index-based searching using Apache Lucene.
a. Download Apache Lucene, for example lucene-5.4.0.gz. gunzip it, rename the result to lucene-5.4.0.tar and untar it. It is a two-minute job.
b. Set LUCENE_HOME to the directory where you untarred the files:
export LUCENE_HOME=/usr/lib/lucene
c. Make sure that your CLASSPATH includes the following jar files:
CLASSPATH=$CLASSPATH:${LUCENE_HOME}/core/lucene-core-5.4.0.jar:${LUCENE_HOME}/demo/lucene-demo-5.4.0.jar:${LUCENE_HOME}/analysis/common/lucene-analyzers-common-5.4.0.jar:${LUCENE_HOME}/queryparser/lucene-queryparser-5.4.0.jar
d. Create an index for the directory you want to search, in my case $HADOOP_HOME/etc/hadoop. When you run the Java command below, a directory called index is created in the directory where you ran it:
java -cp $CLASSPATH org.apache.lucene.demo.IndexFiles -docs $HADOOP_HOME/etc/hadoop
e. You can then search the index directory. For example, I am looking for the word ‘yarn’:
java -cp $CLASSPATH org.apache.lucene.demo.SearchFiles
Enter query:
yarn
Searching for: yarn
9 total matching documents
1. /home/hduser/hadoop-2.6.0/etc/hadoop/keep/mapred-site.xml
2. /home/hduser/hadoop-2.6.0/etc/hadoop/yarn-env.cmd
3. /home/hduser/hadoop-2.6.0/etc/hadoop/yarn-env.sh
4. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml_ok
5. /home/hduser/hadoop-2.6.0/etc/hadoop/yarn-site.xml
6. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml
7. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml_pre
8. /home/hduser/hadoop-2.6.0/etc/hadoop/hadoop-policy.xml
9. /home/hduser/hadoop-2.6.0/etc/hadoop/log4j.properties
Press (q)uit or enter number to jump to a page.
Pretty useful
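The Lucene steps above could be collected into a small script along these lines. This is only a sketch: the LUCENE_VERSION variable and the helper names lucene_classpath, build_index and search_index are illustrative and not part of Lucene, and the jar layout is assumed to match the 5.4.0 paths shown above.

```shell
#!/bin/sh
# Sketch only: LUCENE_VERSION and the function names are illustrative assumptions.
LUCENE_HOME=${LUCENE_HOME:-/usr/lib/lucene}
LUCENE_VERSION=${LUCENE_VERSION:-5.4.0}

# Print the classpath entries needed by the Lucene demo classes.
lucene_classpath() {
  printf '%s:%s:%s:%s\n' \
    "${LUCENE_HOME}/core/lucene-core-${LUCENE_VERSION}.jar" \
    "${LUCENE_HOME}/demo/lucene-demo-${LUCENE_VERSION}.jar" \
    "${LUCENE_HOME}/analysis/common/lucene-analyzers-common-${LUCENE_VERSION}.jar" \
    "${LUCENE_HOME}/queryparser/lucene-queryparser-${LUCENE_VERSION}.jar"
}

# Index the files under the directory given as $1.
# The demo writes the index to ./index in the current working directory.
build_index() {
  java -cp "$(lucene_classpath)" org.apache.lucene.demo.IndexFiles -docs "$1"
}

# Start the interactive search prompt against ./index.
search_index() {
  java -cp "$(lucene_classpath)" org.apache.lucene.demo.SearchFiles
}
```

It would be used as build_index "$HADOOP_HOME/etc/hadoop" followed by search_index, both run from the same directory. Note that the demo indexes files on the local filesystem, so data held in HDFS would first have to be made available locally.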
Dr Mich Talebzadeh
From: Ritesh Kumar Singh [mailto:riteshoneinamillion@gmail.com]
Sent: 21 January 2016 17:10
To: user@hive.apache.org
Subject: Re: File search by hashes in Hadoop
Yes, it's possible to do both
1. Index based searching : http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241?pgno=3
2. Wildcard based / Expression based searching : https://stackoverflow.com/questions/6297533/search-find-a-file-and-file-content-in-hadoop
Thanks,
Ritesh Kumar Singh,
https://riteshtoday.wordpress.com/
Re: File search by hashes in Hadoop
Posted by Ritesh Kumar Singh <ri...@gmail.com>.
Yes, it's possible to do both
1. Index based searching :
http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241?pgno=3
2. Wildcard based / Expression based searching :
https://stackoverflow.com/questions/6297533/search-find-a-file-and-file-content-in-hadoop
Thanks,
Ritesh Kumar Singh,
https://riteshtoday.wordpress.com/
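For the wildcard or expression based route, one simple approach is to filter a recursive HDFS listing by pattern. The sketch below is an illustration rather than anything taken from the linked pages: filter_listing is a hypothetical helper name, and it assumes the usual eight-column output of hdfs dfs -ls -R with no whitespace in path names.

```shell
# Print only the path column of `hdfs dfs -ls -R` style lines whose path
# matches the extended regular expression given as $1.
# Lines with fewer than eight fields (such as "Found N items") are skipped.
# Assumption: path names contain no whitespace.
filter_listing() {
  awk -v pat="$1" 'NF >= 8 && $NF ~ pat { print $NF }'
}
```

For example, hdfs dfs -ls -R | filter_listing '^sales' lists only entries whose path starts with sales. Unlike a plain grep over the whole line, this does not match on the owner, group or date columns.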