Posted to user@hive.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2016/01/21 16:15:55 UTC

File search by hashes in Hadoop

Hi all,

 

Apologies for the nature of this question. 

 

Someone asked me whether it is possible to perform file search by hashes in
Hadoop.

 

I am thinking that he means wildcard searches in HDFS?

 

Does anyone have an idea what file search by hash means in Hadoop?

 

regards,

 

Mich

 


RE: File search by hashes in Hadoop

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks Ritesh

 

 

I see there are two options here

 

1.    Use UNIX-like commands on HDFS to find the relevant files:

      hdfs dfs -ls -R |grep sales

      drwxr-xr-x   - hduser supergroup          0 2015-12-27 06:02 sales
      -rw-r--r--   2 hduser supergroup          0 2015-12-27 06:02 sales/_SUCCESS

2.    Index-based searching using Apache Lucene.

a.    Download Apache Lucene, for example lucene-5.4.0.gz. Gunzip it, rename the result to lucene-5.4.0.tar and untar it. This is a two-minute job.

b.    Set LUCENE_HOME to the directory where you untarred the files:

      export LUCENE_HOME=/usr/lib/lucene

c.    Make sure that your CLASSPATH has the following jar files:

      CLASSPATH=$CLASSPATH:${LUCENE_HOME}/core/lucene-core-5.4.0.jar:${LUCENE_HOME}/demo/lucene-demo-5.4.0.jar:${LUCENE_HOME}/analysis/common/lucene-analyzers-common-5.4.0.jar:${LUCENE_HOME}/queryparser/lucene-queryparser-5.4.0.jar

d.    Create an index for the directory you want to search, in my case $HADOOP_HOME/etc/hadoop. When you run the Java command below, a directory called index is created in the directory where you ran the command:

      java -cp $CLASSPATH org.apache.lucene.demo.IndexFiles -docs $HADOOP_HOME/etc/hadoop

e.    Then you can search the index directory. For example, I am looking for the word ‘yarn’:

      java -cp $CLASSPATH org.apache.lucene.demo.SearchFiles

      Enter query:
      yarn
      Searching for: yarn
      9 total matching documents
      1. /home/hduser/hadoop-2.6.0/etc/hadoop/keep/mapred-site.xml
      2. /home/hduser/hadoop-2.6.0/etc/hadoop/yarn-env.cmd
      3. /home/hduser/hadoop-2.6.0/etc/hadoop/yarn-env.sh
      4. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml_ok
      5. /home/hduser/hadoop-2.6.0/etc/hadoop/yarn-site.xml
      6. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml
      7. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml_pre
      8. /home/hduser/hadoop-2.6.0/etc/hadoop/hadoop-policy.xml
      9. /home/hduser/hadoop-2.6.0/etc/hadoop/log4j.properties
      Press (q)uit or enter number to jump to a page.

 

Pretty useful
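
One caveat with piping the recursive listing through a plain grep is that it matches the whole line, so a user or group named "sales" would also produce hits. A minimal sketch of a stricter filter follows, assuming the eight-column listing layout shown in the sample output earlier in this message and paths without spaces. The helper name filter_paths is hypothetical and not part of Hadoop.

```shell
# filter_paths: hypothetical helper, not a Hadoop command.
# Reads `hdfs dfs -ls -R`-style lines on stdin and prints only the path
# column (field 8) when the path matches the given regular expression.
# Assumption: paths contain no spaces, so the path is exactly field 8.
filter_paths() {
    awk -v pat="$1" 'NF >= 8 && $8 ~ pat { print $8 }'
}

# Intended use against a live cluster (not run here):
#   hdfs dfs -ls -R / | filter_paths 'sales'
```

Lines with fewer than eight fields, such as a "Found N items" header, are skipped rather than matched.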

 

 

Dr Mich Talebzadeh

 


 

From: Ritesh Kumar Singh [mailto:riteshoneinamillion@gmail.com] 
Sent: 21 January 2016 17:10
To: user@hive.apache.org
Subject: Re: File search by hashes in Hadoop

 

Yes, it's possible to do both

1. Index based searching : http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241?pgno=3

2. Wildcard based / Expression based searching : https://stackoverflow.com/questions/6297533/search-find-a-file-and-file-content-in-hadoop

 

Thanks,




Ritesh Kumar Singh,

https://riteshtoday.wordpress.com/

 





Re: File search by hashes in Hadoop

Posted by Ritesh Kumar Singh <ri...@gmail.com>.
Yes, it's possible to do both
1. Index based searching :
http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241?pgno=3
2. Wildcard based / Expression based searching :
https://stackoverflow.com/questions/6297533/search-find-a-file-and-file-content-in-hadoop
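
Neither option above compares file contents by digest. If the question is read literally, "search by hash" could mean finding every file whose checksum equals a known value. The following is only a sketch of that reading on a local directory, assuming GNU coreutils (sha256sum) is available. The helper name find_by_hash is hypothetical.

```shell
# find_by_hash: hypothetical helper, not a Hadoop command.
# Prints every regular file under DIR whose SHA-256 digest equals DIGEST.
# Usage: find_by_hash DIGEST DIR
find_by_hash() {
    want="$1"
    dir="$2"
    # sha256sum prints "<digest>  <path>"; keep matching lines and strip
    # the digest and the two-space separator, leaving only the path.
    find "$dir" -type f -exec sha256sum {} + |
        awk -v want="$want" '$1 == want { sub(/^[^ ]+  /, ""); print }'
}

# Possible adaptation for HDFS (an assumption, not exercised here):
# stream each file and hash it on the client, for example
#   hdfs dfs -cat <path> | sha256sum
```

Hashing every file on each search reads all the data, so for large trees it would make more sense to compute the digests once and store them in an index.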

Thanks,

Ritesh Kumar Singh,
https://riteshtoday.wordpress.com/
