Posted to user@hive.apache.org by Techy Teck <co...@gmail.com> on 2012/07/31 06:11:47 UTC

Find the files which contains a particular String

I have around 100 files, each about 1 GB in size. I need to find a String in
all these 100 files and also determine which files contain that particular
String. I am working with the Hadoop File System, and all 100 files are in
HDFS.

All 100 files are under the real folder, so if I do as below, I get all 100
files listed. And I need to find which files under the real folder contain a
particular String, *hello*.

bash-3.00$ hadoop fs -ls /technology/dps/real




And this is my data structure in HDFS:

row format delimited
fields terminated by '\29'
collection items terminated by ','
map keys terminated by ':'
stored as textfile



How can I write a MapReduce job for this particular problem, so that I can
find which files contain a particular string? Any simple example would be of
great help to me.

Re: Find the files which contains a particular String

Posted by Bob Gause <bo...@zyquest.com>.
We do a similar process with our log files in Hive. We only handle 30 to 60 files (with a similar structure) at a time, but it sounds like it would fit your model…

We create an external table, then do hdfs puts to add the files to the table:

CREATE EXTERNAL TABLE log_import(
  date STRING,
  time STRING,
  url STRING,
  args STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/import';

dfs -put /data/clients/processed/20120616.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120617.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120618.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120619.txt.gz /user/hive/warehouse/import;

I don't know for certain, but it seems like Hive/Hadoop treats the separate files in the table as clusters or buckets. We do see a good level of parallelism when we run queries against it…

Thanks,
Bob

Robert Gause
Senior Systems Engineer
ZyQuest, Inc.
bob.gause@zyquest.com
920.617.7613



Re: Find the files which contains a particular String

Posted by Ravindra <ra...@gmail.com>.
You can create a table having a schema similar to your files' structure, and
later add the files as partitions into the table:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements

Then you can query your files using a WHERE clause.

This seems to be a time-consuming alternative and I have never tried it, so
try it at your own risk :)

--
Ravi.
*''We do not inherit the earth from our ancestors, we borrow it from our
children.'' PROTECT IT !*


Re: Find the files which contains a particular String

Posted by Vinod Singh <vi...@vinodsingh.com>.
I believe Hive does not have any feature that can provide this information.
You may want to write a custom MapReduce program and get the name of the
file being processed, as shown below:

((FileSplit) context.getInputSplit()).getPath()

and then emit the file name when an occurrence of the word is found.
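Outside of Hadoop, the heart of such a mapper is just a per-line substring check that pairs each match with its source file. A minimal self-contained sketch of that logic in plain Java (the class name, sample file names, and temp directory here are made up for illustration; in the real job you would run this scan per input split and take the path from the FileSplit as shown above):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class FileGrep {

    // Return the subset of 'files' that contain 'needle' on any line.
    // This mirrors what the mapper does per split: scan the lines, and on
    // a hit record the file (the real job would emit the path as the key).
    static List<Path> filesContaining(List<Path> files, String needle) {
        List<Path> hits = new ArrayList<>();
        for (Path file : files) {
            try (Stream<String> lines = Files.lines(file)) {
                if (lines.anyMatch(line -> line.contains(needle))) {
                    hits.add(file);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Two small sample files in a temp dir stand in for the
        // 100 HDFS files under /technology/dps/real.
        Path dir = Files.createTempDirectory("grep-demo");
        Path a = Files.write(dir.resolve("part-0"), List.of("foo", "hello world"));
        Path b = Files.write(dir.resolve("part-1"), List.of("bar", "baz"));

        for (Path p : filesContaining(List.of(a, b), "hello")) {
            System.out.println(p.getFileName()); // prints only part-0
        }
    }
}
```

On real 1 GB inputs the scan is the same; the MapReduce job simply performs it per split, uses the FileSplit's path as the emitted key, and lets a reducer (or map-side de-duplication) collapse repeated hits on the same file.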

Thanks,
Vinod