Posted to common-user@hadoop.apache.org by Fengyun RAO <ra...@gmail.com> on 2013/12/30 08:58:57 UTC

any suggestions on IIS log storage and analysis?

Hi,

HDFS splits files into blocks, and MapReduce runs a map task for each
block. However, the fields in IIS log files can change, which means
lines in one block may depend on a header in another block, making the
raw files unsuitable for a MapReduce job. It seems some preprocessing is
needed before storing and analyzing the IIS log files. We plan to parse
each line into the same set of fields and store the results in
compressed Avro files. Any other alternatives? HBase? Or any suggestions
on analyzing IIS log files?

thanks!

Re: any suggestions on IIS log storage and analysis?

Posted by Fengyun RAO <ra...@gmail.com>.
What do you mean by "join the data sets"?

A fake sample log file:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port
cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status
time-taken
2013-07-04 20:00:00 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227
Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1
200 0 0 390
2013-07-04 20:00:01 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227
Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1
200 0 0 390
2013-07-04 20:00:02 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227
Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1
200 0 0 390
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:03
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port
cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status
time-taken
2013-07-04 20:00:03 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227
Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1
200 0 0 390


"#Fileds:" line is needed to parse the following IIS log, however, it may
change, which makes splitting not supported.
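
For context, a minimal sketch of that dependence (the class and method
names are illustrative, not an existing library): a record line can only
be interpreted after the most recent "#Fields:" directive has been seen,
which a split starting mid-file would not contain.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: interprets a record line using the field names
// declared by the most recent "#Fields:" directive.
public class IisLineParser {

    private String[] fieldNames; // set by the last "#Fields:" directive seen

    // Feed every line through this method, directives included.
    public Map<String, String> accept(String line) {
        if (line.startsWith("#Fields:")) {
            // "#Fields: date time s-ip ..." -> ["date", "time", "s-ip", ...]
            fieldNames = line.substring("#Fields:".length()).trim().split(" ");
            return null;
        }
        if (line.startsWith("#")) {
            return null; // other directives (#Software, #Version, #Date)
        }
        if (fieldNames == null) {
            // This is exactly the cross-block problem: a split that starts
            // mid-file has records but no directive to interpret them with.
            throw new IllegalStateException("no #Fields directive seen yet");
        }
        String[] values = line.split(" ");
        Map<String, String> record = new HashMap<String, String>();
        for (int i = 0; i < fieldNames.length && i < values.length; i++) {
            record.put(fieldNames[i], values[i]);
        }
        return record;
    }
}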


2013/12/30 Azuryy Yu <az...@gmail.com>

> You can run a mapreduce firstly, Join these data sets into one data set.
> then analyze the joined dataset.
>
>
> On Mon, Dec 30, 2013 at 3:58 PM, Fengyun RAO <ra...@gmail.com> wrote:
>
>> Hi,
>>
>> HDFS splits files into blocks, and mapreduce runs a map task for each
>> block. However, Fields could be changed in IIS log files, which means
>> fields in one block may depend on another, and thus make it not suitable
>> for mapreduce job. It seems there should be some preprocess before storing
>> and analyzing the IIS log files. We plan to parse each line to the same
>> fields and store in Avro files with compression. Any other alternatives?
>> Hbase?  or any suggestions on analyzing IIS log files?
>>
>> thanks!
>>
>>
>>
>

Re: any suggestions on IIS log storage and analysis?

Posted by Azuryy Yu <az...@gmail.com>.
You can run a MapReduce job first to join these data sets into one data
set, then analyze the joined data set.


On Mon, Dec 30, 2013 at 3:58 PM, Fengyun RAO <ra...@gmail.com> wrote:

> Hi,
>
> HDFS splits files into blocks, and mapreduce runs a map task for each
> block. However, Fields could be changed in IIS log files, which means
> fields in one block may depend on another, and thus make it not suitable
> for mapreduce job. It seems there should be some preprocess before storing
> and analyzing the IIS log files. We plan to parse each line to the same
> fields and store in Avro files with compression. Any other alternatives?
> Hbase?  or any suggestions on analyzing IIS log files?
>
> thanks!
>
>
>

Re: any suggestions on IIS log storage and analysis?

Posted by Fengyun RAO <ra...@gmail.com>.
Thanks, Peyman. The problem is that the dependence is not simply a key;
it is complicated enough that without the "#Fields" line from one block,
it is not even possible to parse any line in another block.


2014/1/1 Peyman Mohajerian <mo...@gmail.com>

> You can run a series of map-reduce jobs on your data, if some log line is
> related to another line, e.g. based on sessionId, you can emit the
> sessionId as the key of your mapper output with the value being on the rows
> associated with the sessionId, so on the reducer side data from different
> blocks will be coming together. Of course that is just one example, so the
> fact that you have file content being split doesn't impact your analysis if
> you have inter-dependencies.
>
>
> On Mon, Dec 30, 2013 at 7:31 PM, Fengyun RAO <ra...@gmail.com> wrote:
>
>> Thanks, I understand now, but I don't think this is what we need. The IIS
>> log files are very big (e.g, serveral GB per file), we need to split them
>> for parallel processing. However, this could be used as some sort of
>> preprocessing, to transform the original log files to splitable files such
>> as Avro files.
>>
>>
>>
>>
>> 2013/12/31 java8964 <ja...@hotmail.com>
>>
>>> Google "Hadoop WholeFileInputFormat" or search it in book " Hadoop: The
>>> Definitive Guide<http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA242&lpg=PA242&dq=hadoop+definitive+guide+WholeFileInputFormat&source=bl&ots=i7BUTBU8Vw&sig=0m5effHuOY1kuqiRofqTbeEl7KU&hl=en&sa=X&ei=yijCUs_YLqHJsQSZ1oD4DQ&ved=0CD0Q6AEwAA>
>>> "
>>>
>>> Yong
>>>
>>>
>>> ------------------------------
>>> Date: Tue, 31 Dec 2013 09:39:58 +0800
>>> Subject: Re: any suggestions on IIS log storage and analysis?
>>>
>>> From: raofengyun@gmail.com
>>> To: user@hadoop.apache.org
>>>
>>> Thanks, Yong!
>>>
>>> The dependence never cross files, but since HDFS splits files into
>>> blocks, it may cross blocks, which makes it difficult to write MR job. I
>>> don't quite understand what you mean by "WholeFileInputFormat ".
>>> Actually, I have no idea how to deal with dependence across blocks.
>>>
>>>
>>> 2013/12/31 java8964 <ja...@hotmail.com>
>>>
>>> I don't know any example of IIS log files. But from what you described,
>>> it looks like analyzing one line of log data depends on some previous lines
>>> data. You should be more clear about what is this dependence and what you
>>> are trying to do.
>>>
>>> Just based on your questions, you still have different options, which
>>> one is better depends on your requirements and data.
>>>
>>> 1) You know the existing default TextInputFormat not suitable for your
>>> case, you just need to find alternatives, or write your own.
>>> 2) If the dependences never cross the files, just cross lines, you can
>>> use WholeFileInputFormat (No such class coming from Hadoop itself, but very
>>> easy implemented by yourself)
>>> 3) If the dependences cross the files, then you maybe have to enforce
>>> your business logics in reducer side, instead of mapper side. Without
>>> knowing your detail requirements of this dependence, it is hard to give you
>>> more detail, but you need to find out what are good KEY candidates for your
>>> dependence logic, send the data based on that to the reducers, and enforce
>>> your logic on the reducer sides. If one MR job is NOT enough to solve your
>>> dependence, you may need chain several MR jobs together.
>>>
>>> Yong
>>>
>>> ------------------------------
>>> Date: Mon, 30 Dec 2013 15:58:57 +0800
>>> Subject: any suggestions on IIS log storage and analysis?
>>> From: raofengyun@gmail.com
>>> To: user@hadoop.apache.org
>>>
>>>
>>> Hi,
>>>
>>> HDFS splits files into blocks, and mapreduce runs a map task for each
>>> block. However, Fields could be changed in IIS log files, which means
>>> fields in one block may depend on another, and thus make it not suitable
>>> for mapreduce job. It seems there should be some preprocess before storing
>>> and analyzing the IIS log files. We plan to parse each line to the same
>>> fields and store in Avro files with compression. Any other alternatives?
>>> Hbase?  or any suggestions on analyzing IIS log files?
>>>
>>> thanks!
>>>
>>>
>>>
>>>
>>
>

Re: any suggestions on IIS log storage and analysis?

Posted by Peyman Mohajerian <mo...@gmail.com>.
You can run a series of map-reduce jobs on your data. If some log line is
related to another line, e.g. based on sessionId, you can emit the
sessionId as the key of your mapper output, with the value being the rows
associated with that sessionId, so on the reducer side data from different
blocks will come together. Of course that is just one example, but it
shows that having file content split across blocks doesn't have to block
your analysis even when there are inter-dependencies.
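
As a rough sketch of that pattern (extractSessionId is a hypothetical
helper, not anything from Hadoop), the mapper side might look like:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: key each log line by its session id so that related lines
// from different blocks meet at the same reducer.
public class SessionKeyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String sessionId = extractSessionId(line.toString());
        if (sessionId != null) {
            context.write(new Text(sessionId), line);
        }
    }

    // Hypothetical helper: pull a session id out of the cs-uri-query or a
    // cookie field, depending on how the site encodes it.
    private String extractSessionId(String line) {
        return null; // placeholder
    }
}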


On Mon, Dec 30, 2013 at 7:31 PM, Fengyun RAO <ra...@gmail.com> wrote:

> Thanks, I understand now, but I don't think this is what we need. The IIS
> log files are very big (e.g, serveral GB per file), we need to split them
> for parallel processing. However, this could be used as some sort of
> preprocessing, to transform the original log files to splitable files such
> as Avro files.
>
>
>
>
> 2013/12/31 java8964 <ja...@hotmail.com>
>
>> Google "Hadoop WholeFileInputFormat" or search it in book " Hadoop: The
>> Definitive Guide<http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA242&lpg=PA242&dq=hadoop+definitive+guide+WholeFileInputFormat&source=bl&ots=i7BUTBU8Vw&sig=0m5effHuOY1kuqiRofqTbeEl7KU&hl=en&sa=X&ei=yijCUs_YLqHJsQSZ1oD4DQ&ved=0CD0Q6AEwAA>
>> "
>>
>> Yong
>>
>>
>> ------------------------------
>> Date: Tue, 31 Dec 2013 09:39:58 +0800
>> Subject: Re: any suggestions on IIS log storage and analysis?
>>
>> From: raofengyun@gmail.com
>> To: user@hadoop.apache.org
>>
>> Thanks, Yong!
>>
>> The dependence never cross files, but since HDFS splits files into
>> blocks, it may cross blocks, which makes it difficult to write MR job. I
>> don't quite understand what you mean by "WholeFileInputFormat ".
>> Actually, I have no idea how to deal with dependence across blocks.
>>
>>
>> 2013/12/31 java8964 <ja...@hotmail.com>
>>
>> I don't know any example of IIS log files. But from what you described,
>> it looks like analyzing one line of log data depends on some previous lines
>> data. You should be more clear about what is this dependence and what you
>> are trying to do.
>>
>> Just based on your questions, you still have different options, which one
>> is better depends on your requirements and data.
>>
>> 1) You know the existing default TextInputFormat not suitable for your
>> case, you just need to find alternatives, or write your own.
>> 2) If the dependences never cross the files, just cross lines, you can
>> use WholeFileInputFormat (No such class coming from Hadoop itself, but very
>> easy implemented by yourself)
>> 3) If the dependences cross the files, then you maybe have to enforce
>> your business logics in reducer side, instead of mapper side. Without
>> knowing your detail requirements of this dependence, it is hard to give you
>> more detail, but you need to find out what are good KEY candidates for your
>> dependence logic, send the data based on that to the reducers, and enforce
>> your logic on the reducer sides. If one MR job is NOT enough to solve your
>> dependence, you may need chain several MR jobs together.
>>
>> Yong
>>
>> ------------------------------
>> Date: Mon, 30 Dec 2013 15:58:57 +0800
>> Subject: any suggestions on IIS log storage and analysis?
>> From: raofengyun@gmail.com
>> To: user@hadoop.apache.org
>>
>>
>> Hi,
>>
>> HDFS splits files into blocks, and mapreduce runs a map task for each
>> block. However, Fields could be changed in IIS log files, which means
>> fields in one block may depend on another, and thus make it not suitable
>> for mapreduce job. It seems there should be some preprocess before storing
>> and analyzing the IIS log files. We plan to parse each line to the same
>> fields and store in Avro files with compression. Any other alternatives?
>> Hbase?  or any suggestions on analyzing IIS log files?
>>
>> thanks!
>>
>>
>>
>>
>

Re: any suggestions on IIS log storage and analysis?

Posted by Fengyun RAO <ra...@gmail.com>.
Thanks, I understand now, but I don't think this is what we need. The IIS
log files are very big (e.g., several GB per file), so we need to split
them for parallel processing. However, this could be used as a kind of
preprocessing step, to transform the original log files into splittable
files such as Avro files.
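
For illustration, a minimal sketch of such a preprocessing step's output
using the plain Avro API. The three-field schema is only an assumption
for the example; a real schema would cover every column named by the
#Fields directives.

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Sketch: one flat record per parsed log line, written to a compressed
// Avro data file. Avro data files are block-based and splittable, so a
// later MapReduce job no longer depends on the "#Fields" directives.
public class AvroLogWriter {

    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogLine\",\"fields\":["
        + "{\"name\":\"date\",\"type\":\"string\"},"
        + "{\"name\":\"time\",\"type\":\"string\"},"
        + "{\"name\":\"csUriStem\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws IOException {
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(SCHEMA));
        writer.setCodec(CodecFactory.deflateCodec(6)); // compression
        writer.create(SCHEMA, new File("iis-logs.avro"));

        GenericRecord rec = new GenericData.Record(SCHEMA);
        rec.put("date", "2013-07-04");
        rec.put("time", "20:00:00");
        rec.put("csUriStem", "/gs.gif");
        writer.append(rec);

        writer.close();
    }
}

The resulting files could then be read back split-by-split with something
like Avro's AvroKeyInputFormat.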




2013/12/31 java8964 <ja...@hotmail.com>

> Google "Hadoop WholeFileInputFormat" or search it in book "Hadoop: The
> Definitive Guide<http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA242&lpg=PA242&dq=hadoop+definitive+guide+WholeFileInputFormat&source=bl&ots=i7BUTBU8Vw&sig=0m5effHuOY1kuqiRofqTbeEl7KU&hl=en&sa=X&ei=yijCUs_YLqHJsQSZ1oD4DQ&ved=0CD0Q6AEwAA>
> "
>
> Yong
>
>
> ------------------------------
> Date: Tue, 31 Dec 2013 09:39:58 +0800
> Subject: Re: any suggestions on IIS log storage and analysis?
>
> From: raofengyun@gmail.com
> To: user@hadoop.apache.org
>
> Thanks, Yong!
>
> The dependence never cross files, but since HDFS splits files into blocks,
> it may cross blocks, which makes it difficult to write MR job. I don't
> quite understand what you mean by "WholeFileInputFormat ". Actually, I
> have no idea how to deal with dependence across blocks.
>
>
> 2013/12/31 java8964 <ja...@hotmail.com>
>
> I don't know any example of IIS log files. But from what you described, it
> looks like analyzing one line of log data depends on some previous lines
> data. You should be more clear about what is this dependence and what you
> are trying to do.
>
> Just based on your questions, you still have different options, which one
> is better depends on your requirements and data.
>
> 1) You know the existing default TextInputFormat not suitable for your
> case, you just need to find alternatives, or write your own.
> 2) If the dependences never cross the files, just cross lines, you can use
> WholeFileInputFormat (No such class coming from Hadoop itself, but very
> easy implemented by yourself)
> 3) If the dependences cross the files, then you maybe have to enforce your
> business logics in reducer side, instead of mapper side. Without knowing
> your detail requirements of this dependence, it is hard to give you more
> detail, but you need to find out what are good KEY candidates for your
> dependence logic, send the data based on that to the reducers, and enforce
> your logic on the reducer sides. If one MR job is NOT enough to solve your
> dependence, you may need chain several MR jobs together.
>
> Yong
>
> ------------------------------
> Date: Mon, 30 Dec 2013 15:58:57 +0800
> Subject: any suggestions on IIS log storage and analysis?
> From: raofengyun@gmail.com
> To: user@hadoop.apache.org
>
>
> Hi,
>
> HDFS splits files into blocks, and mapreduce runs a map task for each
> block. However, Fields could be changed in IIS log files, which means
> fields in one block may depend on another, and thus make it not suitable
> for mapreduce job. It seems there should be some preprocess before storing
> and analyzing the IIS log files. We plan to parse each line to the same
> fields and store in Avro files with compression. Any other alternatives?
> Hbase?  or any suggestions on analyzing IIS log files?
>
> thanks!
>
>
>
>

RE: any suggestions on IIS log storage and analysis?

Posted by java8964 <ja...@hotmail.com>.
Google "Hadoop WholeFileInputFormat" or search it in book "Hadoop: The Definitive Guide"
Yong 
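
For readers without the book at hand, the idea is roughly the following (a
sketch against the org.apache.hadoop.mapreduce API, following the pattern
the book describes; not tested here): an InputFormat that refuses to split
its files, plus a RecordReader that hands the whole file to the mapper as a
single record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // a whole file always goes to a single mapper
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }
}

// Returns exactly one record: key = nothing, value = the file's bytes.
class WholeFileRecordReader
    extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit fileSplit;
  private Configuration conf;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false;
    }
    byte[] contents = new byte[(int) fileSplit.getLength()];
    Path file = fileSplit.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(file);
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  @Override
  public NullWritable getCurrentKey() { return NullWritable.get(); }

  @Override
  public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() { return processed ? 1.0f : 0.0f; }

  @Override
  public void close() {
    // nothing to do: the stream is closed in nextKeyValue()
  }
}

One caveat: this loads an entire file into memory, so it only makes sense
when individual files are modest in size; for multi-gigabyte IIS logs, a
preprocessing pass into splittable files, as discussed elsewhere in this
thread, scales better.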

Date: Tue, 31 Dec 2013 09:39:58 +0800
Subject: Re: any suggestions on IIS log storage and analysis?
From: raofengyun@gmail.com
To: user@hadoop.apache.org

Thanks, Yong!
The dependence never cross files, but since HDFS splits files into blocks, it may cross blocks, which makes it difficult to write MR job. I don't quite understand what you mean by "WholeFileInputFormat ". Actually, I have no idea how to deal with dependence across blocks.


2013/12/31 java8964 <ja...@hotmail.com>




I don't know any example of IIS log files. But from what you described, it looks like analyzing one line of log data depends on some previous lines data. You should be more clear about what is this dependence and what you are trying to do.

Just based on your questions, you still have different options, which one is better depends on your requirements and data.
1) You know the existing default TextInputFormat not suitable for your case, you just need to find alternatives, or write your own.
2) If the dependences never cross the files, just cross lines, you can use WholeFileInputFormat (No such class coming from Hadoop itself, but very easy implemented by yourself)
3) If the dependences cross the files, then you maybe have to enforce your business logics in reducer side, instead of mapper side. Without knowing your detail requirements of this dependence, it is hard to give you more detail, but you need to find out what are good KEY candidates for your dependence logic, send the data based on that to the reducers, and enforce your logic on the reducer sides. If one MR job is NOT enough to solve your dependence, you may need chain several MR jobs together.

Yong

Date: Mon, 30 Dec 2013 15:58:57 +0800
Subject: any suggestions on IIS log storage and analysis?
From: raofengyun@gmail.com

To: user@hadoop.apache.org

Hi,
HDFS splits files into blocks, and mapreduce runs a map task for each block. However, Fields could be changed in IIS log files, which means fields in one block may depend on another, and thus make it not suitable for mapreduce job. It seems there should be some preprocess before storing and analyzing the IIS log files. We plan to parse each line to the same fields and store in Avro files with compression. Any other alternatives? Hbase?  or any suggestions on analyzing IIS log files?


thanks!

Re: any suggestions on IIS log storage and analysis?

Posted by Fengyun RAO <ra...@gmail.com>.
Thanks, Yong!

The dependence never crosses files, but since HDFS splits files into blocks,
it may cross blocks, which makes it difficult to write an MR job. I don't
quite understand what you mean by "WholeFileInputFormat". Actually, I have
no idea how to deal with a dependence that crosses blocks.


2013/12/31 java8964 <ja...@hotmail.com>

> I don't know any example of IIS log files. But from what you described, it
> looks like analyzing one line of log data depends on some previous lines
> data. You should be more clear about what is this dependence and what you
> are trying to do.
>
> Just based on your questions, you still have different options, which one
> is better depends on your requirements and data.
>
> 1) You know the existing default TextInputFormat not suitable for your
> case, you just need to find alternatives, or write your own.
> 2) If the dependences never cross the files, just cross lines, you can use
> WholeFileInputFormat (No such class coming from Hadoop itself, but very
> easy implemented by yourself)
> 3) If the dependences cross the files, then you maybe have to enforce your
> business logics in reducer side, instead of mapper side. Without knowing
> your detail requirements of this dependence, it is hard to give you more
> detail, but you need to find out what are good KEY candidates for your
> dependence logic, send the data based on that to the reducers, and enforce
> your logic on the reducer sides. If one MR job is NOT enough to solve your
> dependence, you may need chain several MR jobs together.
>
> Yong
>
> ------------------------------
> Date: Mon, 30 Dec 2013 15:58:57 +0800
> Subject: any suggestions on IIS log storage and analysis?
> From: raofengyun@gmail.com
> To: user@hadoop.apache.org
>
>
> Hi,
>
> HDFS splits files into blocks, and mapreduce runs a map task for each
> block. However, Fields could be changed in IIS log files, which means
> fields in one block may depend on another, and thus make it not suitable
> for mapreduce job. It seems there should be some preprocess before storing
> and analyzing the IIS log files. We plan to parse each line to the same
> fields and store in Avro files with compression. Any other alternatives?
> Hbase?  or any suggestions on analyzing IIS log files?
>
> thanks!
>
>
>

RE: any suggestions on IIS log storage and analysis?

Posted by java8964 <ja...@hotmail.com>.
I don't know any example of IIS log files. But from what you described, it looks like analyzing one line of log data depends on some previous lines' data. You should be clearer about what this dependence is and what you are trying to do.
Just based on your questions, you still have different options; which one is better depends on your requirements and data.
1) You know the existing default TextInputFormat is not suitable for your case, so you need to find an alternative, or write your own.
2) If the dependences never cross files, just lines, you can use a WholeFileInputFormat (no such class comes with Hadoop itself, but it is very easy to implement yourself); a sketch of a mapper built on this idea follows below.
3) If the dependences cross files, then you may have to enforce your business logic on the reducer side instead of the mapper side. Without knowing the detailed requirements of this dependence, it is hard to give more detail, but you need to work out good KEY candidates for your dependence logic, send the data to the reducers keyed on that, and enforce your logic on the reducer side. If one MR job is not enough to resolve the dependence, you may need to chain several MR jobs together.
Yong
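
To make option 2) concrete for the IIS case, here is a hypothetical mapper
that consumes one whole file per call (wired up with
job.setInputFormatClass(WholeFileInputFormat.class), as sketched earlier on
this page), tracks the latest #Fields directive, and parses each data line
against it; the per-client-IP output is purely an illustrative choice:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map() call sees one complete IIS log file, so a #Fields directive
// and every line it governs are guaranteed to arrive together.
public class IisLogMapper
    extends Mapper<NullWritable, BytesWritable, Text, Text> {

  @Override
  protected void map(NullWritable key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    BufferedReader reader = new BufferedReader(new StringReader(
        new String(value.copyBytes(), StandardCharsets.UTF_8)));
    String[] fields = null;  // column layout from the latest #Fields line
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#Fields:")) {
        fields = line.substring("#Fields:".length()).trim().split(" ");
        continue;
      }
      if (line.startsWith("#") || fields == null) {
        continue;  // skip other directives and lines before any #Fields
      }
      String[] values = line.split(" ");
      for (int i = 0; i < fields.length && i < values.length; i++) {
        if ("c-ip".equals(fields[i])) {
          // group full request lines by client IP for downstream analysis
          context.write(new Text(values[i]), new Text(line));
        }
      }
    }
  }
}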

Date: Mon, 30 Dec 2013 15:58:57 +0800
Subject: any suggestions on IIS log storage and analysis?
From: raofengyun@gmail.com
To: user@hadoop.apache.org

Hi,
HDFS splits files into blocks, and mapreduce runs a map task for each block. However, Fields could be changed in IIS log files, which means fields in one block may depend on another, and thus make it not suitable for mapreduce job. It seems there should be some preprocess before storing and analyzing the IIS log files. We plan to parse each line to the same fields and store in Avro files with compression. Any other alternatives? Hbase?  or any suggestions on analyzing IIS log files?

thanks!

Re: any suggestions on IIS log storage and analysis?

Posted by Azuryy Yu <az...@gmail.com>.
You can run a MapReduce job first to join these data sets into one data set,
then analyze the joined data set.


On Mon, Dec 30, 2013 at 3:58 PM, Fengyun RAO <ra...@gmail.com> wrote:

> Hi,
>
> HDFS splits files into blocks, and mapreduce runs a map task for each
> block. However, Fields could be changed in IIS log files, which means
> fields in one block may depend on another, and thus make it not suitable
> for mapreduce job. It seems there should be some preprocess before storing
> and analyzing the IIS log files. We plan to parse each line to the same
> fields and store in Avro files with compression. Any other alternatives?
> Hbase?  or any suggestions on analyzing IIS log files?
>
> thanks!
>
>
>
